From wilde at mcs.anl.gov Mon Oct 8 09:40:12 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 8 Oct 2012 09:40:12 -0500
Subject: [Swift-devel] CFP: 2nd IEEE International Workshop on Workflow Models, Systems, Services and Applications in the Cloud (CloudFlow) 2013
In-Reply-To:
References:
Message-ID:

*Second IEEE International Workshop on Workflow Models, Systems, Services and Applications in the Cloud (CloudFlow) 2013*

*To be held in conjunction with the 27th IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2013, Cambridge, Boston, Massachusetts, USA, May 20-24, 2013.*

http://www.cloud-uestc.cn/cloudflow/home.html

*Overview*

Cloud computing is gaining tremendous momentum in both academia and industry, and more and more people are migrating their data and applications into the Cloud. We have observed wide adoption of the MapReduce computing model and the open-source Hadoop system for large-scale distributed data processing, and a variety of ad hoc mashup techniques that weave together Web applications. However, these are only first steps toward managing complex task and data dependencies in the Cloud; more challenging issues remain, such as large parameter-space exploration, data partitioning and distribution, scheduling and optimization, smart reruns, and provenance tracking associated with workflow execution. The Cloud needs structured and mature workflow technologies to handle such issues, and vice versa: the Cloud offers unprecedented scalability to workflow systems and could change the way we perceive and conduct research and experiments. The scale and complexity of the science and data-analytics problems that can be handled can be greatly increased in the Cloud, and the on-demand nature of resource allocation in the Cloud will also help improve resource utilization and the user experience.

As Cloud computing provides a paradigm-shifting, utility-oriented computing model, with an unprecedented datacenter-scale resource pool and an on-demand resource provisioning mechanism, there are many challenges in bringing Cloud and workflows together. We need high-level languages and computing models for large-scale workflow specification; we need to adapt existing workflow architectures to the Cloud and integrate workflow systems with Cloud infrastructure and resources; and we need to leverage Cloud data-storage technologies to efficiently distribute data over a large number of nodes and exploit data locality during computation. We organize the CloudFlow workshop as a venue for the workflow and Cloud communities to define models and paradigms, present their state-of-the-art work, share their thoughts and experiences, and explore new directions in realizing workflows in the Cloud.

*Topics:*

We welcome the submission of original work related to the topics listed below, which include (in the context of the Cloud):

- Models and Languages for Large Scale Workflow Specification
- Workflow Architecture and Framework
- Large Scale Workflow Systems
- Service Workflow
- Workflow Composition and Orchestration
- Workflow Migration into the Cloud
- Workflow Scheduling and Optimization
- Cloud Middleware in Support of Workflow
- Virtualized Environment
- Workflow Applications and Case Studies
- Performance and Scalability Analysis
- Peta-Scale Data Processing
- Event Processing and Messaging
- Real-Time Analytics
- Provenance

*Paper Submission*

Authors are invited to submit papers with unpublished, original work.
The papers should not exceed 10 single-spaced, double-column pages using a 10-point font on 8.5x11-inch pages (IEEE conference style), including figures, tables, and references. Papers should be submitted via the online CMT system, Microsoft's Academic Conference Management Service (https://cmt.research.microsoft.com/CF2013), by midnight January 9th, 2013, Pacific Time. The final format should be PDF. Proceedings of the workshop will be published by the IEEE Digital Library (indexed by EI) and distributed at the conference. Selected papers may be eligible for additional post-conference publication as journal articles or book chapters. Submission implies the willingness of at least one of the authors to register and present the paper.

*Important Dates*

Paper submission: January 9th, 2013
Acceptance notification: February 8th, 2013
Final paper due: February 19th, 2013

*Organization*

Workshop Chairs:
Dr. Yong Zhao, University of Electronic Science and Technology of China, China, yongzh04 at gmail.com
Dr. Cui Lin, California State University, Fresno, USA, clin at csufresno.edu
Dr. Shiyong Lu, Wayne State University, USA, shiyong at wayne.edu

Program Chair:
Dr. Wenhong Tian, University of Electronic Science and Technology of China, China

Publicity Chair:
Dr. Ruini Xue, University of Electronic Science and Technology of China, China

*Steering Committee*

- Daniel S. Katz, University of Chicago, U.S.A.
- Mike Wilde, University of Chicago, U.S.A.
- Ewa Deelman, University of Southern California, U.S.A.
- Tevfik Kosar, University at Buffalo, U.S.A.
- Ilkay Altintas, San Diego Supercomputer Center, U.S.A.
- Ioan Raicu, Illinois Institute of Technology, U.S.A.
- Yogesh Simmhan, University of Southern California, U.S.A.
- Ian Taylor, Cardiff University, U.K.
- Weimin Zheng, Tsinghua University, China
- Hai Jin, Huazhong University of Science and Technology, China
- Wanchun Dou, Nanjing University, China
- Hui Zhang, National Science and Technology Infrastructure, China

*Program Committee*

- Shawn Bowers, Gonzaga University, U.S.A.
- Douglas Thain, University of Notre Dame, U.S.A.
- Ian Gorton, Pacific Northwest National Laboratory, U.S.A.
- Artem Chebotko, University of Texas-Pan American, U.S.A.
- Weisong Shi, Wayne State University, U.S.A.
- Paolo Missier, Newcastle University, U.K.
- Wei Tan, IBM T. J. Watson Research Center, U.S.A.
- Jianwu Wang, San Diego Supercomputer Center, U.S.A.
- Ping Yang, Binghamton University, U.S.A.
- Jian Guo, Harvard University, U.S.A.
- Liqiang Wang, University of Wyoming, U.S.A.
- Paul Groth, VU University Amsterdam, the Netherlands
- Zhiming Zhao, University of Amsterdam, the Netherlands
- Marta Mattoso, Federal University of Rio de Janeiro, Brazil
- Wenhong Tian, University of Electronic Science and Technology of China, China
- Ruini Xue, Tsinghua University, China
- Jian Cao, Shanghai Jiaotong University, China
- Jianxun Liu, Hunan University of Science and Technology, China
- Song Zhang, Chinese Academy of Sciences, China
- Hua Hu, Hangzhou Dianzi University, China

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From hategan at mcs.anl.gov Sun Oct 14 22:48:54 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 14 Oct 2012 20:48:54 -0700 Subject: [Swift-devel] [swift-support] Channel multiplexer error In-Reply-To: References: <73B3B09E-B386-4B6D-A9D6-58EEB00B747C@uchicago.edu> <931001436.34144.1349713641059.JavaMail.root@zimbra.anl.gov> <1350000735.8355.0.camel@blabla> Message-ID: <1350272934.15993.5.camel@blabla> I spoke to Mike on the phone on Friday, and we agreed that foreach.max.threads is a bit difficult to use. So I removed the throttling for foreach and added it to app invocations. It might help with memory consumption if you set it to around the number of cpus you have access to. I also did some small optimizations to improve memory use. I can run about 50K jobs with your script on a 32 bit system with 1GB of heap space. I suspect that on 64 bit systems this might require more heap. I also added an option to the swift executable to automatically dump a copy of the heap when an out of memory condition occurs. Hopefully that will help us troubleshoot such problems in the future. This is all in trunk. Mihael On Sat, 2012-10-13 at 04:47 -0500, Kazutaka Takahashi wrote: > Hi All, > > Sorry for a late reply, but I am already at a conference and had not > had a chance to try what Mike proposed. I will try later during the > second half of the conference, if not after the conference ending next > wed. > Taka > > > On Thu, Oct 11, 2012 at 7:12 PM, Mihael Hategan > wrote: > Did you try what Mike proposed? > > Mihael > > On Thu, 2012-10-11 at 17:49 -0500, Kazutaka Takahashi wrote: > > OK... > > The last one died with the same error msg... Please check > the both > > directories... > > > > > taka at login4:/lustre/beagle/GCNet/RG/Athena/a080521_BS_new/Cond2> > > > taka at login4:/lustre/beagle/GCNet/RG/Athena/a080521_BS_new/Cond3> > > > > /lustre/beagle/GCNet/bin/Swift/swift: line 164: 8527 Killed > > java -Xmx5120M > -Djava.endorsed.dirs=/soft/swift/0.93//lib/endorsed > > -DUID=3023 -DGLOBUS_HOSTNAME=login4.beagle.ci.uchicago.edu > > -DCOG_INSTALL_PATH=/soft/swift/0.93/ > -Dswift.home=/soft/swift/0.93/ > > -Duser.home=/lustre/beagle/GCNet > > -Djava.security.egd=file:///dev/urandom -XX:+UseParallelGC > > -XX:ParallelGCThreads=1 > > > -classpath 
/soft/swift/0.93//etc:/soft/swift/0.93//libexec:/soft/swift/0.93//lib/addressing-1.0.jar:/soft/swift/0.93//lib/ant.jar:/soft/swift/0.93//lib/antlr-2.7.5.jar:/soft/swift/0.93//lib/axis.jar:/soft/swift/0.93//lib/axis-url.jar:/soft/swift/0.93//lib/castor-0.9.6.jar:/soft/swift/0.93//lib/coaster-bootstrap.jar:/soft/swift/0.93//lib/cog-abstraction-common-2.4.jar:/soft/swift/0.93//lib/cog-axis.jar:/soft/swift/0.93//lib/cog-grapheditor-0.47.jar:/soft/swift/0.93//lib/cog-jglobus-1.7.0.jar:/soft/swift/0.93//lib/cog-karajan-0.36-dev.jar:/soft/swift/0.93//lib/cog-provider-clref-gt4_0_0.jar:/soft/swift/0.93//lib/cog-provider-coaster-0.3.jar:/soft/swift/0.93//lib/cog-provider-dcache-0.1.jar:/soft/swift/0.93//lib/cog-provider-gt2-2.4.jar:/soft/swift/0.93//lib/cog-provider-gt4_0_0-2.5.jar:/soft/swift/0.93//lib/cog-provider-local-2.2.jar:/soft/swift/0.93//lib/cog-provider-localscheduler-0.4.jar:/soft/swift/0.93//lib/cog-provider-ssh-2.4.jar:/soft/swift/0.93//lib/cog-provider-webdav-2.1.jar:/soft/swift/0.93//lib/cog-resources-1.0.jar:/soft/swift/0.93//lib/cog-swift-svn.jar:/soft/swift/0.93//lib/cog-trap-1.0.jar:/soft/swift/0.93//lib/cog-url.jar:/soft/swift/0.93//lib/cog-util-0.92.jar:/soft/swift/0.93//lib/commonj.jar:/soft/swift/0.93//lib/commons-beanutils.jar:/soft/swift/0.93//lib/commons-collections-3.0.jar:/soft/swift/0.93//lib/commons-digester.jar:/soft/swift/0.93//lib/commons-discovery.jar:/soft/swift/0.93//lib/commons-httpclient.jar:/soft/swift/0.93//lib/commons-logging-1.1.jar:/soft/swift/0.93//lib/concurrent.jar:/soft/swift/0.93//lib/cryptix32.jar:/soft/swift/0.93//lib/cryptix-asn1.jar:/soft/swift/0.93//lib/cryptix.jar:/soft/swift/0.93//lib/globus_delegation_service.jar:/soft/swift/0.93//lib/globus_delegation_stubs.jar:/soft/swift/0.93//lib/globus_wsrf_mds_aggregator_stubs.jar:/soft/swift/0.93//lib/globus_wsrf_rendezvous_service.jar:/soft/swift/0.93//lib/globus_wsrf_rendezvous_stubs.jar:/soft/swift/0.93//lib/globus_wsrf_rft_stubs.jar:/soft/swift/0.93//lib/gram-client.jar:/soft/swift/0.93//lib/gram-stubs.jar:/soft/swift/0.93//lib/gram-utils.jar:/soft/swift/0.93//lib/j2ssh-common-0.2.2.jar:/soft/swift/0.93//lib/j2ssh-core-0.2.2-patch-b.jar:/soft/swift/0.93//lib/jakarta-regexp-1.2.jar:/soft/swift/0.93//lib/jakarta-slide-webdavlib-2.0.jar:/soft/swift/0.93//lib/jaxrpc.jar:/soft/swift/0.93//lib/jce-jdk13-131.jar:/soft/swift/0.93//lib/jgss.jar:/soft/swift/0.93//lib/jline-0.9.94.jar:/soft/swift/0.93//lib/jsr173_1.0_api.jar:/soft/swift/0.93//lib/jug-lgpl-2.0.0.jar:/soft/swift/0.93//lib/junit.jar:/soft/swift/0.93//lib/log4j-1.2.16.jar:/soft/swift/0.93//lib/naming-common.jar:/soft/swift/0.93//lib/naming-factory.jar:/soft/swift/0.93//lib/naming-java.jar:/soft/swift/0.93//lib/naming-resources.jar:/soft/swift/0.93//lib/opensaml.jar:/soft/swift/0.93//lib/puretls.jar:/soft/swift/0.93//lib/resolver.jar:/soft/swift/0.93//lib/saaj.jar:/soft/swift/0.93//lib/stringtemplate.jar:/soft/swift/0.93//lib/vdldefinitions.jar:/soft/swift/0.93//lib/wsdl4j.jar:/soft/swift/0.93//lib/wsrf_core.jar:/soft/swift/0.93//lib/wsrf_core_stubs.jar:/soft/swift/0.93//lib/wsrf_mds_index_stubs.jar:/soft/swift/0.93//lib/wsrf_mds_usefulrp_schema_stubs.jar:/soft/swift/0.93//lib/wsrf_provider_jce.jar:/soft/swift/0.93//lib/wsrf_tools.jar:/soft/swift/0.93//lib/wss4j.jar:/soft/swift/0.93//lib/xalan.jar:/soft/swift/0.93//lib/xbean.jar:/soft/swift/0.93//lib/xbean_xpath.jar:/soft/swift/0.93//lib/xercesImpl.jar:/soft/swift/0.93//lib/xml-apis.jar:/soft/swift/0.93//lib/xmlsec.jar:/soft/swift/0.93//lib/xpp3-1.1.3.4d_b4_min.jar:/soft/swift/0.93//lib/xstream-1.1
.1-patched.jar: org.griphyn.vdl.karajan.Loader '-config' 'demo_realcf.cf' '-sites.file' 'demo_realSites.xml' '-tc.file' 'demo_realtc.tc' 'demo_real.swift' > > > > > > > > > > > -- > What is essential is invisible to the eye > From davidk at ci.uchicago.edu Tue Oct 16 11:04:43 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Tue, 16 Oct 2012 11:04:43 -0500 (CDT) Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <2090783833.125077.1350402510973.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <555650193.125244.1350403483404.JavaMail.root@zimbra-mb2.anl.gov> Hello, I have noticed that since the foreach.max.threads changes, the DSSAT script is now running out of memory. I have the heap size set to 4 gigabytes. There are 120K items in gridLists. The main foreach loop of the script looks like this: foreach g,i in gridLists { file tar_output ; file part_output ; file in1[] ; // Scenario files file in2[] ; // Weather files file in3[] ; // Common data file in4[] ; // Binaries file in5[] ; // Perl scripts file wrapper ; // RunDSSAT wrapper (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, in4, in5, wrapper); } Is there any way to throttle foreach again, or any other workarounds I could use to avoid this? Thanks, David From hategan at mcs.anl.gov Tue Oct 16 13:44:47 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Oct 2012 11:44:47 -0700 Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <555650193.125244.1350403483404.JavaMail.root@zimbra-mb2.anl.gov> References: <555650193.125244.1350403483404.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1350413087.31891.0.camel@blabla> What was foreach.max.threads set to before? On Tue, 2012-10-16 at 11:04 -0500, David Kelly wrote: > Hello, > > I have noticed that since the foreach.max.threads changes, the DSSAT script is now running out of memory. I have the heap size set to 4 gigabytes. There are 120K items in gridLists. > The main foreach loop of the script looks like this: > > foreach g,i in gridLists { > file tar_output ; > file part_output ; > > file in1[] ; // Scenario files > file in2[] ; // Weather files > file in3[] ; // Common data > file in4[] ; // Binaries > file in5[] ; // Perl scripts > file wrapper ; // RunDSSAT wrapper > > (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, in4, in5, wrapper); > } > > Is there any way to throttle foreach again, or any other workarounds I could use to avoid this? > > Thanks, > David > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From davidk at ci.uchicago.edu Tue Oct 16 13:56:53 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Tue, 16 Oct 2012 13:56:53 -0500 (CDT) Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <1350413087.31891.0.camel@blabla> Message-ID: <433307766.126520.1350413813202.JavaMail.root@zimbra-mb2.anl.gov> Previously it was not explicitly set, so I am assuming it would have been 1024. As a test I tried setting it to 520 (the maximum number of available cores), but that did not seem to help. ----- Original Message ----- > From: "Mihael Hategan" > To: "David Kelly" > Cc: "swift-devel Devel" > Sent: Tuesday, October 16, 2012 1:44:47 PM > Subject: Re: [Swift-devel] foreach.max.threads question > What was foreach.max.threads set to before? 
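
In David's loop above, the archive's HTML-to-text conversion appears to have eaten the angle-bracketed mapper expressions; only fragments such as file=@strcat(...) and pattern="*.EXE" survive in the quoted copies later in the thread. For readability, here is a hedged reconstruction of what the loop plausibly looked like. The mapper names (single_file_mapper, filesys_mapper), the app block, the readData call, and the argument and file names common/bindir/perldir/dssat.xfile/RunDSSAT.sh are guesses introduced for illustration, not taken from the original post.

    type file;

    // 'RunDSSAT' must be defined in tc.file; the real command line is not shown in the post.
    app (file tarOut, file partOut) RunDSSAT (file x, file scen[], file wth[],
                                              file common[], file bin[], file scripts[], file wrap) {
        RunDSSAT @filename(wrap) @x;
    }

    // How gridLists is populated is not shown in the post; this is a guess.
    string gridLists[] = readData(@arg("gridlists"));
    file xfile <"dssat.xfile">;

    foreach g, i in gridLists {
        file tar_output  <single_file_mapper; file=@strcat("output/", gridLists[i], "output.tar.gz")>;
        file part_output <single_file_mapper; file=@strcat("parts/", gridLists[i], ".part")>;

        file in1[] <filesys_mapper; location=@strcat(@arg("scenarios"), "/", gridLists[i]), pattern="*">;  // Scenario files
        file in2[] <filesys_mapper; location=@strcat(@arg("weather"), "/", gridLists[i]), pattern="*">;    // Weather files
        file in3[] <filesys_mapper; location=@arg("common"),  pattern="*">;      // Common data (location assumed)
        file in4[] <filesys_mapper; location=@arg("bindir"),  pattern="*.EXE">;  // Binaries (location assumed)
        file in5[] <filesys_mapper; location=@arg("perldir"), pattern="*.pl">;   // Perl scripts (location assumed)
        file wrapper <"RunDSSAT.sh">;                                            // RunDSSAT wrapper (name assumed)

        (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, in4, in5, wrapper);
    }

The point relevant to the memory discussion is that every iteration declares several mapped files and arrays, so without a foreach throttle all 120K iterations build their mappings up front, which is the behavior David and Mihael confirm later in the thread.
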
> > On Tue, 2012-10-16 at 11:04 -0500, David Kelly wrote: > > Hello, > > > > I have noticed that since the foreach.max.threads changes, the DSSAT > > script is now running out of memory. I have the heap size set to 4 > > gigabytes. There are 120K items in gridLists. > > The main foreach loop of the script looks like this: > > > > foreach g,i in gridLists { > > file tar_output > gridLists[i], "output.tar.gz")>; > > file part_output > gridLists[i], ".part")>; > > > > file in1[] > "/", gridLists[i]), pattern="*">; // Scenario files > > file in2[] > "/", gridLists[i]), pattern="*">; // Weather files > > file in3[] > pattern="*">; // Common data > > file in4[] > pattern="*.EXE">; // Binaries > > file in5[] > pattern="*.pl">; // Perl scripts > > file wrapper ; // > > RunDSSAT wrapper > > > > (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, in4, > > in5, wrapper); > > } > > > > Is there any way to throttle foreach again, or any other workarounds > > I could use to avoid this? > > > > Thanks, > > David > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Oct 16 13:56:56 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 16 Oct 2012 13:56:56 -0500 (CDT) Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <1350413087.31891.0.camel@blabla> Message-ID: <1014451256.48685.1350413816558.JavaMail.root@zimbra.anl.gov> Mihael, I though that when we discussed this last FRiday, the plan was to add the app() throttle but also leave the foreach throttle in place, just in case we needed it. Was there a reason that you needed to remove the foreach throttle to do the app throttle? - Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "David Kelly" > Cc: "swift-devel Devel" > Sent: Tuesday, October 16, 2012 1:44:47 PM > Subject: Re: [Swift-devel] foreach.max.threads question > What was foreach.max.threads set to before? > > On Tue, 2012-10-16 at 11:04 -0500, David Kelly wrote: > > Hello, > > > > I have noticed that since the foreach.max.threads changes, the DSSAT > > script is now running out of memory. I have the heap size set to 4 > > gigabytes. There are 120K items in gridLists. > > The main foreach loop of the script looks like this: > > > > foreach g,i in gridLists { > > file tar_output > gridLists[i], "output.tar.gz")>; > > file part_output > gridLists[i], ".part")>; > > > > file in1[] > "/", gridLists[i]), pattern="*">; // Scenario files > > file in2[] > "/", gridLists[i]), pattern="*">; // Weather files > > file in3[] > pattern="*">; // Common data > > file in4[] > pattern="*.EXE">; // Binaries > > file in5[] > pattern="*.pl">; // Perl scripts > > file wrapper ; // > > RunDSSAT wrapper > > > > (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, in4, > > in5, wrapper); > > } > > > > Is there any way to throttle foreach again, or any other workarounds > > I could use to avoid this? 
> > > > Thanks, > > David > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Tue Oct 16 14:17:06 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Oct 2012 12:17:06 -0700 Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <433307766.126520.1350413813202.JavaMail.root@zimbra-mb2.anl.gov> References: <433307766.126520.1350413813202.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1350415026.32618.0.camel@blabla> How many cores do you run this on? On Tue, 2012-10-16 at 13:56 -0500, David Kelly wrote: > Previously it was not explicitly set, so I am assuming it would have been 1024. As a test I tried setting it to 520 (the maximum number of available cores), but that did not seem to help. > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "David Kelly" > > Cc: "swift-devel Devel" > > Sent: Tuesday, October 16, 2012 1:44:47 PM > > Subject: Re: [Swift-devel] foreach.max.threads question > > What was foreach.max.threads set to before? > > > > On Tue, 2012-10-16 at 11:04 -0500, David Kelly wrote: > > > Hello, > > > > > > I have noticed that since the foreach.max.threads changes, the DSSAT > > > script is now running out of memory. I have the heap size set to 4 > > > gigabytes. There are 120K items in gridLists. > > > The main foreach loop of the script looks like this: > > > > > > foreach g,i in gridLists { > > > file tar_output > > gridLists[i], "output.tar.gz")>; > > > file part_output > > gridLists[i], ".part")>; > > > > > > file in1[] > > "/", gridLists[i]), pattern="*">; // Scenario files > > > file in2[] > > "/", gridLists[i]), pattern="*">; // Weather files > > > file in3[] > > pattern="*">; // Common data > > > file in4[] > > pattern="*.EXE">; // Binaries > > > file in5[] > > pattern="*.pl">; // Perl scripts > > > file wrapper ; // > > > RunDSSAT wrapper > > > > > > (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, in4, > > > in5, wrapper); > > > } > > > > > > Is there any way to throttle foreach again, or any other workarounds > > > I could use to avoid this? > > > > > > Thanks, > > > David > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Oct 16 14:19:23 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Oct 2012 12:19:23 -0700 Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <1014451256.48685.1350413816558.JavaMail.root@zimbra.anl.gov> References: <1014451256.48685.1350413816558.JavaMail.root@zimbra.anl.gov> Message-ID: <1350415163.32618.2.camel@blabla> On Tue, 2012-10-16 at 13:56 -0500, Michael Wilde wrote: > Mihael, I though that when we discussed this last FRiday, the plan was to add the app() throttle but also leave the foreach throttle in place, just in case we needed it. Was there a reason that you needed to remove the foreach throttle to do the app throttle? Too much complexity. I can add it back if needed, but I'd rather fix the problem here. 
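
For readers trying to reproduce the throttle settings discussed in this thread, this is a minimal sketch of how foreach.max.threads is set, assuming the swift.properties mechanism of this era honors the property in the revision being used. The value 520 is the core count David mentions below; Mihael notes later in the thread that the default at the time was 16384. The property name for the new app-invocation throttle Mihael describes above is not given in the thread, so it is not shown here.

    # swift.properties (excerpt); a sketch, not a verified configuration
    foreach.max.threads=520
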
From davidk at ci.uchicago.edu Tue Oct 16 14:25:25 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Tue, 16 Oct 2012 14:25:25 -0500 (CDT) Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <1350415026.32618.0.camel@blabla> Message-ID: <270799010.126716.1350415525047.JavaMail.root@zimbra-mb2.anl.gov> Swift and coaster-service are running on communicado/bridled and the work is being done on UC3 via persistent coasters. UC3 has 520 cores, 1 core per node. In most cases I will get all 520 cores. ----- Original Message ----- > From: "Mihael Hategan" > To: "David Kelly" > Cc: "swift-devel Devel" > Sent: Tuesday, October 16, 2012 2:17:06 PM > Subject: Re: [Swift-devel] foreach.max.threads question > How many cores do you run this on? > > On Tue, 2012-10-16 at 13:56 -0500, David Kelly wrote: > > Previously it was not explicitly set, so I am assuming it would have > > been 1024. As a test I tried setting it to 520 (the maximum number > > of available cores), but that did not seem to help. > > > > ----- Original Message ----- > > > From: "Mihael Hategan" > > > To: "David Kelly" > > > Cc: "swift-devel Devel" > > > Sent: Tuesday, October 16, 2012 1:44:47 PM > > > Subject: Re: [Swift-devel] foreach.max.threads question > > > What was foreach.max.threads set to before? > > > > > > On Tue, 2012-10-16 at 11:04 -0500, David Kelly wrote: > > > > Hello, > > > > > > > > I have noticed that since the foreach.max.threads changes, the > > > > DSSAT > > > > script is now running out of memory. I have the heap size set to > > > > 4 > > > > gigabytes. There are 120K items in gridLists. > > > > The main foreach loop of the script looks like this: > > > > > > > > foreach g,i in gridLists { > > > > file tar_output > > > gridLists[i], "output.tar.gz")>; > > > > file part_output > > > gridLists[i], ".part")>; > > > > > > > > file in1[] > > > location=@strcat(@arg("scenarios"), > > > > "/", gridLists[i]), pattern="*">; // Scenario files > > > > file in2[] > > > "/", gridLists[i]), pattern="*">; // Weather files > > > > file in3[] > > > pattern="*">; // Common data > > > > file in4[] > > > pattern="*.EXE">; // Binaries > > > > file in5[] > > > pattern="*.pl">; // Perl scripts > > > > file wrapper ; // > > > > RunDSSAT wrapper > > > > > > > > (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, > > > > in4, > > > > in5, wrapper); > > > > } > > > > > > > > Is there any way to throttle foreach again, or any other > > > > workarounds > > > > I could use to avoid this? > > > > > > > > Thanks, > > > > David > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Oct 16 14:44:59 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Oct 2012 12:44:59 -0700 Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <270799010.126716.1350415525047.JavaMail.root@zimbra-mb2.anl.gov> References: <270799010.126716.1350415525047.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1350416699.32618.3.camel@blabla> The default is 16384. Can you set foreach.max.threads to 520 and try again? On Tue, 2012-10-16 at 14:25 -0500, David Kelly wrote: > Swift and coaster-service are running on communicado/bridled and the work is being done on UC3 via persistent coasters. UC3 has 520 cores, 1 core per node. In most cases I will get all 520 cores. 
> > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "David Kelly" > > Cc: "swift-devel Devel" > > Sent: Tuesday, October 16, 2012 2:17:06 PM > > Subject: Re: [Swift-devel] foreach.max.threads question > > How many cores do you run this on? > > > > On Tue, 2012-10-16 at 13:56 -0500, David Kelly wrote: > > > Previously it was not explicitly set, so I am assuming it would have > > > been 1024. As a test I tried setting it to 520 (the maximum number > > > of available cores), but that did not seem to help. > > > > > > ----- Original Message ----- > > > > From: "Mihael Hategan" > > > > To: "David Kelly" > > > > Cc: "swift-devel Devel" > > > > Sent: Tuesday, October 16, 2012 1:44:47 PM > > > > Subject: Re: [Swift-devel] foreach.max.threads question > > > > What was foreach.max.threads set to before? > > > > > > > > On Tue, 2012-10-16 at 11:04 -0500, David Kelly wrote: > > > > > Hello, > > > > > > > > > > I have noticed that since the foreach.max.threads changes, the > > > > > DSSAT > > > > > script is now running out of memory. I have the heap size set to > > > > > 4 > > > > > gigabytes. There are 120K items in gridLists. > > > > > The main foreach loop of the script looks like this: > > > > > > > > > > foreach g,i in gridLists { > > > > > file tar_output > > > > gridLists[i], "output.tar.gz")>; > > > > > file part_output > > > > gridLists[i], ".part")>; > > > > > > > > > > file in1[] > > > > location=@strcat(@arg("scenarios"), > > > > > "/", gridLists[i]), pattern="*">; // Scenario files > > > > > file in2[] > > > > "/", gridLists[i]), pattern="*">; // Weather files > > > > > file in3[] > > > > pattern="*">; // Common data > > > > > file in4[] > > > > pattern="*.EXE">; // Binaries > > > > > file in5[] > > > > pattern="*.pl">; // Perl scripts > > > > > file wrapper ; // > > > > > RunDSSAT wrapper > > > > > > > > > > (tar_output, part_output) = RunDSSAT(xfile, in1, in2, in3, > > > > > in4, > > > > > in5, wrapper); > > > > > } > > > > > > > > > > Is there any way to throttle foreach again, or any other > > > > > workarounds > > > > > I could use to avoid this? > > > > > > > > > > Thanks, > > > > > David > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Oct 16 14:45:24 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 16 Oct 2012 14:45:24 -0500 (CDT) Subject: [Swift-devel] Fixing Cobalt provider to allow multi-node coaster jobs on Eureka In-Reply-To: <1774919427.48822.1350416438502.JavaMail.root@zimbra.anl.gov> Message-ID: <968451630.48843.1350416724538.JavaMail.root@zimbra.anl.gov> Mihael, Is this a bug that you could fix while waiting for info from David on the DSSAT problems? https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=746 I ask because ParVis people are starting a large set of campaigns on Eureka. At the moment, the cluster is idle, but they can only use 32 of 100 nodes because of the 1-node-per-job limitation of the Cobalt provider. You (or we) can test on Gadzooks if you can add in the multi-node-job code. 
Thanks, - Mike From davidk at ci.uchicago.edu Tue Oct 16 16:19:40 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Tue, 16 Oct 2012 16:19:40 -0500 (CDT) Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <1350416699.32618.3.camel@blabla> Message-ID: <1996203325.127536.1350422380287.JavaMail.root@zimbra-mb2.anl.gov> I set the max threads to 520 and saw the same behavior. The only difference is that I had to do the work on OSG rather than UC3 since all nodes were in use. Is it possible that since there is no foreach throttle, all the files and file arrays needed for each of the 120K tasks are getting set at once? All of the log files, configuration files, and a heap dump are at /scratch/local/davidk/DSSAT/run067 on communicado. Thanks, David ----- Original Message ----- > From: "Mihael Hategan" > To: "David Kelly" > Cc: "swift-devel Devel" > Sent: Tuesday, October 16, 2012 2:44:59 PM > Subject: Re: [Swift-devel] foreach.max.threads question > The default is 16384. Can you set foreach.max.threads to 520 and try > again? > > On Tue, 2012-10-16 at 14:25 -0500, David Kelly wrote: > > Swift and coaster-service are running on communicado/bridled and the > > work is being done on UC3 via persistent coasters. UC3 has 520 > > cores, 1 core per node. In most cases I will get all 520 cores. > > > > ----- Original Message ----- > > > From: "Mihael Hategan" > > > To: "David Kelly" > > > Cc: "swift-devel Devel" > > > Sent: Tuesday, October 16, 2012 2:17:06 PM > > > Subject: Re: [Swift-devel] foreach.max.threads question > > > How many cores do you run this on? > > > > > > On Tue, 2012-10-16 at 13:56 -0500, David Kelly wrote: > > > > Previously it was not explicitly set, so I am assuming it would > > > > have > > > > been 1024. As a test I tried setting it to 520 (the maximum > > > > number > > > > of available cores), but that did not seem to help. > > > > > > > > ----- Original Message ----- > > > > > From: "Mihael Hategan" > > > > > To: "David Kelly" > > > > > Cc: "swift-devel Devel" > > > > > Sent: Tuesday, October 16, 2012 1:44:47 PM > > > > > Subject: Re: [Swift-devel] foreach.max.threads question > > > > > What was foreach.max.threads set to before? > > > > > > > > > > On Tue, 2012-10-16 at 11:04 -0500, David Kelly wrote: > > > > > > Hello, > > > > > > > > > > > > I have noticed that since the foreach.max.threads changes, > > > > > > the > > > > > > DSSAT > > > > > > script is now running out of memory. I have the heap size > > > > > > set to > > > > > > 4 > > > > > > gigabytes. There are 120K items in gridLists. 
> > > > > > The main foreach loop of the script looks like this: > > > > > > > > > > > > foreach g,i in gridLists { > > > > > > file tar_output > > > > > file=@strcat("output/", > > > > > > gridLists[i], "output.tar.gz")>; > > > > > > file part_output > > > > > file=@strcat("parts/", > > > > > > gridLists[i], ".part")>; > > > > > > > > > > > > file in1[] > > > > > location=@strcat(@arg("scenarios"), > > > > > > "/", gridLists[i]), pattern="*">; // Scenario files > > > > > > file in2[] > > > > > location=@strcat(@arg("weather"), > > > > > > "/", gridLists[i]), pattern="*">; // Weather files > > > > > > file in3[] > > > > > pattern="*">; // Common data > > > > > > file in4[] > > > > > pattern="*.EXE">; // Binaries > > > > > > file in5[] > > > > > pattern="*.pl">; // Perl scripts > > > > > > file wrapper ; // > > > > > > RunDSSAT wrapper > > > > > > > > > > > > (tar_output, part_output) = RunDSSAT(xfile, in1, in2, > > > > > > in3, > > > > > > in4, > > > > > > in5, wrapper); > > > > > > } > > > > > > > > > > > > Is there any way to throttle foreach again, or any other > > > > > > workarounds > > > > > > I could use to avoid this? > > > > > > > > > > > > Thanks, > > > > > > David > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Oct 16 18:41:18 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Oct 2012 16:41:18 -0700 Subject: [Swift-devel] foreach.max.threads question In-Reply-To: <1996203325.127536.1350422380287.JavaMail.root@zimbra-mb2.anl.gov> References: <1996203325.127536.1350422380287.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1350430878.32618.6.camel@blabla> On Tue, 2012-10-16 at 16:19 -0500, David Kelly wrote: > I set the max threads to 520 and saw the same behavior. The only > difference is that I had to do the work on OSG rather than UC3 since > all nodes were in use. Is it possible that since there is no foreach > throttle, all the files and file arrays needed for each of the 120K > tasks are getting set at once? That's correct. I guess that was a bit of wishful thinking on my side. I will revert the change. In the mean-time, please use the revision before the change. Mihael From lpesce at uchicago.edu Wed Oct 17 09:21:40 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Wed, 17 Oct 2012 09:21:40 -0500 Subject: [Swift-devel] [swift-support] Channel multiplexer error In-Reply-To: <1350272934.15993.5.camel@blabla> References: <73B3B09E-B386-4B6D-A9D6-58EEB00B747C@uchicago.edu> <931001436.34144.1349713641059.JavaMail.root@zimbra.anl.gov> <1350000735.8355.0.camel@blabla> <1350272934.15993.5.camel@blabla> Message-ID: OK, I am trying to rerun the scripts i have using the official version of swift on beagle (that should be trunk) and put the value of foreach... to 200 Silly problem: RunID: 20121017-1420-55eqmvlb Execution failed: Could not aquire exclusive lock on log file: /lustre/beagle/GCNet/RG/Oreo/o080522_BS1/demo_real-20121004-0239-gcqgkdrf.0.rlog I had to kill swift with -9 because it did not want to die before. What is the lock file I need to remove? On Oct 14, 2012, at 10:48 PM, Mihael Hategan wrote: > I spoke to Mike on the phone on Friday, and we agreed that > foreach.max.threads is a bit difficult to use. > > So I removed the throttling for foreach and added it to app invocations. 
> It might help with memory consumption if you set it to around the number > of cpus you have access to. > > I also did some small optimizations to improve memory use. I can run > about 50K jobs with your script on a 32 bit system with 1GB of heap > space. I suspect that on 64 bit systems this might require more heap. > > I also added an option to the swift executable to automatically dump a > copy of the heap when an out of memory condition occurs. Hopefully that > will help us troubleshoot such problems in the future. > > This is all in trunk. > > Mihael > > On Sat, 2012-10-13 at 04:47 -0500, Kazutaka Takahashi wrote: >> Hi All, >> >> Sorry for a late reply, but I am already at a conference and had not >> had a chance to try what Mike proposed. I will try later during the >> second half of the conference, if not after the conference ending next >> wed. >> Taka >> >> >> On Thu, Oct 11, 2012 at 7:12 PM, Mihael Hategan >> wrote: >> Did you try what Mike proposed? >> >> Mihael >> >> On Thu, 2012-10-11 at 17:49 -0500, Kazutaka Takahashi wrote: >>> OK... >>> The last one died with the same error msg... Please check >> the both >>> directories... >>> >>> >> taka at login4:/lustre/beagle/GCNet/RG/Athena/a080521_BS_new/Cond2> >>> >> taka at login4:/lustre/beagle/GCNet/RG/Athena/a080521_BS_new/Cond3> >>> >>> /lustre/beagle/GCNet/bin/Swift/swift: line 164: 8527 Killed >>> java -Xmx5120M >> -Djava.endorsed.dirs=/soft/swift/0.93//lib/endorsed >>> -DUID=3023 -DGLOBUS_HOSTNAME=login4.beagle.ci.uchicago.edu >>> -DCOG_INSTALL_PATH=/soft/swift/0.93/ >> -Dswift.home=/soft/swift/0.93/ >>> -Duser.home=/lustre/beagle/GCNet >>> -Djava.security.egd=file:///dev/urandom -XX:+UseParallelGC >>> -XX:ParallelGCThreads=1 >>> >> -classpath /soft/swift/0.93//etc:/soft/swift/0.93//libexec:/soft/swift/0.93//lib/addressing-1.0.jar:/soft/swift/0.93//lib/ant.jar:/soft/swift/0.93//lib/antlr-2.7.5.jar:/soft/swift/0.93//lib/axis.jar:/soft/swift/0.93//lib/axis-url.jar:/soft/swift/0.93//lib/castor-0.9.6.jar:/soft/swift/0.93//lib/coaster-bootstrap.jar:/soft/swift/0.93//lib/cog-abstraction-common-2.4.jar:/soft/swift/0.93//lib/cog-axis.jar:/soft/swift/0.93//lib/cog-grapheditor-0.47.jar:/soft/swift/0.93//lib/cog-jglobus-1.7.0.jar:/soft/swift/0.93//lib/cog-karajan-0.36-dev.jar:/soft/swift/0.93//lib/cog-provider-clref-gt4_0_0.jar:/soft/swift/0.93//lib/cog-provider-coaster-0.3.jar:/soft/swift/0.93//lib/cog-provider-dcache-0.1.jar:/soft/swift/0.93//lib/cog-provider-gt2-2.4.jar:/soft/swift/0.93//lib/cog-provider-gt4_0_0-2.5.jar:/soft/swift/0.93//lib/cog-provider-local-2.2.jar:/soft/swift/0.93//lib/cog-provider-localscheduler-0.4.jar:/soft/swift/0.93//lib/cog-provider-ssh-2.4.jar:/soft/swift/0.93//lib/cog-provider-webdav-2.1.jar:/soft/swift/0.93//lib/cog-resources-1.0.jar:/soft/swift/0.93//lib/cog-swift-svn.jar:/soft/swift/0.93//lib/cog-trap-1.0.jar:/soft/swift/0.93//lib/cog-url.jar:/soft/swift/0.93//lib/cog-util-0.92.jar:/soft/swift/0.93//lib/commonj.jar:/soft/swift/0.93//lib/commons-beanutils.jar:/soft/swift/0.93//lib/commons-collections-3.0.jar:/soft/swift/0.93//lib/commons-digester.jar:/soft/swift/0.93//lib/commons-discovery.jar:/soft/swift/0.93//lib/commons-httpclient.jar:/soft/swift/0.93//lib/commons-logging-1.1.jar:/soft/swift/0.93//lib/concurrent.jar:/soft/swift/0.93//lib/cryptix32.jar:/soft/swift/0.93//lib/cryptix-asn1.jar:/soft/swift/0.93//lib/cryptix.jar:/soft/swift/0.93//lib/globus_delegation_service.jar:/soft/swift/0.93//lib/globus_delegation_stubs.jar:/soft/swift/0.93//lib/globus_wsrf_mds_aggregator_stubs.jar:/soft/swift/0.93/
/lib/globus_wsrf_rendezvous_service.jar:/soft/swift/0.93//lib/globus_wsrf_rendezvous_stubs.jar:/soft/swift/0.93//lib/globus_wsrf_rft_stubs.jar:/soft/swift/0.93//lib/gram-client.jar:/soft/swift/0.93//lib/gram-stubs.jar:/soft/swift/0.93//lib/gram-utils.jar:/soft/swift/0.93//lib/j2ssh-common-0.2.2.jar:/soft/swift/0.93//lib/j2ssh-core-0.2.2-patch-b.jar:/soft/swift/0.93//lib/jakarta-regexp-1.2.jar:/soft/swift/0.93//lib/jakarta-slide-webdavlib-2.0.jar:/soft/swift/0.93//lib/jaxrpc.jar:/soft/swift/0.93//lib/jce-jdk13-131.jar:/soft/swift/0.93//lib/jgss.jar:/soft/swift/0.93//lib/jline-0.9.94.jar:/soft/swift/0.93//lib/jsr173_1.0_api.jar:/soft/swift/0.93//lib/jug-lgpl-2.0.0.jar:/soft/swift/0.93//lib/junit.jar:/soft/swift/0.93//lib/log4j-1.2.16.jar:/soft/swift/0.93//lib/naming-common.jar:/soft/swift/0.93//lib/naming-factory.jar:/soft/swift/0.93//lib/naming-java.jar:/soft/swift/0.93//lib/naming-resources.jar:/soft/swift/0.93//lib/opensaml.jar:/soft/swift/0.93//lib/puretls.jar:/soft/swift/0.93//lib/resolver.jar:/soft/swift/0.93//lib/saaj.jar:/soft/swift/0.93//lib/stringtemplate.jar:/soft/swift/0.93//lib/vdldefinitions.jar:/soft/swift/0.93//lib/wsdl4j.jar:/soft/swift/0.93//lib/wsrf_core.jar:/soft/swift/0.93//lib/wsrf_core_stubs.jar:/soft/swift/0.93//lib/wsrf_mds_index_stubs.jar:/soft/swift/0.93//lib/wsrf_mds_usefulrp_schema_stubs.jar:/soft/swift/0.93//lib/wsrf_provider_jce.jar:/soft/swift/0.93//lib/wsrf_tools.jar:/soft/swift/0.93//lib/wss4j.jar:/soft/swift/0.93//lib/xalan.jar:/soft/swift/0.93//lib/xbean.jar:/soft/swift/0.93//lib/xbean_xpath.jar:/soft/swift/0.93//lib/xercesImpl.jar:/soft/swift/0.93//lib/xml-apis.jar:/soft/swift/0.93//lib/xmlsec.jar:/soft/swift/0.93//lib/xpp3-1.1.3.4d_b4_min.jar:/soft/swift/0.93//lib/xstream-1.1.1-patched.jar: org.griphyn.vdl.karajan.Loader '-config' 'demo_realcf.cf' '-sites.file' 'demo_realSites.xml' '-tc.file' 'demo_realtc.tc' 'demo_real.swift' >>> >>> >> >> >> >> >> >> >> -- >> What is essential is invisible to the eye >> > > From tim.g.armstrong at gmail.com Thu Oct 18 10:20:06 2012 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 18 Oct 2012 10:20:06 -0500 Subject: [Swift-devel] Minor question about semantics Message-ID: I'm just looking at app function semantics and had a few questions about how some things worked that weren't clear from the user manual. There were a few things I didn't quite get and wanted to make sure I wasn't missing something. Is there any difference between @x and @filename(x) in an app command line? E.g. app (binaryfile bf) myproc (int i, string s="foo") { myapp i s @filename(bf); } versus app (binaryfile bf) myproc (int i, string s="foo") { myapp i s @bf; } Also, what is the intended behaviour if you omit the @ in front of the variable name? Is this valid? E.g. app (binaryfile bf) myproc (int i, string s="foo") { myapp i s bf; } Cheers, Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Oct 18 13:58:10 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 18 Oct 2012 13:58:10 -0500 (CDT) Subject: [Swift-devel] Minor question about semantics In-Reply-To: Message-ID: <728906983.52895.1350586690342.JavaMail.root@zimbra.anl.gov> Tim, here's what I think the answers are, but others should verify what I say here. > Is there any difference between @x and @filename(x) in an app command > line? E.g. 
> > app (binaryfile bf) myproc (int i, string s="foo") { > myapp i s @filename(bf); > } > > versus > > app (binaryfile bf) myproc ( int i, string s= "foo" ) { > myapp i s @ bf; > } These should behave identically, except that I don't know if lexically one can leave a space between @ and the variable (i.e. I think it needs to be @bf vs @ bf). I assume you weren't asking about the space. The place where I believe that @bf and @filename(bf) behave differently is that the shorthand @bf is not accepted syntactically everywhere. I thought we have an open ticket on this but I cant locate it. We need to test, but I think @f doesn't work in ordinary expressions, where you would expect it to be equivalent to @filename(f). I.e. it only works on an app() command line template. I think that @f should always be identical to @filename(f) in all cases. If we deprecate the use of @ as a prefix for intrinsic functions (as I agree we should) then we perhaps want to retain the use of @f as a syntactic shorthand for filename(f). In swift/t, for a transition period, can we just ignore the @ in all expressions of the form @fname(args)? Perhaps with a deprecation warning? > Also, what is the intended behaviour if you omit the @ in front of the > variable name? Is this valid? E.g. > > app (binaryfile bf) myproc ( int i, string s= "foo" ) { > myapp i s bf; > } As I recall this inserts a string representation of the file object on the command line, as if one is doing a tracef() on the file object instead of its mapped filename string. That seems a somewhat useful behavior for debugging but is seldom what you want to running an app() command. - Mike > > Cheers, > Tim > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Oct 18 14:07:34 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 18 Oct 2012 12:07:34 -0700 Subject: [Swift-devel] Minor question about semantics In-Reply-To: <728906983.52895.1350586690342.JavaMail.root@zimbra.anl.gov> References: <728906983.52895.1350586690342.JavaMail.root@zimbra.anl.gov> Message-ID: <1350587254.2303.3.camel@blabla> I'll add that at least at one point we discussed the fact that the symbol "@" as a mechanism to distinguish between an app/compound and a built-in function is not really necessary (and a bit distasteful), as well as limiting (one may want to be able to have user-defined functions that can be invoked as part of the app command line). Mihael On Thu, 2012-10-18 at 13:58 -0500, Michael Wilde wrote: > Tim, here's what I think the answers are, but others should verify what I say here. > > > Is there any difference between @x and @filename(x) in an app command > > line? E.g. > > > > app (binaryfile bf) myproc (int i, string s="foo") { > > myapp i s @filename(bf); > > } > > > > versus > > > > app (binaryfile bf) myproc ( int i, string s= "foo" ) { > > myapp i s @ bf; > > } > > These should behave identically, except that I don't know if lexically one can leave a space between @ and the variable (i.e. I think it needs to be @bf vs @ bf). I assume you weren't asking about the space. > > The place where I believe that @bf and @filename(bf) behave differently is that the shorthand @bf is not accepted syntactically everywhere. 
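
To make the cases in this thread concrete, here is a small sketch of the spellings being discussed, assuming an echo executable registered in tc.file; the file names are illustrative, and the behavior stated in the comments is Mike's and Mihael's description above rather than something verified independently.

    type file;

    app (file o) echoName (file bf) {
        // @bf and @filename(bf) are both intended to expand to the mapped
        // filename of bf on an app command line; per the discussion above,
        // the @bf shorthand is not accepted in every ordinary expression.
        // A bare "bf" (no @) would instead insert a string rendering of the
        // file object, which is rarely what an app invocation wants.
        echo @bf @filename(bf) stdout=@filename(o);
    }

    file data <"input.dat">;   // illustrative mappings
    file out  <"names.txt">;
    out = echoName(data);
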
I thought we have an open ticket on this but I cant locate it. > > We need to test, but I think @f doesn't work in ordinary expressions, where you would expect it to be equivalent to @filename(f). I.e. it only works on an app() command line template. > > I think that @f should always be identical to @filename(f) in all cases. If we deprecate the use of @ as a prefix for intrinsic functions (as I agree we should) then we perhaps want to retain the use of @f as a syntactic shorthand for filename(f). > > In swift/t, for a transition period, can we just ignore the @ in all expressions of the form @fname(args)? Perhaps with a deprecation warning? > > > Also, what is the intended behaviour if you omit the @ in front of the > > variable name? Is this valid? E.g. > > > > app (binaryfile bf) myproc ( int i, string s= "foo" ) { > > myapp i s bf; > > } > > As I recall this inserts a string representation of the file object on the command line, as if one is doing a tracef() on the file object instead of its mapped filename string. > > That seems a somewhat useful behavior for debugging but is seldom what you want to running an app() command. > > - Mike > > > > > Cheers, > > Tim > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Fri Oct 19 15:10:01 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 19 Oct 2012 15:10:01 -0500 (CDT) Subject: [Swift-devel] coaster error on Midway Message-ID: <827474706.55305.1350677401813.JavaMail.root@zimbra.anl.gov> Im getting the error below on Midway (running the swift from module load swift) Is this a know issue, with a known fix? I'll try latest trunk next. This happened both without and with provider staging. I will post bug with log files if it recurs with latest trunk. - Mike Swift trunk swift-r5939 cog-r3472 RunID: 20121019-2005-yptzr8l6 Progress: time: Fri, 19 Oct 2012 20:05:54 +0000 Progress: time: Fri, 19 Oct 2012 20:05:56 +0000 Stage in:1 Submitted:99 Progress: time: Fri, 19 Oct 2012 20:05:57 +0000 Stage in:33 Submitted:67 Progress: time: Fri, 19 Oct 2012 20:05:58 +0000 Stage in:56 Submitted:36 Active:8 Progress: time: Fri, 19 Oct 2012 20:05:59 +0000 Stage in:65 Submitted:3 Active:32 Progress: time: Fri, 19 Oct 2012 20:06:00 +0000 Stage in:51 Active:49 Progress: time: Fri, 19 Oct 2012 20:06:08 +0000 Active:99 Stage out:1 Progress: time: Fri, 19 Oct 2012 20:06:24 +0000 Active:96 Finished successfully:4 Progress: time: Fri, 19 Oct 2012 20:06:28 +0000 Active:95 Stage out:1 Finished successfully:4 Progress: time: Fri, 19 Oct 2012 20:06:29 +0000 Active:74 Stage out:12 Finished successfully:14 Progress: time: Fri, 19 Oct 2012 20:06:30 +0000 Active:52 Stage out:3 Finished successfully:45 Progress: time: Fri, 19 Oct 2012 20:06:31 +0000 Active:25 Stage out:24 Finished successfully:51 Exception caught while processing reply java.lang.IllegalArgumentException: Wrong data size: 4. 
Data was @].z at org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) at org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) at org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) at org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) at org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) Exception caught while processing reply java.lang.IllegalArgumentException: Wrong data size: 4. Data was @].z at org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) at org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) at org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) at org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) at org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) Exception caught while processing reply java.lang.IllegalArgumentException: Wrong data size: 4. Data was @].z at org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) at org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) at org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) at org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) at org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) Progress: time: Fri, 19 Oct 2012 20:06:33 +0000 Stage out:10 Finished successfully:90 Final status: Fri, 19 Oct 2012 20:06:33 +0000 Finished successfully:100 mid$ -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From davidk at ci.uchicago.edu Fri Oct 19 17:59:56 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Fri, 19 Oct 2012 17:59:56 -0500 (CDT) Subject: [Swift-devel] coaster error on Midway In-Reply-To: <827474706.55305.1350677401813.JavaMail.root@zimbra.anl.gov> Message-ID: <241664185.145376.1350687596439.JavaMail.root@zimbra-mb2.anl.gov> I remember something like this happening in one of the versions of trunk - I believe it has been fixed in the latest version. ----- Original Message ----- > From: "Michael Wilde" > To: "David Kelly" , "Mihael Hategan" > Cc: "swift-devel" > Sent: Friday, October 19, 2012 3:10:01 PM > Subject: coaster error on Midway > Im getting the error below on Midway (running the swift from module > load swift) > > Is this a know issue, with a known fix? > > I'll try latest trunk next. 
> > This happened both without and with provider staging. > > I will post bug with log files if it recurs with latest trunk. > > - Mike > > Swift trunk swift-r5939 cog-r3472 > > RunID: 20121019-2005-yptzr8l6 > Progress: time: Fri, 19 Oct 2012 20:05:54 +0000 > Progress: time: Fri, 19 Oct 2012 20:05:56 +0000 Stage in:1 > Submitted:99 > Progress: time: Fri, 19 Oct 2012 20:05:57 +0000 Stage in:33 > Submitted:67 > Progress: time: Fri, 19 Oct 2012 20:05:58 +0000 Stage in:56 > Submitted:36 Active:8 > Progress: time: Fri, 19 Oct 2012 20:05:59 +0000 Stage in:65 > Submitted:3 Active:32 > Progress: time: Fri, 19 Oct 2012 20:06:00 +0000 Stage in:51 Active:49 > Progress: time: Fri, 19 Oct 2012 20:06:08 +0000 Active:99 Stage out:1 > Progress: time: Fri, 19 Oct 2012 20:06:24 +0000 Active:96 Finished > successfully:4 > Progress: time: Fri, 19 Oct 2012 20:06:28 +0000 Active:95 Stage out:1 > Finished successfully:4 > Progress: time: Fri, 19 Oct 2012 20:06:29 +0000 Active:74 Stage out:12 > Finished successfully:14 > Progress: time: Fri, 19 Oct 2012 20:06:30 +0000 Active:52 Stage out:3 > Finished successfully:45 > Progress: time: Fri, 19 Oct 2012 20:06:31 +0000 Active:25 Stage out:24 > Finished successfully:51 > Exception caught while processing reply > java.lang.IllegalArgumentException: Wrong data size: 4. Data was @].z > at > org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) > at > org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) > at > org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) > at > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) > at > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) > at > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) > Exception caught while processing reply > java.lang.IllegalArgumentException: Wrong data size: 4. Data was @].z > at > org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) > at > org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) > at > org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) > at > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) > at > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) > at > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) > Exception caught while processing reply > java.lang.IllegalArgumentException: Wrong data size: 4. 
Data was @].z > at > org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) > at > org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) > at > org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) > at > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) > at > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) > at > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) > Progress: time: Fri, 19 Oct 2012 20:06:33 +0000 Stage out:10 Finished > successfully:90 > Final status: Fri, 19 Oct 2012 20:06:33 +0000 Finished > successfully:100 > mid$ > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory From wilde at mcs.anl.gov Fri Oct 19 18:04:49 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 19 Oct 2012 18:04:49 -0500 (CDT) Subject: [Swift-devel] coaster error on Midway In-Reply-To: <241664185.145376.1350687596439.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1077353399.55624.1350687889533.JavaMail.root@zimbra.anl.gov> Indeed, when I used latest trunk, the error has not recurred, and Ive sone several 1000-job runs of the EpiSnp app. But that means that the trunk version in Midway's default swift module has this bug. Lets talk on Monday about how to make and test an 0.94 release, which I would define simply as "a trustable trunk snapshot", and then to get that out to all systems and users. - Mike ----- Original Message ----- > From: "David Kelly" > To: "Michael Wilde" > Cc: "swift-devel" , "Mihael Hategan" > Sent: Friday, October 19, 2012 5:59:56 PM > Subject: Re: coaster error on Midway > I remember something like this happening in one of the versions of > trunk - I believe it has been fixed in the latest version. > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "David Kelly" , "Mihael Hategan" > > > > Cc: "swift-devel" > > Sent: Friday, October 19, 2012 3:10:01 PM > > Subject: coaster error on Midway > > Im getting the error below on Midway (running the swift from module > > load swift) > > > > Is this a know issue, with a known fix? > > > > I'll try latest trunk next. > > > > This happened both without and with provider staging. > > > > I will post bug with log files if it recurs with latest trunk. 
> > > > - Mike > > > > Swift trunk swift-r5939 cog-r3472 > > > > RunID: 20121019-2005-yptzr8l6 > > Progress: time: Fri, 19 Oct 2012 20:05:54 +0000 > > Progress: time: Fri, 19 Oct 2012 20:05:56 +0000 Stage in:1 > > Submitted:99 > > Progress: time: Fri, 19 Oct 2012 20:05:57 +0000 Stage in:33 > > Submitted:67 > > Progress: time: Fri, 19 Oct 2012 20:05:58 +0000 Stage in:56 > > Submitted:36 Active:8 > > Progress: time: Fri, 19 Oct 2012 20:05:59 +0000 Stage in:65 > > Submitted:3 Active:32 > > Progress: time: Fri, 19 Oct 2012 20:06:00 +0000 Stage in:51 > > Active:49 > > Progress: time: Fri, 19 Oct 2012 20:06:08 +0000 Active:99 Stage > > out:1 > > Progress: time: Fri, 19 Oct 2012 20:06:24 +0000 Active:96 Finished > > successfully:4 > > Progress: time: Fri, 19 Oct 2012 20:06:28 +0000 Active:95 Stage > > out:1 > > Finished successfully:4 > > Progress: time: Fri, 19 Oct 2012 20:06:29 +0000 Active:74 Stage > > out:12 > > Finished successfully:14 > > Progress: time: Fri, 19 Oct 2012 20:06:30 +0000 Active:52 Stage > > out:3 > > Finished successfully:45 > > Progress: time: Fri, 19 Oct 2012 20:06:31 +0000 Active:25 Stage > > out:24 > > Finished successfully:51 > > Exception caught while processing reply > > java.lang.IllegalArgumentException: Wrong data size: 4. Data was > > @].z > > at > > org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) > > at > > org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) > > at > > org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) > > at > > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) > > at > > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) > > Exception caught while processing reply > > java.lang.IllegalArgumentException: Wrong data size: 4. Data was > > @].z > > at > > org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) > > at > > org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) > > at > > org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) > > at > > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) > > at > > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) > > Exception caught while processing reply > > java.lang.IllegalArgumentException: Wrong data size: 4. 
Data was > > @].z > > at > > org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) > > at > > org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) > > at > > org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) > > at > > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) > > at > > org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) > > Progress: time: Fri, 19 Oct 2012 20:06:33 +0000 Stage out:10 > > Finished > > successfully:90 > > Final status: Fri, 19 Oct 2012 20:06:33 +0000 Finished > > successfully:100 > > mid$ > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From lpesce at uchicago.edu Mon Oct 22 10:35:06 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Mon, 22 Oct 2012 10:35:06 -0500 Subject: [Swift-devel] coaster error on Midway In-Reply-To: <1077353399.55624.1350687889533.JavaMail.root@zimbra.anl.gov> References: <1077353399.55624.1350687889533.JavaMail.root@zimbra.anl.gov> Message-ID: <38F69461-BA4B-4D2C-97C4-4C26932286D2@uchicago.edu> Is there a current trunk module installed on Beagle? It would be handy, since usually our users end up needing to use trunk.... On Oct 19, 2012, at 6:04 PM, Michael Wilde wrote: > Indeed, when I used latest trunk, the error has not recurred, and Ive sone several 1000-job runs of the EpiSnp app. > > But that means that the trunk version in Midway's default swift module has this bug. > > Lets talk on Monday about how to make and test an 0.94 release, which I would define simply as "a trustable trunk snapshot", and then to get that out to all systems and users. > > - Mike > > > > > ----- Original Message ----- >> From: "David Kelly" >> To: "Michael Wilde" >> Cc: "swift-devel" , "Mihael Hategan" >> Sent: Friday, October 19, 2012 5:59:56 PM >> Subject: Re: coaster error on Midway >> I remember something like this happening in one of the versions of >> trunk - I believe it has been fixed in the latest version. >> >> ----- Original Message ----- >>> From: "Michael Wilde" >>> To: "David Kelly" , "Mihael Hategan" >>> >>> Cc: "swift-devel" >>> Sent: Friday, October 19, 2012 3:10:01 PM >>> Subject: coaster error on Midway >>> Im getting the error below on Midway (running the swift from module >>> load swift) >>> >>> Is this a know issue, with a known fix? >>> >>> I'll try latest trunk next. >>> >>> This happened both without and with provider staging. >>> >>> I will post bug with log files if it recurs with latest trunk. 
>>> >>> - Mike >>> >>> Swift trunk swift-r5939 cog-r3472 >>> >>> RunID: 20121019-2005-yptzr8l6 >>> Progress: time: Fri, 19 Oct 2012 20:05:54 +0000 >>> Progress: time: Fri, 19 Oct 2012 20:05:56 +0000 Stage in:1 >>> Submitted:99 >>> Progress: time: Fri, 19 Oct 2012 20:05:57 +0000 Stage in:33 >>> Submitted:67 >>> Progress: time: Fri, 19 Oct 2012 20:05:58 +0000 Stage in:56 >>> Submitted:36 Active:8 >>> Progress: time: Fri, 19 Oct 2012 20:05:59 +0000 Stage in:65 >>> Submitted:3 Active:32 >>> Progress: time: Fri, 19 Oct 2012 20:06:00 +0000 Stage in:51 >>> Active:49 >>> Progress: time: Fri, 19 Oct 2012 20:06:08 +0000 Active:99 Stage >>> out:1 >>> Progress: time: Fri, 19 Oct 2012 20:06:24 +0000 Active:96 Finished >>> successfully:4 >>> Progress: time: Fri, 19 Oct 2012 20:06:28 +0000 Active:95 Stage >>> out:1 >>> Finished successfully:4 >>> Progress: time: Fri, 19 Oct 2012 20:06:29 +0000 Active:74 Stage >>> out:12 >>> Finished successfully:14 >>> Progress: time: Fri, 19 Oct 2012 20:06:30 +0000 Active:52 Stage >>> out:3 >>> Finished successfully:45 >>> Progress: time: Fri, 19 Oct 2012 20:06:31 +0000 Active:25 Stage >>> out:24 >>> Finished successfully:51 >>> Exception caught while processing reply >>> java.lang.IllegalArgumentException: Wrong data size: 4. Data was >>> @].z >>> at >>> org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) >>> at >>> org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) >>> at >>> org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) >>> at >>> org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) >>> at >>> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) >>> at >>> org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) >>> at >>> org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) >>> Exception caught while processing reply >>> java.lang.IllegalArgumentException: Wrong data size: 4. Data was >>> @].z >>> at >>> org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) >>> at >>> org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) >>> at >>> org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) >>> at >>> org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) >>> at >>> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) >>> at >>> org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) >>> at >>> org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) >>> Exception caught while processing reply >>> java.lang.IllegalArgumentException: Wrong data size: 4. 
Data was >>> @].z >>> at >>> org.globus.cog.karajan.workflow.service.RequestReply.unpackLong(RequestReply.java:237) >>> at >>> org.globus.cog.karajan.workflow.service.RequestReply.getInDataAsLong(RequestReply.java:232) >>> at >>> org.globus.cog.karajan.workflow.service.commands.HeartBeatCommand.replyReceived(HeartBeatCommand.java:40) >>> at >>> org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleReply(AbstractKarajanChannel.java:401) >>> at >>> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:234) >>> at >>> org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97) >>> at >>> org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56) >>> Progress: time: Fri, 19 Oct 2012 20:06:33 +0000 Stage out:10 >>> Finished >>> successfully:90 >>> Final status: Fri, 19 Oct 2012 20:06:33 +0000 Finished >>> successfully:100 >>> mid$ >>> >>> >>> -- >>> Michael Wilde >>> Computation Institute, University of Chicago >>> Mathematics and Computer Science Division >>> Argonne National Laboratory > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From davidk at ci.uchicago.edu Tue Oct 23 00:28:39 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Tue, 23 Oct 2012 00:28:39 -0500 (CDT) Subject: [Swift-devel] Swift 0.94 release planning In-Reply-To: <1192836350.154803.1350968155597.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <40830745.154854.1350970119511.JavaMail.root@zimbra-mb2.anl.gov> Hello all, I just wanted to let you all know that I've created branches to prepare for an eventual 0.94 release. The SVN paths are: https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.10/src/cog https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.94 I will be working to add more tests to the suite, and to make sure that any known issues are documented in bugzilla. Regards, David From ketancmaheshwari at gmail.com Tue Oct 23 10:34:25 2012 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Tue, 23 Oct 2012 11:34:25 -0400 Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider Message-ID: Hi, I am trying to run an experiment on a 32-core machine with the hope of running 8, 16, 24 and 32 jobs in parallel. I am trying to control these numbers of parallel jobs by setting the Karajan jobthrottle values in sites.xml to 0.07, 0.15, and so on. However, it seems that the values are not corresponding to what I see in the Swift progress text. Initially, when I set jobthrottle to 0.07, only 2 jobs started in parallel. Then I added the line setting "Initialscore" value to 10000, which improved the jobs to 5. After this a 10-fold increase in "initialscore" did not improve the jobs count. Furthermore, a new batch of 5 jobs get started only when *all* jobs from the old batch are over as opposed to a continuous supply of jobs from "site selection" to "stage out" state which happens in the case of coaster and other providers. The behavior is same in Swift 0.93.1 and latest trunk. Thank you for any clues on how to set the expected number of parallel jobs to these values. Please find attached one such log of this run. 
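For reference, the throttle settings being varied here live in the sites.xml pool definition; a minimal local-provider pool expressing them looks roughly like the sketch below. The handle and workdirectory are placeholders, and only the two profile values come from the description above:

  <pool handle="localhost">
    <execution provider="local"/>
    <filesystem provider="local"/>
    <workdirectory>/path/to/swift.workdir</workdirectory>
    <profile namespace="karajan" key="jobThrottle">0.07</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
  </pool>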
Thanks, -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mars-localproviderlog.tgz Type: application/x-gzip Size: 112743 bytes Desc: not available URL: From davidk at ci.uchicago.edu Tue Oct 23 11:47:02 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Tue, 23 Oct 2012 11:47:02 -0500 (CDT) Subject: [Swift-devel] Sourceforge down Message-ID: <1681942315.157497.1351010822027.JavaMail.root@zimbra-mb2.anl.gov> Just an FYI - sourceforge has been having issues today and we seem to be unable to access the CoG SVN. https://sourceforge.net/blog/various-sourceforge-services-down/ From wilde at mcs.anl.gov Tue Oct 23 12:23:34 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 23 Oct 2012 12:23:34 -0500 (CDT) Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: Message-ID: <601640800.60237.1351013014109.JavaMail.root@zimbra.anl.gov> Hi Ketan, In the log you attached I see this: 0.10 100000 You should leave initialScore constant, and set to a large number, no matter what level of manual throttling you want to specify via sites.xml. We always use 10000 for this value. Don't attempt to vary the initialScore value for manual throttle: just use jobThrottle to set what you want. A jobThrottle value of 0.10 should run 11 jobs in parallel (jobThrottle * 100) + 1 (for historical reasons related to the automatic throttling algorithm). If you are seeing less than that, one common cause is that the ratio of your input staging times to your job run times is so high as to make it impossible for Swift to keep the expected/desired number of jobs in active state at once. I suggest you test the throttle behavior with a simple app script like "catsnsleep" (catsn with an artificial sleep to increase job duration). If your settings (sites + cf) work for that test, then they should work for the real app, within the staging constraints. Using CDM "direct" mode is likely what you want here to eliminate unnecessary staging on a local cluster. In your test, what was this ratio? Can you also post your cf file and the progress log from stdout/stderr? - Mike ----- Original Message ----- > From: "Ketan Maheshwari" > To: "Swift Devel" > Sent: Tuesday, October 23, 2012 10:34:25 AM > Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider > Hi, > > > I am trying to run an experiment on a 32-core machine with the hope of > running 8, 16, 24 and 32 jobs in parallel. I am trying to control > these numbers of parallel jobs by setting the Karajan jobthrottle > values in sites.xml to 0.07, 0.15, and so on. > > > However, it seems that the values are not corresponding to what I see > in the Swift progress text. > > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in > parallel. Then I added the line setting "Initialscore" value to 10000, > which improved the jobs to 5. After this a 10-fold increase in > "initialscore" did not improve the jobs count. > > > Furthermore, a new batch of 5 jobs get started only when *all* jobs > from the old batch are over as opposed to a continuous supply of jobs > from "site selection" to "stage out" state which happens in the case > of coaster and other providers. > > > The behavior is same in Swift 0.93.1 and latest trunk. > > > > Thank you for any clues on how to set the expected number of parallel > jobs to these values. 
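Working that (jobThrottle * 100) + 1 rule backwards for the core counts targeted in this thread gives, approximately:

  jobThrottle 0.07 -> 8 concurrent jobs
  jobThrottle 0.10 -> 11 concurrent jobs (the example above)
  jobThrottle 0.15 -> 16 concurrent jobs
  jobThrottle 0.23 -> 24 concurrent jobs
  jobThrottle 0.31 -> 32 concurrent jobs

These figures follow only from the formula quoted above; they are targets, not measurements.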
> > > Please find attached one such log of this run. > Thanks, -- > Ketan > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Oct 23 13:36:42 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 23 Oct 2012 13:36:42 -0500 (CDT) Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: <601640800.60237.1351013014109.JavaMail.root@zimbra.anl.gov> Message-ID: <1140982512.60437.1351017402648.JavaMail.root@zimbra.anl.gov> Ketan, looking further I see that your app has a large number of output files, O(100). Depending on their size, and the speed of the filesystem on which you are testing, that re-inforces my suspicion that low concurrency you are seeing is due to staging IO. If this is a local 32-core host, try running with your input and output data and workdirectory all on a local hard disk (or even /dev/shm if it has sufficient RAM/space). Then try using CDM direct as explained at: http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases - Mike ----- Original Message ----- > From: "Michael Wilde" > To: "Ketan Maheshwari" > Cc: "Swift Devel" > Sent: Tuesday, October 23, 2012 12:23:34 PM > Subject: Re: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider > Hi Ketan, > > In the log you attached I see this: > > 0.10 > 100000 > > You should leave initialScore constant, and set to a large number, no > matter what level of manual throttling you want to specify via > sites.xml. We always use 10000 for this value. Don't attempt to vary > the initialScore value for manual throttle: just use jobThrottle to > set what you want. > > A jobThrottle value of 0.10 should run 11 jobs in parallel > (jobThrottle * 100) + 1 (for historical reasons related to the > automatic throttling algorithm). > > If you are seeing less than that, one common cause is that the ratio > of your input staging times to your job run times is so high as to > make it impossible for Swift to keep the expected/desired number of > jobs in active state at once. > > I suggest you test the throttle behavior with a simple app script like > "catsnsleep" (catsn with an artificial sleep to increase job > duration). If your settings (sites + cf) work for that test, then they > should work for the real app, within the staging constraints. Using > CDM "direct" mode is likely what you want here to eliminate > unnecessary staging on a local cluster. > > In your test, what was this ratio? Can you also post your cf file and > the progress log from stdout/stderr? > > - Mike > > ----- Original Message ----- > > From: "Ketan Maheshwari" > > To: "Swift Devel" > > Sent: Tuesday, October 23, 2012 10:34:25 AM > > Subject: [Swift-devel] jobthrottle value does not correspond to > > number of parallel jobs on local provider > > Hi, > > > > > > I am trying to run an experiment on a 32-core machine with the hope > > of > > running 8, 16, 24 and 32 jobs in parallel. I am trying to control > > these numbers of parallel jobs by setting the Karajan jobthrottle > > values in sites.xml to 0.07, 0.15, and so on. > > > > > > However, it seems that the values are not corresponding to what I > > see > > in the Swift progress text. 
> > > > > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in > > parallel. Then I added the line setting "Initialscore" value to > > 10000, > > which improved the jobs to 5. After this a 10-fold increase in > > "initialscore" did not improve the jobs count. > > > > > > Furthermore, a new batch of 5 jobs get started only when *all* jobs > > from the old batch are over as opposed to a continuous supply of > > jobs > > from "site selection" to "stage out" state which happens in the case > > of coaster and other providers. > > > > > > The behavior is same in Swift 0.93.1 and latest trunk. > > > > > > > > Thank you for any clues on how to set the expected number of > > parallel > > jobs to these values. > > > > > > Please find attached one such log of this run. > > Thanks, -- > > Ketan > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From ketancmaheshwari at gmail.com Tue Oct 23 14:02:15 2012 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Tue, 23 Oct 2012 15:02:15 -0400 Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: <1140982512.60437.1351017402648.JavaMail.root@zimbra.anl.gov> References: <601640800.60237.1351013014109.JavaMail.root@zimbra.anl.gov> <1140982512.60437.1351017402648.JavaMail.root@zimbra.anl.gov> Message-ID: Mike, Thank you for your answers. I tried catsnsleep with n=100 and s=10 and indeed the number of parallel jobs corresponded to the jobthrottle value. Surprisingly, when I started the mars application immediately after this, it also started 32 jobs in parallel. However, the run failed with "too many open files" error after a while. Now, I am trying cdm method. Will keep you posted. On Tue, Oct 23, 2012 at 2:36 PM, Michael Wilde wrote: > Ketan, looking further I see that your app has a large number of output > files, O(100). Depending on their size, and the speed of the filesystem on > which you are testing, that re-inforces my suspicion that low concurrency > you are seeing is due to staging IO. > > If this is a local 32-core host, try running with your input and output > data and workdirectory all on a local hard disk (or even /dev/shm if it has > sufficient RAM/space). Then try using CDM direct as explained at: > > > http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases > > - Mike > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "Ketan Maheshwari" > > Cc: "Swift Devel" > > Sent: Tuesday, October 23, 2012 12:23:34 PM > > Subject: Re: [Swift-devel] jobthrottle value does not correspond to > number of parallel jobs on local provider > > Hi Ketan, > > > > In the log you attached I see this: > > > > 0.10 > > 100000 > > > > You should leave initialScore constant, and set to a large number, no > > matter what level of manual throttling you want to specify via > > sites.xml. We always use 10000 for this value. 
Don't attempt to vary > > the initialScore value for manual throttle: just use jobThrottle to > > set what you want. > > > > A jobThrottle value of 0.10 should run 11 jobs in parallel > > (jobThrottle * 100) + 1 (for historical reasons related to the > > automatic throttling algorithm). > > > > If you are seeing less than that, one common cause is that the ratio > > of your input staging times to your job run times is so high as to > > make it impossible for Swift to keep the expected/desired number of > > jobs in active state at once. > > > > I suggest you test the throttle behavior with a simple app script like > > "catsnsleep" (catsn with an artificial sleep to increase job > > duration). If your settings (sites + cf) work for that test, then they > > should work for the real app, within the staging constraints. Using > > CDM "direct" mode is likely what you want here to eliminate > > unnecessary staging on a local cluster. > > > > In your test, what was this ratio? Can you also post your cf file and > > the progress log from stdout/stderr? > > > > - Mike > > > > ----- Original Message ----- > > > From: "Ketan Maheshwari" > > > To: "Swift Devel" > > > Sent: Tuesday, October 23, 2012 10:34:25 AM > > > Subject: [Swift-devel] jobthrottle value does not correspond to > > > number of parallel jobs on local provider > > > Hi, > > > > > > > > > I am trying to run an experiment on a 32-core machine with the hope > > > of > > > running 8, 16, 24 and 32 jobs in parallel. I am trying to control > > > these numbers of parallel jobs by setting the Karajan jobthrottle > > > values in sites.xml to 0.07, 0.15, and so on. > > > > > > > > > However, it seems that the values are not corresponding to what I > > > see > > > in the Swift progress text. > > > > > > > > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in > > > parallel. Then I added the line setting "Initialscore" value to > > > 10000, > > > which improved the jobs to 5. After this a 10-fold increase in > > > "initialscore" did not improve the jobs count. > > > > > > > > > Furthermore, a new batch of 5 jobs get started only when *all* jobs > > > from the old batch are over as opposed to a continuous supply of > > > jobs > > > from "site selection" to "stage out" state which happens in the case > > > of coaster and other providers. > > > > > > > > > The behavior is same in Swift 0.93.1 and latest trunk. > > > > > > > > > > > > Thank you for any clues on how to set the expected number of > > > parallel > > > jobs to these values. > > > > > > > > > Please find attached one such log of this run. > > > Thanks, -- > > > Ketan > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... 
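A minimal throttle test in the spirit of the catsnsleep suggestion can be sketched as below. The app name is arbitrary and would need a tc.data entry pointing at a one-line shell wrapper that cats its input and then sleeps for ten seconds or so; the input file name is a placeholder:

  type file;

  app (file o) catsleep (file i) {
    catsleep @i stdout=@o;
  }

  file data<"data.txt">;
  file out[];
  foreach j in [1:100] {
    out[j] = catsleep(data);
  }

With every job pinned to roughly the same duration, the Active count on the progress line should settle near (jobThrottle * 100) + 1.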
URL: From ketancmaheshwari at gmail.com Tue Oct 23 14:52:48 2012 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Tue, 23 Oct 2012 15:52:48 -0400 Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: References: <601640800.60237.1351013014109.JavaMail.root@zimbra.anl.gov> <1140982512.60437.1351017402648.JavaMail.root@zimbra.anl.gov> Message-ID: Now trying with cdm. My cdm policy file contains a single line as follows: rule .* DEFAULT / This seems to be working at stage in because I immediately see my jobs starting. However, it fails immediately after with a message: "Execution failed: The following output files were not created by the application:" Followed by a list of outputs. I recall this could happen if absolute pathnames are not provided, so I updated my mappers.sh scripts with absolute pathnames including a double // in the beginning without success. The run log do not show any specific indicators of error other than the above message. I see a bunch of CDM_POLICY CDM_ACTION lines in the wrapper.log in one of the many jobdirs as follows: CDM_POLICY: /home/train07/ketan_mars/swift/result52/mars.ot48 -> DEFAULT / CDM_ACTION: /home/train07/ketan_mars/swift/swift.workdir/mars-20121023-1240-vbptd8i9/jobs/g/mars-gtln0yzk OUTPUT /home/train07/ketan_mars/swift/result52/mars.ot48 DEFAULT / Not sure if something could've gone wrong here. Attaching the log file and one of the job dirs. Regards, Ketan On Tue, Oct 23, 2012 at 3:02 PM, Ketan Maheshwari < ketancmaheshwari at gmail.com> wrote: > Mike, > > Thank you for your answers. > > I tried catsnsleep with n=100 and s=10 and indeed the number of parallel > jobs corresponded to the jobthrottle value. > Surprisingly, when I started the mars application immediately after this, > it also started 32 jobs in parallel. However, the run failed with "too many > open files" error after a while. > > Now, I am trying cdm method. Will keep you posted. > > > On Tue, Oct 23, 2012 at 2:36 PM, Michael Wilde wrote: > >> Ketan, looking further I see that your app has a large number of output >> files, O(100). Depending on their size, and the speed of the filesystem on >> which you are testing, that re-inforces my suspicion that low concurrency >> you are seeing is due to staging IO. >> >> If this is a local 32-core host, try running with your input and output >> data and workdirectory all on a local hard disk (or even /dev/shm if it has >> sufficient RAM/space). Then try using CDM direct as explained at: >> >> >> http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases >> >> - Mike >> >> ----- Original Message ----- >> > From: "Michael Wilde" >> > To: "Ketan Maheshwari" >> > Cc: "Swift Devel" >> > Sent: Tuesday, October 23, 2012 12:23:34 PM >> > Subject: Re: [Swift-devel] jobthrottle value does not correspond to >> number of parallel jobs on local provider >> > Hi Ketan, >> > >> > In the log you attached I see this: >> > >> > 0.10 >> > 100000 >> > >> > You should leave initialScore constant, and set to a large number, no >> > matter what level of manual throttling you want to specify via >> > sites.xml. We always use 10000 for this value. Don't attempt to vary >> > the initialScore value for manual throttle: just use jobThrottle to >> > set what you want. >> > >> > A jobThrottle value of 0.10 should run 11 jobs in parallel >> > (jobThrottle * 100) + 1 (for historical reasons related to the >> > automatic throttling algorithm). 
>> > >> > If you are seeing less than that, one common cause is that the ratio >> > of your input staging times to your job run times is so high as to >> > make it impossible for Swift to keep the expected/desired number of >> > jobs in active state at once. >> > >> > I suggest you test the throttle behavior with a simple app script like >> > "catsnsleep" (catsn with an artificial sleep to increase job >> > duration). If your settings (sites + cf) work for that test, then they >> > should work for the real app, within the staging constraints. Using >> > CDM "direct" mode is likely what you want here to eliminate >> > unnecessary staging on a local cluster. >> > >> > In your test, what was this ratio? Can you also post your cf file and >> > the progress log from stdout/stderr? >> > >> > - Mike >> > >> > ----- Original Message ----- >> > > From: "Ketan Maheshwari" >> > > To: "Swift Devel" >> > > Sent: Tuesday, October 23, 2012 10:34:25 AM >> > > Subject: [Swift-devel] jobthrottle value does not correspond to >> > > number of parallel jobs on local provider >> > > Hi, >> > > >> > > >> > > I am trying to run an experiment on a 32-core machine with the hope >> > > of >> > > running 8, 16, 24 and 32 jobs in parallel. I am trying to control >> > > these numbers of parallel jobs by setting the Karajan jobthrottle >> > > values in sites.xml to 0.07, 0.15, and so on. >> > > >> > > >> > > However, it seems that the values are not corresponding to what I >> > > see >> > > in the Swift progress text. >> > > >> > > >> > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in >> > > parallel. Then I added the line setting "Initialscore" value to >> > > 10000, >> > > which improved the jobs to 5. After this a 10-fold increase in >> > > "initialscore" did not improve the jobs count. >> > > >> > > >> > > Furthermore, a new batch of 5 jobs get started only when *all* jobs >> > > from the old batch are over as opposed to a continuous supply of >> > > jobs >> > > from "site selection" to "stage out" state which happens in the case >> > > of coaster and other providers. >> > > >> > > >> > > The behavior is same in Swift 0.93.1 and latest trunk. >> > > >> > > >> > > >> > > Thank you for any clues on how to set the expected number of >> > > parallel >> > > jobs to these values. >> > > >> > > >> > > Please find attached one such log of this run. >> > > Thanks, -- >> > > Ketan >> > > >> > > >> > > >> > > _______________________________________________ >> > > Swift-devel mailing list >> > > Swift-devel at ci.uchicago.edu >> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> > >> > -- >> > Michael Wilde >> > Computation Institute, University of Chicago >> > Mathematics and Computer Science Division >> > Argonne National Laboratory >> > >> > _______________________________________________ >> > Swift-devel mailing list >> > Swift-devel at ci.uchicago.edu >> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> > > > -- > Ketan > > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
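The two policy files at issue differ in a single word. A rule of the form

  rule .* DEFAULT /

leaves the matched files on the normal staging path through the work directory, while

  rule .* DIRECT /

has Swift read and write the matched files in place, which is also why DIRECT expects the mapped files to be given absolute paths.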
Name: mars-debug.tgz Type: application/x-gzip Size: 101159 bytes Desc: not available URL: From wilde at mcs.anl.gov Tue Oct 23 15:14:41 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 23 Oct 2012 15:14:41 -0500 (CDT) Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: Message-ID: <199523820.60743.1351023281527.JavaMail.root@zimbra.anl.gov> > Now trying with cdm. My cdm policy file contains a single line as > follows: > > > rule .* DEFAULT / Change DEFAULT to DIRECT Look at the example at: http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases sec 20.5.1 and 20.5.2 Its best to test this first with simple "catsn-like" examples before you try your science app with it, to make sure that the direct processing is behaving as you expect. - Mike From ketancmaheshwari at gmail.com Tue Oct 23 15:23:16 2012 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Tue, 23 Oct 2012 16:23:16 -0400 Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: <199523820.60743.1351023281527.JavaMail.root@zimbra.anl.gov> References: <199523820.60743.1351023281527.JavaMail.root@zimbra.anl.gov> Message-ID: I tried CDM DIRECT policy with catsnsleep which gives the following: CDM[DIRECT]: Linking to /home/train07/ketan_mars/swift/data.txt failed! the file exists as mentioned in the path above. When specifying relative path to data.txt, I get /data.txt doesn't exist, presumably because cdm DIRECT assumes paths to be absolute. Log attached. On Tue, Oct 23, 2012 at 4:14 PM, Michael Wilde wrote: > > Now trying with cdm. My cdm policy file contains a single line as > > follows: > > > > > > rule .* DEFAULT / > > Change DEFAULT to DIRECT > > Look at the example at: > > > http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases > > sec 20.5.1 and 20.5.2 > > Its best to test this first with simple "catsn-like" examples before you > try your science app with it, to make sure that the direct processing is > behaving as you expect. > > - Mike > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: catsnsleep-20121023-1317-cd9bgebg.log Type: application/octet-stream Size: 22557 bytes Desc: not available URL: From wilde at mcs.anl.gov Tue Oct 23 16:00:31 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 23 Oct 2012 16:00:31 -0500 (CDT) Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: Message-ID: <1301166271.60909.1351026031769.JavaMail.root@zimbra.anl.gov> Ketan, I tested this with a recent trunk rev and it seems to work. Can you try to replicate the test below and proceed from there to make your specific app pattern work? This was run on pads. 
- Mike login2$ swift -config cf -cdm.file direct -tc.file tc -sites.file sites.xml catsndirect.swift -n=1 Swift trunk swift-r5920 cog-r3471 (cog modified locally) RunID: 20121023-1554-vese3gh6 Progress: time: Tue, 23 Oct 2012 15:54:47 -0500 Final status: Tue, 23 Oct 2012 15:54:47 -0500 Finished successfully:1 login2$ cat catsndirect.swift type file; app (file o) cat (file i) { cat @i stdout=@o; } file out[]; foreach j in [1:@toint(@arg("n","1"))] { file data<"/tmp/wilde/indir/data.txt">; out[j] = cat(data); } login2$ cat cf wrapperlog.always.transfer=true sitedir.keep=true execution.retries=0 lazy.errors=false status.mode=provider use.provider.staging=false provider.staging.pin.swiftfiles=false use.wrapper.staging=false login2$ cat direct rule .* DIRECT / login2$ cat sites.xml /scratch/local/wilde/pstest/swiftwork login2$ ----- Original Message ----- > From: "Ketan Maheshwari" > To: "Michael Wilde" > Cc: "Swift Devel" > Sent: Tuesday, October 23, 2012 3:23:16 PM > Subject: Re: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider > I tried CDM DIRECT policy with catsnsleep which gives the following: > > > CDM[DIRECT]: Linking to /home/train07/ketan_mars/swift/data.txt > failed! > > > the file exists as mentioned in the path above. > > > When specifying relative path to data.txt, I get /data.txt doesn't > exist, presumably because cdm DIRECT assumes paths to be absolute. > > > Log attached. > > > On Tue, Oct 23, 2012 at 4:14 PM, Michael Wilde < wilde at mcs.anl.gov > > wrote: > > > > > Now trying with cdm. My cdm policy file contains a single line as > > follows: > > > > > > rule .* DEFAULT / > > Change DEFAULT to DIRECT > > Look at the example at: > > http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases > > sec 20.5.1 and 20.5.2 > > Its best to test this first with simple "catsn-like" examples before > you try your science app with it, to make sure that the direct > processing is behaving as you expect. > > - Mike > > > > > -- > Ketan -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Oct 23 18:14:02 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 23 Oct 2012 18:14:02 -0500 (CDT) Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: Message-ID: <1605523736.61205.1351034042528.JavaMail.root@zimbra.anl.gov> I just noticed your mention here of a "too many open files" problem. Can you tell me what "ulimit -n" (max # of open files) reports for your system? Can you alter your app script to return the 100+ files in a tarball instead of individually? What may be happening here is: - if you have low -n limit (eg 1024) and - you are using provider staging, meaning the swift or coaster service jvm will be writing the final output files directly and - you are writing 32 jobs x 100 files files concurrently then -> you will exceed your limit of open files. Just a hypothesis - you'll need to dig deeper and see if you can extend the ulimit for -n. - Mike ----- Original Message ----- > From: "Ketan Maheshwari" > To: "Michael Wilde" > Cc: "Swift Devel" > Sent: Tuesday, October 23, 2012 2:02:15 PM > Subject: Re: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider > Mike, > > > Thank you for your answers. 
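As a quick check along the lines asked about above (the numbers are illustrative, not taken from any of these runs):

  $ ulimit -n           # soft limit on open files for this shell
  1024
  $ ulimit -Hn          # hard limit
  4096
  $ ulimit -n 4096      # raise the soft limit up to the hard limit

With 32 concurrent jobs each returning on the order of 100 output files through provider staging, a 1024-file limit would be easy to exceed, which is consistent with the hypothesis above.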
> > > I tried catsnsleep with n=100 and s=10 and indeed the number of > parallel jobs corresponded to the jobthrottle value. > Surprisingly, when I started the mars application immediately after > this, it also started 32 jobs in parallel. However, the run failed > with "too many open files" error after a while. > > > Now, I am trying cdm method. Will keep you posted. > > > On Tue, Oct 23, 2012 at 2:36 PM, Michael Wilde < wilde at mcs.anl.gov > > wrote: > > > Ketan, looking further I see that your app has a large number of > output files, O(100). Depending on their size, and the speed of the > filesystem on which you are testing, that re-inforces my suspicion > that low concurrency you are seeing is due to staging IO. > > If this is a local 32-core host, try running with your input and > output data and workdirectory all on a local hard disk (or even > /dev/shm if it has sufficient RAM/space). Then try using CDM direct as > explained at: > > http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases > > > - Mike > > ----- Original Message ----- > > > > From: "Michael Wilde" < wilde at mcs.anl.gov > > > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu > > > Sent: Tuesday, October 23, 2012 12:23:34 PM > > Subject: Re: [Swift-devel] jobthrottle value does not correspond to > > number of parallel jobs on local provider > > Hi Ketan, > > > > In the log you attached I see this: > > > > 0.10 > > 100000 > > > > You should leave initialScore constant, and set to a large number, > > no > > matter what level of manual throttling you want to specify via > > sites.xml. We always use 10000 for this value. Don't attempt to vary > > the initialScore value for manual throttle: just use jobThrottle to > > set what you want. > > > > A jobThrottle value of 0.10 should run 11 jobs in parallel > > (jobThrottle * 100) + 1 (for historical reasons related to the > > automatic throttling algorithm). > > > > If you are seeing less than that, one common cause is that the ratio > > of your input staging times to your job run times is so high as to > > make it impossible for Swift to keep the expected/desired number of > > jobs in active state at once. > > > > I suggest you test the throttle behavior with a simple app script > > like > > "catsnsleep" (catsn with an artificial sleep to increase job > > duration). If your settings (sites + cf) work for that test, then > > they > > should work for the real app, within the staging constraints. Using > > CDM "direct" mode is likely what you want here to eliminate > > unnecessary staging on a local cluster. > > > > In your test, what was this ratio? Can you also post your cf file > > and > > the progress log from stdout/stderr? > > > > - Mike > > > > ----- Original Message ----- > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com > > > > To: "Swift Devel" < swift-devel at ci.uchicago.edu > > > > Sent: Tuesday, October 23, 2012 10:34:25 AM > > > Subject: [Swift-devel] jobthrottle value does not correspond to > > > number of parallel jobs on local provider > > > Hi, > > > > > > > > > I am trying to run an experiment on a 32-core machine with the > > > hope > > > of > > > running 8, 16, 24 and 32 jobs in parallel. I am trying to control > > > these numbers of parallel jobs by setting the Karajan jobthrottle > > > values in sites.xml to 0.07, 0.15, and so on. 
> > > > > > > > > However, it seems that the values are not corresponding to what I > > > see > > > in the Swift progress text. > > > > > > > > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in > > > parallel. Then I added the line setting "Initialscore" value to > > > 10000, > > > which improved the jobs to 5. After this a 10-fold increase in > > > "initialscore" did not improve the jobs count. > > > > > > > > > Furthermore, a new batch of 5 jobs get started only when *all* > > > jobs > > > from the old batch are over as opposed to a continuous supply of > > > jobs > > > from "site selection" to "stage out" state which happens in the > > > case > > > of coaster and other providers. > > > > > > > > > The behavior is same in Swift 0.93.1 and latest trunk. > > > > > > > > > > > > Thank you for any clues on how to set the expected number of > > > parallel > > > jobs to these values. > > > > > > > > > Please find attached one such log of this run. > > > Thanks, -- > > > Ketan > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > > > > -- > Ketan -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From ketancmaheshwari at gmail.com Wed Oct 24 08:25:47 2012 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Wed, 24 Oct 2012 09:25:47 -0400 Subject: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider In-Reply-To: <1605523736.61205.1351034042528.JavaMail.root@zimbra.anl.gov> References: <1605523736.61205.1351034042528.JavaMail.root@zimbra.anl.gov> Message-ID: Hi Mike, Seems it is resolved now. There were multiple issues: In my config file use provider staging was set to true and in sites file staging method was set to file. This was conflicting with cdm link creation because the file with link name was already present. This was resolved by setting the above option to false and removing the staging method line from sites.xml Turns out that Mars only works when the licence file is present in the same dir as data. It does not like licence file symlinked for some reason. So, it had to be excluded from getting cdm'd. I use individual patterns to cdm inputs. In one of the configuration, where I set all my output file mappings to absolute paths in source swift script as well as mappers.sh, I was getting falsely successful jobs: swift did not complain but only blank output files were touch'd (by cdm?). It complained in the end when the files were not found to the last job which accepts them as input. Another issue was with the workdir in my sites.xml. It was a relative path in mine whereas was absolute path in your case. Swift complained with exit status 127 in my case and worked when I provide absolute path. I am not sure if this was trunk or 0.93.1. I will check again. 
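Putting those pieces together, the working combination amounts to provider staging switched off in the Swift configuration file,

  use.provider.staging=false

an absolute path in the <workdirectory> element of sites.xml, and a CDM file whose pattern matches only the data files, for example

  rule .*\.dat DIRECT /

so that anything the application insists on finding as a real local copy, such as the licence file, falls outside the rule and is staged in the normal way. The property name is the one from the cf listing earlier in the thread; the .dat pattern is only an illustration, not the rule actually used here.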
In an earlier issue where I mentioned Swift not starting the number of parallel jobs for local provider corresponding to the jobthrottle value, I observe that indeed this is true for the local provider but does not seem to be true when using coasters *locally*. Consequently, I tried both approaches on a 32-core machine and found that in the case of coaster provider the performance was better compared to the local provider *with* CDM (Although only the inputs were cdm'd: 7M per job). Here are the results for different throttle values (intended to use different number of cpus) with coasters: 8 cores -- 13m 25sec 16 cores -- 12m 40sec 24 cores -- 10m 51sec 32 cores -- 10m 57sec With local provider, some inputs cdm'd: 8 cores -- 15m 8sec 16 cores -- 12m 4sec 24 cores -- 12m 37sec 32 cores -- 11m 39sec It looks like coaster provider does not take the datamovement to jobs ratio into account and in this case it turns out to be faster. I observe that local provider starts with a much less number of jobs and slowly picks up with more jobs and reached the peak intended number almost always after 25% of jobs completes. Regards, Ketan On Tue, Oct 23, 2012 at 7:14 PM, Michael Wilde wrote: > I just noticed your mention here of a "too many open files" problem. > > Can you tell me what "ulimit -n" (max # of open files) reports for your > system? > > Can you alter your app script to return the 100+ files in a tarball > instead of individually? > > What may be happening here is: > > - if you have low -n limit (eg 1024) and > > - you are using provider staging, meaning the swift or coaster service jvm > will be writing the final output files directly and > > - you are writing 32 jobs x 100 files files concurrently then > > -> you will exceed your limit of open files. > > Just a hypothesis - you'll need to dig deeper and see if you can extend > the ulimit for -n. > > - Mike > > ----- Original Message ----- > > From: "Ketan Maheshwari" > > To: "Michael Wilde" > > Cc: "Swift Devel" > > Sent: Tuesday, October 23, 2012 2:02:15 PM > > Subject: Re: [Swift-devel] jobthrottle value does not correspond to > number of parallel jobs on local provider > > Mike, > > > > > > Thank you for your answers. > > > > > > I tried catsnsleep with n=100 and s=10 and indeed the number of > > parallel jobs corresponded to the jobthrottle value. > > Surprisingly, when I started the mars application immediately after > > this, it also started 32 jobs in parallel. However, the run failed > > with "too many open files" error after a while. > > > > > > Now, I am trying cdm method. Will keep you posted. > > > > > > On Tue, Oct 23, 2012 at 2:36 PM, Michael Wilde < wilde at mcs.anl.gov > > > wrote: > > > > > > Ketan, looking further I see that your app has a large number of > > output files, O(100). Depending on their size, and the speed of the > > filesystem on which you are testing, that re-inforces my suspicion > > that low concurrency you are seeing is due to staging IO. > > > > If this is a local 32-core host, try running with your input and > > output data and workdirectory all on a local hard disk (or even > > /dev/shm if it has sufficient RAM/space). 
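For comparison, the "coasters locally" runs referred to above are normally expressed with a coaster pool in the same sites.xml, along these lines. This is a sketch only: the handle, worker count, throttle and path are placeholders rather than the settings behind the timings reported here:

  <pool handle="localhost-coasters">
    <execution provider="coaster" jobmanager="local:local"/>
    <profile namespace="globus" key="workersPerNode">32</profile>
    <profile namespace="karajan" key="jobThrottle">0.31</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <filesystem provider="local"/>
    <workdirectory>/path/to/swift.workdir</workdirectory>
  </pool>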
Then try using CDM direct as > > explained at: > > > > > http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases > > > > > > - Mike > > > > ----- Original Message ----- > > > > > > > From: "Michael Wilde" < wilde at mcs.anl.gov > > > > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu > > > > Sent: Tuesday, October 23, 2012 12:23:34 PM > > > Subject: Re: [Swift-devel] jobthrottle value does not correspond to > > > number of parallel jobs on local provider > > > Hi Ketan, > > > > > > In the log you attached I see this: > > > > > > 0.10 > > > 100000 > > > > > > You should leave initialScore constant, and set to a large number, > > > no > > > matter what level of manual throttling you want to specify via > > > sites.xml. We always use 10000 for this value. Don't attempt to vary > > > the initialScore value for manual throttle: just use jobThrottle to > > > set what you want. > > > > > > A jobThrottle value of 0.10 should run 11 jobs in parallel > > > (jobThrottle * 100) + 1 (for historical reasons related to the > > > automatic throttling algorithm). > > > > > > If you are seeing less than that, one common cause is that the ratio > > > of your input staging times to your job run times is so high as to > > > make it impossible for Swift to keep the expected/desired number of > > > jobs in active state at once. > > > > > > I suggest you test the throttle behavior with a simple app script > > > like > > > "catsnsleep" (catsn with an artificial sleep to increase job > > > duration). If your settings (sites + cf) work for that test, then > > > they > > > should work for the real app, within the staging constraints. Using > > > CDM "direct" mode is likely what you want here to eliminate > > > unnecessary staging on a local cluster. > > > > > > In your test, what was this ratio? Can you also post your cf file > > > and > > > the progress log from stdout/stderr? > > > > > > - Mike > > > > > > ----- Original Message ----- > > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com > > > > > To: "Swift Devel" < swift-devel at ci.uchicago.edu > > > > > Sent: Tuesday, October 23, 2012 10:34:25 AM > > > > Subject: [Swift-devel] jobthrottle value does not correspond to > > > > number of parallel jobs on local provider > > > > Hi, > > > > > > > > > > > > I am trying to run an experiment on a 32-core machine with the > > > > hope > > > > of > > > > running 8, 16, 24 and 32 jobs in parallel. I am trying to control > > > > these numbers of parallel jobs by setting the Karajan jobthrottle > > > > values in sites.xml to 0.07, 0.15, and so on. > > > > > > > > > > > > However, it seems that the values are not corresponding to what I > > > > see > > > > in the Swift progress text. > > > > > > > > > > > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in > > > > parallel. Then I added the line setting "Initialscore" value to > > > > 10000, > > > > which improved the jobs to 5. After this a 10-fold increase in > > > > "initialscore" did not improve the jobs count. > > > > > > > > > > > > Furthermore, a new batch of 5 jobs get started only when *all* > > > > jobs > > > > from the old batch are over as opposed to a continuous supply of > > > > jobs > > > > from "site selection" to "stage out" state which happens in the > > > > case > > > > of coaster and other providers. > > > > > > > > > > > > The behavior is same in Swift 0.93.1 and latest trunk. 
> > > > > > > > > > > > > > > > Thank you for any clues on how to set the expected number of > > > > parallel > > > > jobs to these values. > > > > > > > > > > > > Please find attached one such log of this run. > > > > Thanks, -- > > > > Ketan > > > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > > > > > > > > > -- > > Ketan > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From lpesce at uchicago.edu Thu Oct 25 10:05:52 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Thu, 25 Oct 2012 10:05:52 -0500 Subject: [Swift-devel] Experiments on Beagle Message-ID: <7E532B0D-81A5-4DBA-A688-4F4A5D61F7C8@uchicago.edu> Hi -- I am running on more than 10,000 cores because there are a good number of users having problems running their jobs, which left the machine for me =) I am doing work for a user, so don't worry too much. I am just writing in case you want to take a look at how the simulations are proceeding, how the memory is used (login5, user lpesce) and how the number of tasks goes up and down as jobs are completed. (For example, I asked for 500 nodes and I am getting only a little over 400, but the machine has available the nodes I asked for, swift is just not taking them as far as I can tell, the number first spiked up to the requested number of nodes --I think-- then winded down) From wilde at mcs.anl.gov Thu Oct 25 10:22:28 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 25 Oct 2012 10:22:28 -0500 (CDT) Subject: [Swift-devel] Experiments on Beagle In-Reply-To: <7E532B0D-81A5-4DBA-A688-4F4A5D61F7C8@uchicago.edu> Message-ID: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> Lorenzo, 10K cores sounds great. Regarding not using all the nodes: I have seen that on Cray test runs, but only at >16K cores. Its also possible that one or more throttle settings are holding back your runs. Can you point us to the run directory where we can watch your log file and see your config files and your script? - Mike ----- Original Message ----- > From: "Lorenzo Pesce" > To: "swift-devel Devel" > Sent: Thursday, October 25, 2012 10:05:52 AM > Subject: [Swift-devel] Experiments on Beagle > Hi -- > I am running on more than 10,000 cores because there are a good number > of users having problems running their jobs, which left the machine > for me =) > > I am doing work for a user, so don't worry too much. > > I am just writing in case you want to take a look at how the > simulations are proceeding, how the memory is used (login5, user > lpesce) and how the number of tasks goes up and down as jobs are > completed. 
> (For example, I asked for 500 nodes and I am getting only a little > over 400, but the machine has available the nodes I asked for, swift > is just not taking them as far as I can tell, the number first spiked > up to the requested number of nodes --I think-- then winded down) > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From lpesce at uchicago.edu Thu Oct 25 10:25:35 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Thu, 25 Oct 2012 10:25:35 -0500 Subject: [Swift-devel] Experiments on Beagle In-Reply-To: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> References: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> Message-ID: <6A52B499-3249-4CBD-A7EF-DB369C3A690D@uchicago.edu> Sorry for the lapse: /lustre/beagle/GCNet/RG/Oreo/o080522_BS1 On Oct 25, 2012, at 10:22 AM, Michael Wilde wrote: > Lorenzo, 10K cores sounds great. > > Regarding not using all the nodes: I have seen that on Cray test runs, but only at >16K cores. Its also possible that one or more throttle settings are holding back your runs. > > Can you point us to the run directory where we can watch your log file and see your config files and your script? > > - Mike > > > ----- Original Message ----- >> From: "Lorenzo Pesce" >> To: "swift-devel Devel" >> Sent: Thursday, October 25, 2012 10:05:52 AM >> Subject: [Swift-devel] Experiments on Beagle >> Hi -- >> I am running on more than 10,000 cores because there are a good number >> of users having problems running their jobs, which left the machine >> for me =) >> >> I am doing work for a user, so don't worry too much. >> >> I am just writing in case you want to take a look at how the >> simulations are proceeding, how the memory is used (login5, user >> lpesce) and how the number of tasks goes up and down as jobs are >> completed. >> (For example, I asked for 500 nodes and I am getting only a little >> over 400, but the machine has available the nodes I asked for, swift >> is just not taking them as far as I can tell, the number first spiked >> up to the requested number of nodes --I think-- then winded down) >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From lpesce at uchicago.edu Thu Oct 25 10:50:04 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Thu, 25 Oct 2012 10:50:04 -0500 Subject: [Swift-devel] Experiments on Beagle In-Reply-To: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> References: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> Message-ID: Run started to falter and I killed it. I resent it out capturing the screen to *.screenlog (lots of failed wrappers, it might be the app itself or the system, I don't know). I hope now I am capturing all the info. On Oct 25, 2012, at 10:22 AM, Michael Wilde wrote: > Lorenzo, 10K cores sounds great. > > Regarding not using all the nodes: I have seen that on Cray test runs, but only at >16K cores. Its also possible that one or more throttle settings are holding back your runs. 
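For the throttle-settings possibility, the knobs that most often cap concurrency live in the Swift configuration (swift.properties, or the cf file passed to swift) and, for coasters, in the slots/nodes profile entries in sites.xml. The snippet below is a generic sketch with illustrative values, not settings read from the run directory above:

    # Throttles that commonly limit how many jobs Swift keeps in flight;
    # raise them if the log shows jobs waiting while workers sit idle.
    # Cap on concurrent foreach iterations:
    foreach.max.threads=1000
    # Job submissions in flight, overall and per host:
    throttle.submit=100
    throttle.host.submit=100
    # Global job throttle; the per-site jobThrottle profile in sites.xml plays the analogous role:
    throttle.score.job.factor=1000
    # Concurrent file transfers and file operations:
    throttle.transfers=16
    throttle.file.operations=16

Comparing values like these against the requested node count is usually faster than waiting for another run to wind down.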
> > Can you point us to the run directory where we can watch your log file and see your config files and your script? > > - Mike > > > ----- Original Message ----- >> From: "Lorenzo Pesce" >> To: "swift-devel Devel" >> Sent: Thursday, October 25, 2012 10:05:52 AM >> Subject: [Swift-devel] Experiments on Beagle >> Hi -- >> I am running on more than 10,000 cores because there are a good number >> of users having problems running their jobs, which left the machine >> for me =) >> >> I am doing work for a user, so don't worry too much. >> >> I am just writing in case you want to take a look at how the >> simulations are proceeding, how the memory is used (login5, user >> lpesce) and how the number of tasks goes up and down as jobs are >> completed. >> (For example, I asked for 500 nodes and I am getting only a little >> over 400, but the machine has available the nodes I asked for, swift >> is just not taking them as far as I can tell, the number first spiked >> up to the requested number of nodes --I think-- then winded down) >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From lpesce at uchicago.edu Thu Oct 25 11:33:54 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Thu, 25 Oct 2012 11:33:54 -0500 Subject: [Swift-devel] Experiments on Beagle In-Reply-To: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> References: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> Message-ID: <748781B7-1E2A-4B9B-9EF7-DF9250C590AE@uchicago.edu> Running smoothly on 500 nodes exactly (12,000 jobs). So far the problems of explosions hasn't materialized yet. On Oct 25, 2012, at 10:22 AM, Michael Wilde wrote: > Lorenzo, 10K cores sounds great. > > Regarding not using all the nodes: I have seen that on Cray test runs, but only at >16K cores. Its also possible that one or more throttle settings are holding back your runs. > > Can you point us to the run directory where we can watch your log file and see your config files and your script? > > - Mike > > > ----- Original Message ----- >> From: "Lorenzo Pesce" >> To: "swift-devel Devel" >> Sent: Thursday, October 25, 2012 10:05:52 AM >> Subject: [Swift-devel] Experiments on Beagle >> Hi -- >> I am running on more than 10,000 cores because there are a good number >> of users having problems running their jobs, which left the machine >> for me =) >> >> I am doing work for a user, so don't worry too much. >> >> I am just writing in case you want to take a look at how the >> simulations are proceeding, how the memory is used (login5, user >> lpesce) and how the number of tasks goes up and down as jobs are >> completed. 
>> (For example, I asked for 500 nodes and I am getting only a little >> over 400, but the machine has available the nodes I asked for, swift >> is just not taking them as far as I can tell, the number first spiked >> up to the requested number of nodes --I think-- then winded down) >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From lpesce at uchicago.edu Thu Oct 25 11:52:17 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Thu, 25 Oct 2012 11:52:17 -0500 Subject: [Swift-devel] Experiments on Beagle In-Reply-To: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> References: <964396745.64603.1351178548304.JavaMail.root@zimbra.anl.gov> Message-ID: <4A4AF125-7A97-49AE-B321-DB8B61D3D00C@uchicago.edu> Failures are starting to appear and dominate the results. I am letting it run anyway so you get a chance to look at it live. On Oct 25, 2012, at 10:22 AM, Michael Wilde wrote: > Lorenzo, 10K cores sounds great. > > Regarding not using all the nodes: I have seen that on Cray test runs, but only at >16K cores. Its also possible that one or more throttle settings are holding back your runs. > > Can you point us to the run directory where we can watch your log file and see your config files and your script? > > - Mike > > > ----- Original Message ----- >> From: "Lorenzo Pesce" >> To: "swift-devel Devel" >> Sent: Thursday, October 25, 2012 10:05:52 AM >> Subject: [Swift-devel] Experiments on Beagle >> Hi -- >> I am running on more than 10,000 cores because there are a good number >> of users having problems running their jobs, which left the machine >> for me =) >> >> I am doing work for a user, so don't worry too much. >> >> I am just writing in case you want to take a look at how the >> simulations are proceeding, how the memory is used (login5, user >> lpesce) and how the number of tasks goes up and down as jobs are >> completed. >> (For example, I asked for 500 nodes and I am getting only a little >> over 400, but the machine has available the nodes I asked for, swift >> is just not taking them as far as I can tell, the number first spiked >> up to the requested number of nodes --I think-- then winded down) >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From iraicu at cs.iit.edu Fri Oct 26 12:43:40 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Fri, 26 Oct 2012 13:43:40 -0400 Subject: [Swift-devel] Call for Workshops: ACM HPDC 2013 -- deadline extended to November 1, 2012 Message-ID: <508ACBCC.706@cs.iit.edu> Call for Workshops The organizers of the /22nd International ACM Symposium on High-Performance Parallel and Distributed Computing/ (HPDC'13) *call for proposals for workshops* to be held with HPDC'13. The workshops will be held on June 17-18, 2013. Workshops should provide forums for discussion among researchers and practitioners on focused topics or emerging research areas relevant to the HPDC community. 
Organizers may structure workshops as they see fit, including invited talks, panel discussions, presentations of work in progress, fully peer-reviewed papers, or some combination. Workshops could be scheduled for half a day or a full day, depending on interest, space constraints, and organizer preference. Organizers should design workshops for approximately 20-40 participants, to balance impact and effective discussion. *Workshop proposals* must be sent in PDF format to the HPDC'13 Workshops Chair, Abhishek Chandra (Email: chandra AT cs DOT umn DOT edu ) with the subject line *"HPDC 2013 Workshop Proposal"*, and should include: * The name and acronym of the workshop * A description (0.5-1 page) of the theme of the workshop * A description (one paragraph) of the relation between the theme of the workshop and of HPDC * A list of topics of interest * The names and affiliations of the workshop organizers, and if applicable, of a significant portion of the program committee * A description of the expected structure of the workshop (papers, invited talks, panel discussions, etc.) * Data about previous offerings of the workshop (if any), including the attendance, the numbers of papers or presentations submitted and accepted, and the links to the corresponding websites * A publicity plan for attracting submissions and attendees. Please also include expected number of submissions, accepted papers, and attendees that you anticipate for a successful workshop. Due to publication deadlines, workshops must operate within roughly the following timeline: papers due mid February (2-3 weeks after the HPDC deadline), and selected and sent to the publisher by mid April. Important dates: *Workshop Proposals Due: * *November 1, 2012* Notifications: November 7, 2012 Workshop CFPs Online and Distributed: November 25, 2012 -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.iit.edu Fri Oct 26 13:13:51 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Fri, 26 Oct 2012 14:13:51 -0400 Subject: [Swift-devel] CFP: 22nd Int. ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC'13) Message-ID: <508AD2DF.8060609@cs.iit.edu> **** CALL FOR PAPERS **** The 22nd International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC'13) New York City, USA - June 17-21, 2013 http://www.hpdc.org/2013 The ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) is the premier annual conference for presenting the latest research on the design, implementation, evaluation, and the use of parallel and distributed systems for high-end computing. In 2013, the 22nd HPDC and affiliated workshops will take place in the heart of iconic New York City from June 17-21. 
**** SUBMISSION DEADLINES **** Abstracts: 14 January 2013 Papers: 21 January 2013 (no extensions) **** HPDC'13 GENERAL CO-CHAIRS **** Manish Parashar, Rutgers University Jon Weissman, University of Minnesota **** HPDC'13 PROGRAM CO-CHAIRS **** Dick Epema, Delft University of Technology Renato Figueiredo, University of Florida **** HPDC'13 WORKSHOPS CHAIR **** Abhishek Chandra, University of Minnesota **** HPDC'13 LOCAL ARRANGEMENTS CHAIR **** Daniele Scarpazza, D.E. Shaw Research **** HPDC'13 SPONSORSHIP CHAIR **** Dean Hildebrand, IBM Almaden **** HPDC'13 PUBLICITY CO-CHAIRS **** Alexandru Iosup, Delft University of Technology, the Netherlands Ioan Raicu, Illinois Institute of Technology, USA Kenjiro Taura, University of Tokyo, Japan Bruno Schulze, National Laboratory for Scientific Computing, Brazil **** SCOPE AND TOPICS **** Submissions are welcomed on high-performance parallel and distributed computing topics including but not limited to: clusters, clouds, grids, data-intensive computing, massively multicore, and global-scale computing systems. New scholarly research showing empirical and reproducible results in architectures, systems, and networks is strongly encouraged, as are experience reports of operational deployments that can provide insights for future research on HPDC applications and systems. All papers will be evaluated for their originality, technical depth and correctness, potential impact, relevance to the conference, and quality of presentation. Research papers must clearly demonstrate research contributions and novelty, while experience reports must clearly describe lessons learned and demonstrate impact. In the context of high-performance parallel and distributed computing, the topics of interest include, but are not limited to: * Systems, networks, and architectures for high-end computing * Massively multicore systems * Resource virtualization * Programming languages and environments * I/O, storage systems, and data management * Resource management and scheduling, including energy-aware techniques * Performance modeling and analysis * Fault tolerance, reliability, and availability * Data-intensive computing * Applications of parallel and distributed computing **** PAPER SUBMISSION GUIDELINES **** Authors are invited to submit technical papers of at most 12 pages in PDF format, including figures and references. Papers should be formatted in the ACM Proceedings Style and submitted via the conference web site. No changes to the margins, spacing, or font sizes as specified by the style file are allowed. Accepted papers will appear in the conference proceedings, and will be incorporated into the ACM Digital Library. A limited number of papers will be accepted as posters. Papers must be self-contained and provide the technical substance required for the program committee to evaluate their contributions. Papers should thoughtfully address all related work, particularly work presented at previous HPDC events. Submitted papers must be original work that has not appeared in and is not under consideration for another conference or a journal. See the ACM Prior Publication Policy for more details.
**** IMPORTANT DATES **** Abstracts Due: 14 January 2013 Papers Due: 21 January 2013 (no extensions) **** Program Committee **** David Abramson, Monash University, Australia Kento Aida, National Institute of Informatics, Japan Gabriel Antoniu INRIA, France Henri Bal, Vrije Universiteit, the Netherlands Adam Barker, University of St Andrews, UK Michela Becchi, University of Missouri - Columbia, USA John Bent, EMC, USA Ali Butt, Virginia Tech, USA Kirk Cameron, Virginia Tech, USA Franck Cappello, INRIA, France and University of Illinois at Urbana-Champaign, USA Henri Casanova, University of Hawaii, USA Abhishek Chandra, University of Minnesota, USA Andrew Chien, University of Chicago and Argonne National Laboratory, USA Paolo Costa, Imperial College London, UK Peter Dinda, Northwestern University, USA Gilles Fedak, INRIA, France Ian Foster, University of Chicago and Argonne National Laboratory, USA Clemens Grelck, University of Amsterdam, the Netherlands Dean Hildebrand, IBM Research, USA Fabrice Huet, INRIA-University of Nice, France Adriana Iamnitchi, University of South Florida, USA Alexandru Iosup, Delft University of Technology, the Netherlands Kate Keahey, Argonne National Laboratory, USA Thilo Kielmann, Vrije Universiteit, the Netherlands Charles Kilian, Purdue University, USA Zhiling Lan, Illinois Institute of Technology, USA John Lange, University of Pittsburgh, USA Barney Maccabe, Oak Ridge National Laboratory, USA Carlos Maltzahn, University of California, Santa Cruz, USA Naoya Maruyama, RIKEN Advanced Institute for Computational Science, Japan Satoshi Matsuoka, Tokyo Institute of Technology, Japan Manish Parashar, Rutgers University, USA Judy Qiu, Indiana University, USA Ioan Raicu, Illinois Institute of Technology, USA Philip Rhodes, University of Mississippi, USA Matei Ripeanu, University of British Columbia, Canada Prasenjit Sarkar, IBM Research, USA Daniele Scarpazza, D.E. Shaw Research, USA Karsten Schwan, Georgia Institute of Technology, USA Martin Swany, Indiana University, USA Michela Taufer, University of Delaware, USA Kenjiro Taura, University of Tokyo, Japan Douglas Thain, University of Notre Dame, USA Cristian Ungureanu, NEC Research, USA Ana Varbanescu, Delft University of Technology, the Netherlands Chuliang Weng, Shanghai Jiao Tong University, China Jon Weissman, University of Minnesota, USA Yongwei Wu, Tsinghua University, China Dongyan Xu, Purdue University, USA Ming Zhao, Florida International University, USA **** Steering Committee **** Henri Bal, Vrije Universiteit, the Netherlands Andrew A. Chien, University of Chicago and Argonne National Laboratory, USA Peter Dinda, Northwestern University, USA Dick Epema, Delft University of Technology, the Netherlands Ian Foster, University of Chicago and Argonne National Laboratory, USA Salim Hariri, University of Arizona, USA Thilo Kielmann, Vrije Universiteit, the Netherlands Dieter Kranzlmueller, Ludwig-Maximilians-Universitaet Muenchen, Germany Arthur "Barney" Maccabe, Oak Ridge National Laboratory, USA Satoshi Matsuoka, Tokyo Institute of Technology, Japan Manish Parashar, Rutgers University, USA Matei Ripeanu, University of British Columbia, Canada Karsten Schwan, Georgia Tech, USA Doug Thain, University of Notre Dame, USA Jon Weissman, University of Minnesota (Chair), USA -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= From iraicu at cs.iit.edu Fri Oct 26 13:26:57 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Fri, 26 Oct 2012 14:26:57 -0400 Subject: [Swift-devel] CFP: The 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2013) Message-ID: <508AD5F1.2020301@cs.iit.edu> **** CALL FOR PAPERS **** The 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2013) Delft University of Technology, Delft, the Netherlands May 13-16, 2013 http://www.pds.ewi.tudelft.nl/ccgrid2013 Rapid advances in architectures, networks, and systems and middleware technologies are leading to new concepts in and platforms for computing, ranging from Clusters and Grids to Clouds and Datacenters. CCGrid is a series of very successful conferences, sponsored by the IEEE Computer Society Technical Committee on Scalable Computing (TCSC) and the ACM, with the overarching goal of bringing together international researchers, developers, and users to provide an international forum to present leading research activities and results on a broad range of topics related to these concepts and platforms, and their applications. The conference features keynotes, technical presentations, workshops, tutorials, and posters, as well as the SCALE challenge featuring live demonstrations. In 2013, CCGrid will come to the Netherlands for the first time, and will be held in Delft, a historical, picturesque city that is less than one hour away from Amsterdam-Schiphol airport. The main conference will be held on May 14-16 (Tuesday to Thursday), with tutorials and affiliated workshops taking place on May 13 (Monday). **** IMPORTANT DATES **** Papers Due: 12 November 2012 Author Notifications: 24 January 2013 Final Papers Due: 22 February 2013 **** TOPICS OF INTEREST **** CCGrid 2013 will have a focus on important and immediate issues that are significantly influencing all aspects of cluster, cloud and grid computing. 
Topics of interest include, but are not limited to: * Applications and Experiences: Applications to real and complex problems in science, engineering, business, and society; User studies; Experiences with large-scale deployments, systems, or applications * Architecture: System architectures, design and deployment; Power and cooling; Security and reliability; High availability solutions * Autonomic Computing and Cyberinfrastructure: Self-managed behavior, models and technologies; Autonomic paradigms and systems (control-based, bio-inspired, emergent, etc.); Bio-inspired optimizations and computing * Cloud Computing: Cloud architectures; Software tools and techniques for clouds * Multicore and Accelerator-based Computing: Software and application techniques to utilize multicore architectures and accelerators in clusters, grids, and clouds * Performance Modeling and Evaluation: Performance prediction and modeling; Monitoring and evaluation tools; Analysis of system and application performance; Benchmarks and testbeds * Programming Models, Systems, and Fault-Tolerant Computing: Programming models and environments for cluster, cloud, and grid computing; Fault-tolerant systems, programs and algorithms; Systems software to support efficient computing * Scheduling and Resource Management: Techniques to schedule jobs and resources on cluster, cloud, and grid computing platforms; SLA definition and enforcement **** PAPER SUBMISSION GUIDELINES **** Authors are invited to submit papers electronically in PDF format. Submitted manuscripts should be structured as technical papers and may not exceed 8 letter-size (8.5 x 11) pages including figures, tables and references using the IEEE format for conference proceedings. Submissions not conforming to these guidelines may be returned without review. Authors should make sure that their file will print on a printer that uses letter-size (8.5 x 11) paper. The official language of the conference is English. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding the page limit, or not appropriately structured may not be considered. Authors may contact the conference chairs for more information. The proceedings will be published through the IEEE Computer Society Press, USA, and will be made available online through the IEEE Digital Library. **** CALL FOR TUTORIAL AND WORKSHOP PROPOSALS **** Tutorials and workshops affiliated with CCGrid 2013 will be held on May 13 (Monday). For more information on the tutorials and workshops and for the complete Call for Tutorial and Workshop Proposals, please see the conference website. 
**** GENERAL CHAIR **** Dick Epema, Delft University of Technology, the Netherlands **** PROGRAM CHAIR **** Thomas Fahringer, University of Innsbruck, Austria **** PROGRAM VICE-CHAIRS **** Rosa Badia, Barcelona Supercomputing Center, Spain Henri Bal, Vrije Universiteit, the Netherlands Marios Dikaiakos, University of Cyprus, Cyprus Kirk Cameron, Virginia Tech, USA Daniel Katz, University of Chicago & Argonne Nat Lab, USA Kate Keahey, Argonne National Laboratory, USA Martin Schulz, Lawrence Livermore National Laboratory, USA Douglas Thain, University of Notre Dame, USA Cheng-Zhong Xu, Shenzhen Inst. of Advanced Technology, China **** WORKSHOPS CO-CHAIRS **** Shantenu Jha, Rutgers and Louisiana State University, USA Ioan Raicu, Illinois Institute of Technology, USA **** TUTORIALS CHAIR **** Radu Prodan, University of Innsbruck, Austria **** DOCTORAL SYMPOSIUM CO-CHAIRS **** Yogesh Simmhan, University of Southern California, USA Ana Varbanescu, Delft Univ of Technology, the Netherlands **** SUBMISSIONS AND PROCEEDINGS CHAIR **** Pavan Balaji, Argonne National Laboratory, USA **** FINANCE AND REGISTRATION CHAIR **** Alexandru Iosup, Delft Univ of Technology, the Netherlands **** PUBLICITY CHAIRS **** Nazareno Andrade, Federal University of Campina Grande, Brazil Gabriel Antoniu, INRIA, France Bahman Javadi, University of Western Sydney, Australia Ioan Raicu, Illinois Institute of Technology and Argonne National Laboratory, USA Kin Choong Yow, Shenzhen Inst. of Advanced Technology, China **** CYBER CHAIR **** Stephen van der Laan, Delft University of Technology, the Netherlands -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= =================================================================