From wilde at mcs.anl.gov Fri May 1 10:51:52 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 01 May 2009 10:51:52 -0500
Subject: [Swift-devel] Upcoming CCGrid 2009 papers relevant to scheduling
Message-ID: <49FB1A98.1060401@mcs.anl.gov>

These look interesting to us, especially the second one:

Efficient Grid Task-Bundle Allocation Using Bargaining Based Self-Adaptive Auction
Han Zhao and Xiaolin Li

Scheduling Strategies for Cycle Scavenging in Multicluster Grid Systems
Ozan Sonmez, Bart Grundeken, Hashim Mohamed, Alexandru Iosup, and Dick Epema

Developing Scheduling Policies in gLite Middleware
A. Kretsis, P. Kokkinos, and E. Varvarigos

- Mike

---------- Forwarded message ----------
From: Hu Hao yu
Date: Fri, 1 May 2009 18:35:57 +0800 (HKT)
Subject: CCGrid 2009 (May 18-21, 2009): Call For Participation
To:

(We apologize if you have received multiple copies of this e-mail)

**************************************************************************
CALL FOR PARTICIPATION
IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2009)
18-21 May 2009 in Shanghai, China
URL: http://grid.sjtu.edu.cn/ccgrid2009 or http://www.buyya.com/ccgrid
***************************************************************************

CCGrid 2009 is the ninth in a series of successful international conferences, and for the first time it will take place in China. The CCGrid symposium series serves as a major international forum for presenting and sharing recent research accomplishments and technological developments in the field of cluster and grid computing. The conference will be held at Shanghai Jiao Tong University in Shanghai, which is located along the coast of the East China Sea and the southern banks of the mouth of the Yangtze River. Looking forward to seeing you in Shanghai.

PROGRAM HIGHLIGHTS:

KEYNOTES

Market-Oriented Cloud Computing: Vision, Hype, and Reality of Delivering Computing as the 5th Utility
- Rajkumar Buyya, The University of Melbourne and Manjrasoft, Australia

Challenges and Opportunities on Parallel/Distributed Programming for large-scale: from Multi-core to Clouds
- Denis Caromel, INRIA/Univ. de Nice Sophia Antipolis, France

Online Storage and Content Distribution System at a Large Scale: Peer-assistance and Beyond
- Bo Li, Hong Kong University of Science and Technology, China

CONTRIBUTED PAPERS:

The technical program will consist of 57 contributed papers chosen from 271 submissions from 35 countries (an acceptance rate of 21%). The papers will be presented in 15 sessions (attached below).

WORKSHOPS:
- International Workshop on Cloud Computing (Cloud 2009)
- Workshop about Challenges for the application of Grids in Healthcare
- Workshop on Service-Oriented P2P Networks and Grid Systems (ServP2P 2009)
- Workshop on e-Science/e-Research Visualization
- Workflow Systems in e-Science (WSES09)

PANEL:
Cloud Computing: Technical challenges and Business Implications
Chair/Moderator: Geng Lin, Cisco Systems, USA

TUTORIALS:

Market-Oriented Grid Computing and the Gridbus Middleware
Rajkumar Buyya, The University of Melbourne and Manjrasoft, Australia

Distributed Simulation on the Grid
Stephen John Turner and Wentong Cai, Nanyang Technological University, Singapore

Introduction to Cloud Computing
James Broberg, The University of Melbourne, Australia

MINI-SYMPOSIA:

Second IEEE International Scalable Computing Challenge (SCALE 2009)
Coordinators: Manish Parashar, Rutgers University, USA and Shantenu Jha, CCT, Louisiana State University, USA

Grid Projects in China
Coordinators: Hai Jin, Huazhong University of Science and Technology, China and Minglu Li, Shanghai Jiao Tong University, China

Please register at the conference website http://grid.sjtu.edu.cn/ccgrid2009/registration.html

Thanks
CCGrid09 Organizing Committee

-----------------------------------------------------------------------------------------------
TECHNICAL PROGRAM

Session 1: Scheduling in Grid 1
Tuesday 19th May - 10:30-12:30
Session chair: Hidemoto Nakada, AIST, Japan

Efficient Grid Task-Bundle Allocation Using Bargaining Based Self-Adaptive Auction
Han Zhao and Xiaolin Li

Scheduling Strategies for Cycle Scavenging in Multicluster Grid Systems
Ozan Sonmez, Bart Grundeken, Hashim Mohamed, Alexandru Iosup, and Dick Epema

Developing Scheduling Policies in gLite Middleware
A. Kretsis, P. Kokkinos, and E. Varvarigos

Runtime Estimations, Reputation and Elections for Top Performing Distributed Query Scheduling
Rogério Luís de Carvalho Costa and Pedro Furtado

Session 2: Peer-to-peer I
Tuesday 19th May - 10:30-12:30
Session chair: Gilles Fedak, INRIA, France

Extending Pastry by an Alphanumerical Overlay
Dominic Battré, André Höing, Martin Raack, Ulf Rerrer-Brusch, and Odej Kao

Self-Chord: A Bio-inspired Algorithm for Structured P2P Systems
Agostino Forestiero, Carlo Mastroianni, and Michela Meo

BloomCast: Efficient Full-Text Retrieval over Unstructured P2Ps with Guaranteed Recall
Hanhua Chen, Hai Jin, Xucheng Luo, Yunhao Liu, and Lionel M. Ni

Multicast Trees for Collaborative Applications
Krzysztof Rzadca, Jackson Tan Teck Yong, and Anwitaman Datta

Session 3: Power Management
Tuesday 19th May - 14:00-16:00
Session chair: Satoshi Matsuoka, Tokyo Institute of Tech., Japan

Energy-Efficient Cluster Computing via Accurate Workload Characterization
S. Huang and W. Feng

Markov Model Based Disk Power Management for Data Intensive Workloads
Rajat Garg, Seung Woo Son, Mahmut Kandemir, Padma Raghavan, and Ramya Prabhakar

Optimal Power Management for Server Farm to Support Green Computing
Dusit Niyato, Sivadon Chaisiri, and Lee Bu Sung

Minimizing Energy Consumption for Precedence-Constrained Applications Using Dynamic Voltage Scaling
Young Choon Lee and Albert Y. Zomaya

Session 4: Scheduling in Grid 2
Tuesday 19th May - 14:00-16:00
Session chair: Thomas Fahringer, University of Innsbruck, Austria

The Grid Enablement and Sustainable Simulation of Multiscale Physics Applications
Yingwen Song, Yoshio Tanaka, Hiroshi Takemiya, Aiichiro Nakano, Shuji Ogata, and Satoshi Sekiguchi

Reliability-Oriented Genetic Algorithm for Workflow Applications Using Max-Min Strategy
Xiaofeng Wang, Rajkumar Buyya, and Jinshu Su

Hybrid Re-scheduling Mechanisms for Workflow Applications on Multi-cluster Grid
Yang Zhang, Charles Koelbel, and Keith Cooper

Session 5: Cloud Computing
Tuesday 19th May - 16:30-18:00
Session chair: Hai Jin, Huazhong University of Science and Technology, China

The Eucalyptus Open-Source Cloud-Computing System
Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia Youseff, and Dmitrii Zagorodnov

GenLM: License Management for Grid and Cloud Computing Environments
Mathias Dalheimer and Franz-Josef Pfreundt

On-Demand Resource Provisioning for BPEL Workflows Using Amazon's Elastic Compute Cloud
Tim Dörnemann, Ernst Juhnke, and Bernd Freisleben

Multi-Tiered On-Demand Resource Scheduling for VM-Based Data Center
Ying Song, Hui Wang, Yaqiong Li, Binquan Feng, and Yuzhong Sun

Session 6: Peer-to-peer II
Tuesday 19th May - 16:30-18:00
Session chair: Xubin He, Tennessee Technological University, USA

Reliable P2P Feed Delivery
Anwitaman Datta and Liu Xin

PAIS: A Proximity-Aware Interest-Clustered P2P File Sharing System
Haiying Shen

Adaptive Resource Indexing Technique for Unstructured Peer-to-Peer Networks
Sumeth Lerthirunwong, Naoya Maruyama, and Satoshi Matsuoka

Range Query Using Learning-Aware RPS in DHT-Based Peer-to-Peer Networks
Ze Deng, Dan Feng, Ke Zhou, Zhan Shi, and Chao Luo

Session 7: I/O & File Systems
Wednesday 20th May - 10:30-12:30
Session chair: Xiong Jing, ICT, China

GMount: An Ad Hoc and Locality-Aware Distributed File System by Using SSH and FUSE
Nan Dun, Kenjiro Taura, and Akinori Yonezawa

Improving Parallel Write by Node-Level Request Scheduling
Kazuki Ohta, Hiroya Matsuba, and Yutaka Ishikawa

File Clustering Based Replication Algorithm in a Grid Environment
Hitoshi Sato, Satoshi Matsuoka, and Toshio Endo

Latency Hiding File I/O for Blue Gene Systems
Florin Isaila, Javier Garcia Blas, Jesus Carretero, Robert Latham, Samuel Lang, and Robert Ross

Session 8: Workflow
Wednesday 20th May - 10:30-12:30
Session chair: Young Choon Lee, University of Sydney, Australia

Utility Driven Adaptive Workflow Execution
Kevin Lee, Norman W. Paton, Rizos Sakellariou, and Alvaro A. A. Fernandes

Hierarchical Caches for Grid Workflows
David Chiu and Gagan Agrawal

Performance under Failures of DAG-Based Parallel Computing
Hui Jin, Xian-He Sun, Ziming Zheng, Zhiling Lan, and Bing Xie

Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids
Yang Zhang, Anirban Mandal, Charles Koelbel, and Keith Cooper

Session 9: Resource Management
Wednesday 20th May - 14:00-16:00
Session chair: Osamu Tatebe, Tsukuba University, Japan

Harvesting Large-Scale Grids for Software Resources
Asterios Katsifodimos, George Pallis, and Marios D. Dikaiakos

Resource Allocation Using Virtual Clusters
Mark Stillwell, David Schanzenbach, Frédéric Vivien, and Henri Casanova

Resource Information Aggregation in Hierarchical Grid Networks
P. Kokkinos and E. A. Varvarigos

Measuring Fragmentation of Two-Dimensional Resources Applied to Advance Reservation Grid Scheduling
Julius Gehr and Jörg Schneider

Session 10: Data Management
Thursday 21st May - 10:30-12:30
Session chair: Bruno Schulze, LNCC, Brazil

BLAST Application with Data-Aware Desktop Grid Middleware
Haiwu He, Gilles Fedak, Bing Tang, and Franck Cappello

AVSS: An Adaptable Virtual Storage System
Jian Ke, Xu-dong Zhu, Wen-wu Na, and Lu Xu

Memory-Mapped File Approach for On-Demand Data Co-allocation on Grids
Po-Cheng Chen, Jyh-Biau Chang, Yi-Chang Zhuang, Ce-Kuen Shieh, and Tyng-Yeu Liang

Performance Issues in Parallelizing Data-Intensive Applications on a Multi-core Cluster
Vignesh T. Ravi and Gagan Agrawal

Session 11: Performance Modeling
Thursday 21st May - 10:30-12:30
Session chair: Hiroshi Nakashima, Kyoto University, Japan

Using Templates to Predict Execution Time of Scientific Workflow Applications in the Grid
Farrukh Nadeem and Thomas Fahringer

Modeling Job Arrival Process with Long Range Dependence and Burstiness Characteristics
Tran Ngoc Minh and Lex Wolters

Modeling Job Lifespan Delays in Volunteer Computing Projects
Trilce Estrada, Michela Taufer, and Kevin Reed

A Hybrid Intelligent Method for Performance Modeling and Prediction of Workflow Activities in Grids
Rubing Duan, Farrukh Nadeem, Jie Wang, Radu Prodan, and Thomas Fahringer

Session 12: Virtualization
Thursday 21st May - 14:00-16:00
Session chair: Manish Parashar, Rutgers University, USA

A Scalable Security Model for Enabling Dynamic Virtual Private Execution Infrastructures on the Internet
Pascale Vicat-Blanc Primet, Jean-Patrick Gelas, Olivier Mornard, Guilherme Koslovski, Vincent Roca, Lionel Giraud, Johan Montagnat, and Tram Truong Huu

Self-Tuning Virtual Machines for Predictable eScience
Sang-Min Park and Marty Humphrey

Dynamic Provisioning of Virtual Organization Clusters
Michael A. Murphy, Brandon Kagey, Michael Fenn, and Sebastien Goasguen

Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing
Song Fu

Session 13: High-Performance Communications and Fault Tolerance
Thursday 21st May - 14:00-16:00
Session chair: Minyi Guo, Shanghai Jiao Tong University, China

Natively Supporting True One-Sided Communication in MPI on Multi-core Systems with InfiniBand
G. Santhanaraman, P. Balaji, K. Gopalakrishnan, R. Thakur, W. Gropp, and D. K. Panda

MPISec I/O: Providing Data Confidentiality in MPI-I/O
Ramya Prabhakar, Christina Patrick, and Mahmut Kandemir

Dynamic and Distributed Multipath Routing Policy for High-Speed Cluster Networks
D. Lugones, D. Franco, and E. Luque

Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems
Pierre Riteau, Adrien Lèbre, and Christine Morin

Session 14: Monitoring and Visualization
Thursday 21st May - 16:30-18:00
Session chair: Yuzhong Sun, Institute of Computing Technology, Chinese Academy of Sciences, China

Collusion Detection for Grid Computing
Eugen Staab and Thomas Engel

Multi-scale Real-Time Grid Monitoring with Job Stream Mining
Xiangliang Zhang, Michèle Sebag, and Cécile Germain-Renaud

Towards Visualization Scalability through Time Intervals and Hierarchical Organization of Monitoring Data
Lucas Mello Schnorr, Guillaume Huard, and Philippe Olivier Alexandre Navaux

Session 15: Matching and Adaptation
Thursday 21st May - 16:30-18:00
Session chair: Carlos A. Varela, Rensselaer Polytechnic Institute, USA

A Skyline Approach to the Matchmaking Web Service
Hyuck Han, Hyungsoo Jung, Shingyu Kim, and Heon Y. Yeom

Flexible and Efficient In-Vivo Enhancement for Grid Applications
Dong Kwan Kim, Yang Jiao, and Eli Tilevich

Evaluating the Divisible Load Assumption in the Context of Economic Grid Scheduling with Deadline-Based QoS guarantees
Wim Depoorter, Ruben Van den Bossche, Kurt Vanmechelen, and Jan Broeckhove

--
Sent from my mobile device

From zhaozhang at uchicago.edu Fri May 1 11:08:19 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 01 May 2009 11:08:19 -0500
Subject: [Swift-devel] Regular Test of Coaster and Condor-g
Message-ID: <49FB1E73.2050108@uchicago.edu>

Hi, Ben, Mike

I am planning to start regular tests of Coaster and Condor-g every other day. The Coaster-provider logs will be kept at http://www.ci.uchicago.edu/~zzhang/coaster_log/, while the condor-g logs will be kept at http://www.ci.uchicago.edu/~zzhang/condor-g_log/ Until we have tools to automate the tests, I will just run them manually and post the status.

Test May 1st failed: http://www.ci.uchicago.edu/~zzhang/coaster_log/log_all_0501

These sites failed:
coaster/gwynn-coaster-gram2-gram2-condor.xml
coaster/renci-engage-coaster.xml
coaster/tgncsa-hg-coaster-pbs-gram2.xml
coaster/tgncsa-hg-coaster-pbs-gram4.xml
coaster/uj-pbs-gram2.xml

These sites worked:
coaster/coaster-local.xml
coaster/gwynn-coaster-gram2-gram2-fork.xml
coaster/teraport-gt2-gt2-pbs.xml

Compared with the April 24th test: http://www.ci.uchicago.edu/~zzhang/coaster_log/log_all_0424

All language behaviour tests passed

These sites failed:
coaster/tgncsa-hg-coaster-pbs-gram2.xml
coaster/tgncsa-hg-coaster-pbs-gram4.xml

These sites worked:
coaster/coaster-local.xml
coaster/gwynn-coaster-gram2-gram2-condor.xml
coaster/gwynn-coaster-gram2-gram2-fork.xml
coaster/renci-engage-coaster.xml
coaster/teraport-gt2-gt2-pbs.xml
coaster/uj-pbs-gram2.xml

Ben, could you help me figure out what the difference is? I ran exactly the same thing.

best wishes
zhao

From zhaozhang at uchicago.edu Fri May 1 14:01:31 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 01 May 2009 14:01:31 -0500
Subject: [Swift-devel] coaster issue
Message-ID: <49FB470B.9020305@uchicago.edu>

Hi, Mike

I got an error when I reran the coaster test on teraport. Any ideas about this error?
[zzhang at tp-grid1 sites]$ ./run-site coaster/gwynn-coaster-gram2-gram2-condor.xml testing site configuration: coaster/gwynn-coaster-gram2-gram2-condor.xml Removing files from previous runs Running test 061-cattwo at Fri May 1 13:46:37 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090501-1346-ge8hz8r7 Progress: Progress: Stage in:1 Progress: Submitting:1 Progress: Submitted:1 Failed to transfer wrapper log from 061-cattwo-20090501-1346-ge8hz8r7/info/4 on teraport Progress: Stage in:1 Progress: Active:1 Failed to transfer wrapper log from 061-cattwo-20090501-1346-ge8hz8r7/info/6 on teraport Progress: Submitting:1 Progress: Active:1 Failed to transfer wrapper log from 061-cattwo-20090501-1346-ge8hz8r7/info/8 on teraport Progress: Failed:1 Execution failed: Exception in cat: Arguments: [061-cattwo.1.in, 061-cattwo.2.in] Host: teraport Directory: 061-cattwo-20090501-1346-ge8hz8r7/jobs/8/cat-863f47aj stderr.txt: stdout.txt: ---- Caused by: Failed to start worker: null null org.globus.gram.GramException: The job failed when the job manager attempted to run it at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) at org.globus.gram.GramJob.setStatus(GramJob.java:184) at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) at java.lang.Thread.run(Thread.java:595) Cleaning up... Shutting down service at https://128.135.92.83:46035 Got channel MetaChannel: 665483948 -> GSSSChannel-null(1) - Done SWIFT RETURN CODE NON-ZERO - test 061-cattwo zhao From hategan at mcs.anl.gov Fri May 1 14:07:45 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 01 May 2009 14:07:45 -0500 Subject: [Swift-devel] Re: coaster issue In-Reply-To: <49FB470B.9020305@uchicago.edu> References: <49FB470B.9020305@uchicago.edu> Message-ID: <1241204865.14962.0.camel@localhost> On Fri, 2009-05-01 at 14:01 -0500, Zhao Zhang wrote: > Hi, Mike > > I got an error when I rerun the test with coaster on teraport. Any ideas > for this error? Does a simple globusrun work (make sure it's not a fork job)? 
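(For context: the suggestion is to take Swift and coasters out of the picture and submit a trivial job straight through GRAM to the site's batch jobmanager. A sketch of that kind of sanity check, with a placeholder gatekeeper host — Allan spells out the exact form for a real site later in this thread:

  # plain GRAM2 submission through the PBS jobmanager, not the fork jobmanager
  globus-job-run gatekeeper.example.org/jobmanager-pbs /usr/bin/id

If this fails, the problem lies in GRAM or the local queue manager rather than in Swift.)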
> > [zzhang at tp-grid1 sites]$ ./run-site > coaster/gwynn-coaster-gram2-gram2-condor.xml > testing site configuration: coaster/gwynn-coaster-gram2-gram2-condor.xml > Removing files from previous runs > Running test 061-cattwo at Fri May 1 13:46:37 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090501-1346-ge8hz8r7 > Progress: > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Submitted:1 > Failed to transfer wrapper log from > 061-cattwo-20090501-1346-ge8hz8r7/info/4 on teraport > Progress: Stage in:1 > Progress: Active:1 > Failed to transfer wrapper log from > 061-cattwo-20090501-1346-ge8hz8r7/info/6 on teraport > Progress: Submitting:1 > Progress: Active:1 > Failed to transfer wrapper log from > 061-cattwo-20090501-1346-ge8hz8r7/info/8 on teraport > Progress: Failed:1 > Execution failed: > Exception in cat: > Arguments: [061-cattwo.1.in, 061-cattwo.2.in] > Host: teraport > Directory: 061-cattwo-20090501-1346-ge8hz8r7/jobs/8/cat-863f47aj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Failed to start worker: > null > null > org.globus.gram.GramException: The job failed when the job manager > attempted to run it > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) > at org.globus.gram.GramJob.setStatus(GramJob.java:184) > at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > at java.lang.Thread.run(Thread.java:595) > > Cleaning up... > Shutting down service at https://128.135.92.83:46035 > Got channel MetaChannel: 665483948 -> GSSSChannel-null(1) > - Done > SWIFT RETURN CODE NON-ZERO - test 061-cattwo > > zhao From zhaozhang at uchicago.edu Fri May 1 14:09:58 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 01 May 2009 14:09:58 -0500 Subject: [Swift-devel] Re: coaster issue In-Reply-To: <1241204865.14962.0.camel@localhost> References: <49FB470B.9020305@uchicago.edu> <1241204865.14962.0.camel@localhost> Message-ID: <49FB4906.40702@uchicago.edu> No, globus-job-run failed ERROR: Failed to connect to local queue manager CEDAR:6001:Failed to connect to <128.135.92.83:32796> GRAM Job failed because the job failed when the job manager attempted to run it (error code 17) So, in this case could we tell that it is a GLOBUS issue, which means we need to contact the support team? zhao Mihael Hategan wrote: > On Fri, 2009-05-01 at 14:01 -0500, Zhao Zhang wrote: > >> Hi, Mike >> >> I got an error when I rerun the test with coaster on teraport. Any ideas >> for this error? >> > > Does a simple globusrun work (make sure it's not a fork job)? 
> > >> [zzhang at tp-grid1 sites]$ ./run-site >> coaster/gwynn-coaster-gram2-gram2-condor.xml >> testing site configuration: coaster/gwynn-coaster-gram2-gram2-condor.xml >> Removing files from previous runs >> Running test 061-cattwo at Fri May 1 13:46:37 CDT 2009 >> Swift 0.9rc2 swift-r2860 cog-r2388 >> >> RunID: 20090501-1346-ge8hz8r7 >> Progress: >> Progress: Stage in:1 >> Progress: Submitting:1 >> Progress: Submitted:1 >> Failed to transfer wrapper log from >> 061-cattwo-20090501-1346-ge8hz8r7/info/4 on teraport >> Progress: Stage in:1 >> Progress: Active:1 >> Failed to transfer wrapper log from >> 061-cattwo-20090501-1346-ge8hz8r7/info/6 on teraport >> Progress: Submitting:1 >> Progress: Active:1 >> Failed to transfer wrapper log from >> 061-cattwo-20090501-1346-ge8hz8r7/info/8 on teraport >> Progress: Failed:1 >> Execution failed: >> Exception in cat: >> Arguments: [061-cattwo.1.in, 061-cattwo.2.in] >> Host: teraport >> Directory: 061-cattwo-20090501-1346-ge8hz8r7/jobs/8/cat-863f47aj >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> Failed to start worker: >> null >> null >> org.globus.gram.GramException: The job failed when the job manager >> attempted to run it >> at >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) >> at org.globus.gram.GramJob.setStatus(GramJob.java:184) >> at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) >> at java.lang.Thread.run(Thread.java:595) >> >> Cleaning up... >> Shutting down service at https://128.135.92.83:46035 >> Got channel MetaChannel: 665483948 -> GSSSChannel-null(1) >> - Done >> SWIFT RETURN CODE NON-ZERO - test 061-cattwo >> >> zhao >> > > > From hategan at mcs.anl.gov Fri May 1 14:15:24 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 01 May 2009 14:15:24 -0500 Subject: [Swift-devel] Re: coaster issue In-Reply-To: <49FB4906.40702@uchicago.edu> References: <49FB470B.9020305@uchicago.edu> <1241204865.14962.0.camel@localhost> <49FB4906.40702@uchicago.edu> Message-ID: <1241205324.15273.0.camel@localhost> On Fri, 2009-05-01 at 14:09 -0500, Zhao Zhang wrote: > No, globus-job-run failed > > ERROR: Failed to connect to local queue manager > CEDAR:6001:Failed to connect to <128.135.92.83:32796> > GRAM Job failed because the job failed when the job manager attempted to > run it (error code 17) > > So, in this case could we tell that it is a GLOBUS issue, which means we > need to contact the support team? Yes, and yes. > > zhao > > Mihael Hategan wrote: > > On Fri, 2009-05-01 at 14:01 -0500, Zhao Zhang wrote: > > > >> Hi, Mike > >> > >> I got an error when I rerun the test with coaster on teraport. Any ideas > >> for this error? > >> > > > > Does a simple globusrun work (make sure it's not a fork job)? 
> > > > > >> [zzhang at tp-grid1 sites]$ ./run-site > >> coaster/gwynn-coaster-gram2-gram2-condor.xml > >> testing site configuration: coaster/gwynn-coaster-gram2-gram2-condor.xml > >> Removing files from previous runs > >> Running test 061-cattwo at Fri May 1 13:46:37 CDT 2009 > >> Swift 0.9rc2 swift-r2860 cog-r2388 > >> > >> RunID: 20090501-1346-ge8hz8r7 > >> Progress: > >> Progress: Stage in:1 > >> Progress: Submitting:1 > >> Progress: Submitted:1 > >> Failed to transfer wrapper log from > >> 061-cattwo-20090501-1346-ge8hz8r7/info/4 on teraport > >> Progress: Stage in:1 > >> Progress: Active:1 > >> Failed to transfer wrapper log from > >> 061-cattwo-20090501-1346-ge8hz8r7/info/6 on teraport > >> Progress: Submitting:1 > >> Progress: Active:1 > >> Failed to transfer wrapper log from > >> 061-cattwo-20090501-1346-ge8hz8r7/info/8 on teraport > >> Progress: Failed:1 > >> Execution failed: > >> Exception in cat: > >> Arguments: [061-cattwo.1.in, 061-cattwo.2.in] > >> Host: teraport > >> Directory: 061-cattwo-20090501-1346-ge8hz8r7/jobs/8/cat-863f47aj > >> stderr.txt: > >> > >> stdout.txt: > >> > >> ---- > >> > >> Caused by: > >> Failed to start worker: > >> null > >> null > >> org.globus.gram.GramException: The job failed when the job manager > >> attempted to run it > >> at > >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) > >> at org.globus.gram.GramJob.setStatus(GramJob.java:184) > >> at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > >> at java.lang.Thread.run(Thread.java:595) > >> > >> Cleaning up... > >> Shutting down service at https://128.135.92.83:46035 > >> Got channel MetaChannel: 665483948 -> GSSSChannel-null(1) > >> - Done > >> SWIFT RETURN CODE NON-ZERO - test 061-cattwo > >> > >> zhao > >> > > > > > > From zhaozhang at uchicago.edu Fri May 1 14:21:51 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 01 May 2009 14:21:51 -0500 Subject: [Swift-devel] Re: coaster issue In-Reply-To: <49FB4906.40702@uchicago.edu> References: <49FB470B.9020305@uchicago.edu> <1241204865.14962.0.camel@localhost> <49FB4906.40702@uchicago.edu> Message-ID: <49FB4BCF.80605@uchicago.edu> Good, Thanks, Mihael. I write it down in my test book, with "error-reason-solution" tuple. Here comes another two: testing site configuration: coaster/renci-engage-coaster.xml Removing files from previous runs Running test 061-cattwo at Fri May 1 10:35:21 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090501-1035-4fpfuxk2 Progress: Progress: Initializing site shared directory:1 Execution failed: Could not initialize shared directory on renci-engage Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Cannot create directory /nfs/osg-data/osgedu/benc/swift/061-cattwo-20090501-1035-4fpfuxk2 Caused by: Server refused performing the request. Custom message: Server refused creating directory (error code 1) [Nested exception message: Custom message: Unexpected reply: 500-Command failed : globus_gridftp_server_file.c:globus_l_gfs_file_mkdir:554: 500-System error in mkdir: Permission denied 500-A system call failed: Permission denied 500 End.] SWIFT RETURN CODE NON-ZERO - test 061-cattwo SITE FAIL! Exit code 1 for site definition coaster/renci-engage-coaster.xml The globus-job-run is ok for renci. 
[zzhang at tp-grid1 ~]$ globus-job-run belhaven-1.renci.org /usr/bin/id uid=715(osg) gid=715(osg) groups=715(osg) testing site configuration: coaster/uj-pbs-gram2.xml Removing files from previous runs Running test 061-cattwo at Fri May 1 10:39:03 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090501-1039-bqzi23df Progress: Progress: Stage in:1 Progress: Submitting:1 Progress: Submitting:1 Progress: Submitted:1 Failed to transfer wrapper log from 061-cattwo-20090501-1039-bqzi23df/info/k on teraport Progress: Stage in:1 Progress: Active:1 Progress: Failed but can retry:1 Failed to transfer wrapper log from 061-cattwo-20090501-1039-bqzi23df/info/m on teraport Progress: Stage in:1 Progress: Submitted:1 Failed to transfer wrapper log from 061-cattwo-20090501-1039-bqzi23df/info/o on teraport Progress: Failed:1 Execution failed: Exception in cat: Arguments: [061-cattwo.1.in, 061-cattwo.2.in] Host: teraport Directory: 061-cattwo-20090501-1039-bqzi23df/jobs/o/cat-oilxv6aj stderr.txt: stdout.txt: ---- Caused by: Cannot submit job: Could not submit job (qsub reported an exit code of 188). no error output org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job: Could not submit job (qsub reported an exit code of 188). no error output at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:43) at org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) at org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) Caused by: org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Could not submit job (qsub reported an exit code of 188). no error output at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:94) at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) ... 4 more Cleaning up... Shutting down service at https://152.106.18.251:46990 Got channel MetaChannel: 756415799 -> GSSSChannel-null(1) - Done SWIFT RETURN CODE NON-ZERO - test 061-cattwo SITE FAIL! Exit code 1 for site definition coaster/uj-pbs-gram2.xml globus-job-run is also ok for this site [zzhang at tp-grid1 ~]$ globus-job-run osg-ce.grid.uj.ac.za /usr/bin/id uid=640(osgedu) gid=2000(osg) groups=2000(osg) zhao From hategan at mcs.anl.gov Fri May 1 14:44:23 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 01 May 2009 14:44:23 -0500 Subject: [Swift-devel] Re: coaster issue In-Reply-To: <49FB4BCF.80605@uchicago.edu> References: <49FB470B.9020305@uchicago.edu> <1241204865.14962.0.camel@localhost> <49FB4906.40702@uchicago.edu> <49FB4BCF.80605@uchicago.edu> Message-ID: <1241207063.16409.2.camel@localhost> On Fri, 2009-05-01 at 14:21 -0500, Zhao Zhang wrote: > Good, Thanks, Mihael. I write it down in my test book, with > "error-reason-solution" tuple. > > Here comes another two: > Cannot create directory /nfs/osg-data/osgedu/benc/swift/061-cattwo-20090501-1035-4fpfuxk2 > 500-System error in mkdir: Permission denied I think that's fairly clear. You'll need to find a workdir that you can write to. > Caused by: > Cannot submit job: Could not submit job (qsub reported an exit code of 188). 
no error output

I'm sure that means something, but can you try with globus instead of the PBS provider?

From zhaozhang at uchicago.edu Fri May 1 14:57:14 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 01 May 2009 14:57:14 -0500
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <1241207063.16409.2.camel@localhost>
References: <49FB470B.9020305@uchicago.edu> <1241204865.14962.0.camel@localhost> <49FB4906.40702@uchicago.edu> <49FB4BCF.80605@uchicago.edu> <1241207063.16409.2.camel@localhost>
Message-ID: <49FB541A.7000300@uchicago.edu>

Mihael Hategan wrote:
> On Fri, 2009-05-01 at 14:21 -0500, Zhao Zhang wrote:
>> Good, Thanks, Mihael. I write it down in my test book, with
>> "error-reason-solution" tuple.
>>
>> Here comes another two:
>> Cannot create directory /nfs/osg-data/osgedu/benc/swift/061-cattwo-20090501-1035-4fpfuxk2
>> 500-System error in mkdir: Permission denied
>
> I think that's fairly clear. You'll need to find a workdir that you can
> write to.

I found "$TMP Location /nfs/osg-data" on http://vors.grid.iu.edu/cgi-bin/index.cgi?region=0&VO=24&grid=1&dtype=0&res=387#SITELISTING, but it still doesn't work. Do you know any dir that I could try out?

>> Caused by:
>> Cannot submit job: Could not submit job (qsub reported an exit code of 188). no error output
>
> I'm sure that means something, but can you try with globus instead of
> the PBS provider?

Changing jobManager="gt2:pbs" /> to jobManager="gt2:gt2" /> ?

zhao

From hategan at mcs.anl.gov Fri May 1 15:08:12 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 01 May 2009 15:08:12 -0500
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <49FB541A.7000300@uchicago.edu>
References: <49FB470B.9020305@uchicago.edu> <1241204865.14962.0.camel@localhost> <49FB4906.40702@uchicago.edu> <49FB4BCF.80605@uchicago.edu> <1241207063.16409.2.camel@localhost> <49FB541A.7000300@uchicago.edu>
Message-ID: <1241208492.22385.0.camel@localhost>

On Fri, 2009-05-01 at 14:57 -0500, Zhao Zhang wrote:
> Mihael Hategan wrote:
> > On Fri, 2009-05-01 at 14:21 -0500, Zhao Zhang wrote:
> > >
> >> Good, Thanks, Mihael. I write it down in my test book, with
> >> "error-reason-solution" tuple.
> >>
> >> Here comes another two:
> >> Cannot create directory /nfs/osg-data/osgedu/benc/swift/061-cattwo-20090501-1035-4fpfuxk2
> >> 500-System error in mkdir: Permission denied
> >
> > I think that's fairly clear. You'll need to find a workdir that you can
> > write to.
> >
> I found "$TMP Location /nfs/osg-data" on
> http://vors.grid.iu.edu/cgi-bin/index.cgi?region=0&VO=24&grid=1&dtype=0&res=387#SITELISTING,
> but it still doesn't work. Do you know any dir that I could try out?

What exactly did you try (i.e. paste relevant lines from sites.xml).

> >> Caused by:
> >> Cannot submit job: Could not submit job (qsub reported an exit code of 188). no error output
> >
> > I'm sure that means something, but can you try with globus instead of
> > the PBS provider?
> >
> Changing jobManager="gt2:pbs" /> to jobManager="gt2:gt2" /> ?

gt2:gt2:pbs.
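For readers following the shorthand: these colon-separated strings are values for the jobManager attribute of the coaster execution element in sites.xml (the archive has eaten the surrounding XML tags in the messages above). A minimal sketch of the kind of pool entry being discussed, using the UJ values that appear elsewhere in this thread — treat it as illustrative rather than a verified working configuration:

  <pool handle="uj">
    <execution provider="coaster" url="osg-ce.grid.uj.ac.za"
               jobManager="gt2:gt2:pbs"/>
    <workdirectory>/nfs/home/benc/swifttest</workdirectory>
  </pool>

Roughly, and reading from Mihael's replies here: the leading component names the provider used to reach the site, and the remainder selects how coaster workers are started — in this case through GRAM2's PBS jobmanager rather than the direct PBS (qsub) provider, which is what "try with globus instead of the PBS provider" amounts to.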
From zhaozhang at uchicago.edu Fri May 1 15:18:55 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 01 May 2009 15:18:55 -0500 Subject: [Swift-devel] Re: coaster issue In-Reply-To: <1241208492.22385.0.camel@localhost> References: <49FB470B.9020305@uchicago.edu> <1241204865.14962.0.camel@localhost> <49FB4906.40702@uchicago.edu> <49FB4BCF.80605@uchicago.edu> <1241207063.16409.2.camel@localhost> <49FB541A.7000300@uchicago.edu> <1241208492.22385.0.camel@localhost> Message-ID: <49FB592F.30609@uchicago.edu> Mihael Hategan wrote: > On Fri, 2009-05-01 at 14:57 -0500, Zhao Zhang wrote: > >> Mihael Hategan wrote: >> >>> On Fri, 2009-05-01 at 14:21 -0500, Zhao Zhang wrote: >>> >>> >>>> Good, Thanks, Mihael. I write it down in my test book, with >>>> "error-reason-solution" tuple. >>>> >>>> Here comes another two: >>>> Cannot create directory /nfs/osg-data/osgedu/benc/swift/061-cattwo-20090501-1035-4fpfuxk2 >>>> 500-System error in mkdir: Permission denied >>>> >>>> >>> I think that's fairly clear. You'll need to find a workdir that you can >>> write to. >>> >>> >>> >> I found "$TMP Location /nfs/osg-data" on >> http://vors.grid.iu.edu/cgi-bin/index.cgi?region=0&VO=24&grid=1&dtype=0&res=387#SITELISTING, >> >> but it still doesn't work. Do you know any dir that I could try out? >> > > What exactly did you try (i.e. paste relevant lines from sites.xml). > I used this one /nfs/osg-data/osgedu/benc/swift > >>> >>> >>>> Caused by: >>>> Cannot submit job: Could not submit job (qsub reported an exit code of 188). no error output >>>> >>>> >>> I'm sure that means something, but can you try with globus instead of >>> the PBS provider? >>> >>> >> Changing > jobManager="gt2:pbs" /> >> to > jobManager="gt2:gt2" /> ? >> > > gt2:gt2:pbs. > I tried it out: [zzhang at tp-grid1 sites]$ ./run-site coaster/uj-pbs-gram2.xml testing site configuration: coaster/uj-pbs-gram2.xml Removing files from previous runs Running test 061-cattwo at Fri May 1 15:15:25 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090501-1515-bnucy0g4 Progress: Progress: Stage in:1 Progress: Submitting:1 Progress: Submitting:1 Progress: Submitted:1 Progress: Active:1 Failed to transfer wrapper log from 061-cattwo-20090501-1515-bnucy0g4/info/d on teraport Progress: Stage in:1 Progress: Active:1 Progress: Failed but can retry:1 Failed to transfer wrapper log from 061-cattwo-20090501-1515-bnucy0g4/info/f on teraport Progress: Stage in:1 Progress: Active:1 Progress: Failed but can retry:1 Failed to transfer wrapper log from 061-cattwo-20090501-1515-bnucy0g4/info/h on teraport Progress: Failed:1 Execution failed: Exception in cat: Arguments: [061-cattwo.1.in, 061-cattwo.2.in] Host: teraport Directory: 061-cattwo-20090501-1515-bnucy0g4/jobs/h/cat-hzcz77aj stderr.txt: stdout.txt: ---- Caused by: Failed to start worker: null null org.globus.gram.GramException: The job failed when the job manager attempted to run it at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) at org.globus.gram.GramJob.setStatus(GramJob.java:184) at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) at java.lang.Thread.run(Thread.java:595) Cleaning up... 
Shutting down service at https://152.106.18.251:49647 Got channel MetaChannel: 1969344019 -> GSSSChannel-null(1) - Done SWIFT RETURN CODE NON-ZERO - test 061-cattwo The sites.xml file I am using is [zzhang at tp-grid1 sites]$ cat coaster/uj-pbs-gram2.xml /nfs/home/benc/swifttest 4 > > > From hategan at mcs.anl.gov Fri May 1 15:22:02 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 01 May 2009 15:22:02 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error In-Reply-To: <1241135666.3603.1.camel@localhost> References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> <49FA147E.6070205@uchicago.edu> <1241135666.3603.1.camel@localhost> Message-ID: <1241209322.28776.2.camel@localhost> Fix in cog 2394. Use globus:coasterMaxJobs profile. On Thu, 2009-04-30 at 18:54 -0500, Mihael Hategan wrote: > Mystery solved: > > Thu Apr 30 18:19:13 2009 JM_SCRIPT: ERROR: job submission failed: > Thu Apr 30 18:19:13 2009 JM_SCRIPT: > ------------------------------------------------------------------------ > Welcome to TACC's Ranger System, an NSF TeraGrid Resource > ------------------------------------------------------------------------ > > --> Submitting 16 tasks... > --> Submitting 16 tasks/host... > --> Submitting exclusive job to 1 hosts... > --> Verifying HOME file-system availability... > --> Verifying WORK file-system availability... > --> Verifying SCRATCH file-system availability... > --> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits... > --> Requesting valid memory configuration (mt=31.3G)... > --> Checking ssh keys... > --> Checking file existence and permissions for passwordless ssh... > --> Verifying accounting... > ---------------------------------------------------------------- > ERROR: You have exceeded the max submitted job count. > Maximum allowed is 50 jobs. > > Please contact TACC Consulting if you believe you have > received this message in error. > ---------------------------------------------------------------- > Job aborted by esub. > > I'll add a limit for the number of jobs allowed to the current coaster > code. > > > On Thu, 2009-04-30 at 16:13 -0500, Glen Hocky wrote: > > I have the identical response on ranger. It started yesterday evening. > > Possibly a problem that the TACC folks need to fix? > > > > Glen > > > > Yue, Chen - BMD wrote: > > > Hi Michael, > > > > > > Thank you for the advices. I tested ranger with 1 job and new > > > specifications of maxwalltime. It shows the following error message. I > > > don't know if there is other problem with my setup. Thank you! 
> > > > > > ///////////////////////////////////////////////// > > > [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file > > > sites.xml -tc.file tc.data > > > Swift 0.9rc2 swift-r2860 cog-r2388 > > > RunID: 20090430-1559-2vi6x811 > > > Progress: > > > Progress: Stage in:1 > > > Progress: Submitting:1 > > > Progress: Submitting:1 > > > Progress: Submitted:1 > > > Progress: Active:1 > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1559-2vi6x811/info/i on ranger > > > Progress: Active:1 > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1559-2vi6x811/info/k on ranger > > > Progress: Stage in:1 > > > Progress: Active:1 > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1559-2vi6x811/info/m on ranger > > > Progress: Failed:1 > > > Execution failed: > > > Exception in PTMap2: > > > Arguments: [e04.mzXML, ./seqs-ecolik12/fasta01, inputs-unmod.txt, > > > parameters.txt] > > > Host: ranger > > > Directory: PTMap2-unmod-20090430-1559-2vi6x811/jobs/m/PTMap2-mbe6m5aj > > > stderr.txt: > > > stdout.txt: > > > ---- > > > Caused by: > > > Failed to start worker: > > > null > > > null > > > org.globus.gram.GramException: The job manager detected an invalid > > > script response > > > at > > > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) > > > at org.globus.gram.GramJob.setStatus(GramJob.java:184) > > > at > > > org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > > > at java.lang.Thread.run(Thread.java:619) > > > Cleaning up... > > > Shutting down service at https://129.114.50.163:45562 > > > > > > Got channel MetaChannel: 20903429 -> GSSSChannel-null(1) > > > - Done > > > [yuechen at communicado PTMap2]$ > > > /////////////////////////////////////////////////////////// > > > > > > Chen, Yue > > > > > > > > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > > > *Sent:* Thu 4/30/2009 3:02 PM > > > *To:* Yue, Chen - BMD; swift-devel > > > *Subject:* Re: [Swift-user] Execution error > > > > > > Back on list here (I only went off-list to discuss accounts, etc) > > > > > > The problem in the run below is this: > > > > > > 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION > > > jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with > > > the given max walltime worker constraint (task: 3000, \ > > > maxwalltime: 2400s) > > > > > > You have this on the ptmap app in your tc.data: > > > > > > globus::maxwalltime=50 > > > > > > But you only gave coasters 40 mins per coaster worker. So its > > > complaining that it cant run a 50 minute job in a 40 minute (max) > > > coaster worker. ;) > > > > > > I mentioned in a prior mail that you need to set the two time vals in > > > your sites.xml entry; thats what you need to do next, now. > > > > > > change the coaster time in your sites.xml to: > > > key="coasterWorkerMaxwalltime">00:51:00 > > > > > > If you have more info on the variability of your ptmap run times, send > > > that to the list, and we can discuss how to handle. > > > > > > > > > (NOTE: doing grp -i of the log for "except" or scanning for "except" > > > with an editor will often locate the first "exception" that your job > > > encountered. Thats how I found the error above). > > > > > > Also, Yue, for testing new sites, or for validating that old sites still > > > work, you should create the smallest possible ptmap workflow - 1 job if > > > that is possible - and verify that this works. 
Then say 10 jobs to make > > > sure scheduling etc is sane. Then, send in your huge jobs. > > > > > > With only 1 job, its easier to spot the errors in the log file. > > > > > > - Mike > > > > > > > > > On 4/30/09 2:34 PM, Yue, Chen - BMD wrote: > > > > Hi Michael, > > > > > > > > I run into the same messages again when I use Ranger: > > > > > > > > Progress: Selecting site:146 Stage in:25 Submitting:15 Submitted:821 > > > > Failed but can retry:16 > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger > > > > Progress: Selecting site:146 Stage in:3 Submitting:1 Submitted:857 > > > > Failed but can retry:16 > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger > > > > Failed to transfer wrapper log from > > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > > > > The log for the search is at : > > > > /home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log > > > > > > > > The sites.xml I have is: > > > > > > > > > > > > > > > url="gatekeeper.ranger.tacc.teragrid.org" > > > > jobManager="gt2:gt2:SGE"/> > > > > > > > > > > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > > > > TG-CCR080022N > > > > 16 > > > > development > > > > > > > key="coasterWorkerMaxwalltime">00:40:00 > > > > 31 > > > > 50 > > > > 10 > > > > /work/01164/yuechen/swiftwork > > > > > > > > The tc.data I have is: > > > > > > > > ranger PTMap2 > > > > /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED > > > > INTEL32::LINUX globus::maxwalltime=50 > > > > > > > > I'm using swift 0.9 rc2 > > > > > > > > Thank you very much for help! 
> > > > > > > > Chen, Yue > > > > > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > > > > *Sent:* Thu 4/30/2009 2:05 PM > > > > *To:* Yue, Chen - BMD > > > > *Subject:* Re: [Swift-user] Execution error > > > > > > > > > > > > > > > > On 4/30/09 1:51 PM, Yue, Chen - BMD wrote: > > > > > Hi Michael, > > > > > > > > > > When I tried to activate my account, I encountered the following > > > error: > > > > > > > > > > "Sorry, this account is in an invalid state. You may not activate > > > your > > > > > at this time." > > > > > > > > > > I used the username and password from TG-CDA070002T. Should I use a > > > > > different password? > > > > > > > > If you can already login to Ranger, then you are all set - you must have > > > > done this previously. > > > > > > > > I thought you had *not*, because when I looked up your login on ranger > > > > ("finger yuechen") it said "never logged in". But seems like that info > > > > is incorrect. > > > > > > > > If you have ptmap compiled, seems like you are almost all set. > > > > > > > > Let me know if it works. > > > > > > > > - Mike > > > > > > > > > Thanks! > > > > > > > > > > Chen, Yue > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > > > > > *Sent:* Thu 4/30/2009 1:07 PM > > > > > *To:* Yue, Chen - BMD > > > > > *Cc:* swift user > > > > > *Subject:* Re: [Swift-user] Execution error > > > > > > > > > > Yue, use this XML pool element to access ranger: > > > > > > > > > > > > > > > > > > > url="gatekeeper.ranger.tacc.teragrid.org" > > > > > jobManager="gt2:gt2:SGE"/> > > > > > > > url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" /> > > > > > > > > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > > > > > > > key="project">TG-CCR080022N > > > > > 16 > > > > > development > > > > > > > > > key="coasterWorkerMaxwalltime">00:40:00 > > > > > 31 > > > > > 50 > > > > > 10 > > > > > /work/00306/tg455797/swiftwork > > > > > > > > > > > > > > > > > > > > You will need to also do these steps: > > > > > > > > > > Go to this web page to enable your Ranger account: > > > > > > > > > > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx > > > > > > > > > > Then login to Ranger via the TeraGrid portal and put your ssh keys in > > > > > place (assuming you use ssh keys, which you should) > > > > > > > > > > While on Ranger, do this: > > > > > > > > > > echo $WORK > > > > > mkdir $work/swiftwork > > > > > > > > > > and put the full path of your $WORK/swiftwork directory in the > > > > > element above. (My login is tg455etc, yours is > > > yuechen) > > > > > > > > > > Then scp your code to Ranger and compile it. > > > > > > > > > > Then create a tc.data entry for your ptmap app > > > > > > > > > > Next, set your time values in the sites.xml entry above to suitable > > > > > values for Ranger. You'll need to measure times, but I think you will > > > > > find Ranger about twice as fast as Mercury for CPU-bound jobs. > > > > > > > > > > The values above were set for one app job per coaster. I think > > > you can > > > > > probably do more. > > > > > > > > > > If you estimate a run time of 5 minutes, use: > > > > > > > > > > > > > > key="coasterWorkerMaxwalltime">00:30:00 > > > > > 5 > > > > > > > > > > Other people on the list - please sanity check what I suggest here. 
> > > > > > > > > > - Mike > > > > > > > > > > > > > > > On 4/30/09 12:40 PM, Michael Wilde wrote: > > > > > > I just checked - TG-CDA070002T has indeed expired. > > > > > > > > > > > > The best for now is to move to use (only) Ranger, under this > > > account: > > > > > > TG-CCR080022N > > > > > > > > > > > > I will locate and send you a sites.xml entry in a moment. > > > > > > > > > > > > You need to go to a web page to activate your Ranger login. > > > > > > > > > > > > Best to contact me in IM and we can work this out. > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > On 4/30/09 12:23 PM, Michael Wilde wrote: > > > > > >> Also, what account are you running under? We may need to change > > > > you to > > > > > >> a new account - as the OSG Training account expires today. > > > > > >> If that happend at Noon, it *might* be the problem. > > > > > >> > > > > > >> - Mike > > > > > >> > > > > > >> > > > > > >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: > > > > > >>> Hi, > > > > > >>> > > > > > >>> I came back to re-run my application on NCSA Mercury which was > > > > tested > > > > > >>> successfully last week after I just set up coasters with > > > swift 0.9, > > > > > >>> but I got many messages like the following: > > > > > >>> > > > > > >>> Progress: Stage in:219 Submitting:803 Submitted:1 > > > > > >>> Progress: Stage in:129 Submitting:703 Submitted:190 Failed > > > > but can > > > > > >>> retry:1 > > > > > >>> Progress: Stage in:38 Submitting:425 Submitted:556 Failed > > > but can > > > > > >>> retry:4 > > > > > >>> Failed to transfer wrapper log from > > > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY > > > > > >>> Failed to transfer wrapper log from > > > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY > > > > > >>> Failed to transfer wrapper log from > > > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY > > > > > >>> Failed to transfer wrapper log from > > > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY > > > > > >>> Failed to transfer wrapper log from > > > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY > > > > > >>> Failed to transfer wrapper log from > > > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY > > > > > >>> Progress: Stage in:1 Submitted:1013 Active:1 Failed but can > > > > retry:8 > > > > > >>> Progress: Submitted:1011 Active:1 Failed but can retry:11 > > > > > >>> The log file for the successful run last week is ; > > > > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log > > > > > >>> > > > > > >>> The log file for the failed run is : > > > > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log > > > > > >>> > > > > > >>> I don't think I did anything different, so I don't know why this > > > > time > > > > > >>> they failed. The sites.xml for Mercury is: > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > url="grid-hg.ncsa.teragrid.org" > > > > > >>> jobManager="gt2:PBS"/> > > > > > >>> > > > /gpfs_scratch1/yuechen/swiftwork > > > > > >>> debug > > > > > >>> > > > > > >>> > > > > > >>> Thank you for help! 
> > > > > >>> > > > > > >>> Chen, Yue > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> This email is intended only for the use of the individual or > > > entity > > > > > >>> to which it is addressed and may contain information that is > > > > > >>> privileged and confidential. If the reader of this email > > > message is > > > > > >>> not the intended recipient, you are hereby notified that any > > > > > >>> dissemination, distribution, or copying of this communication is > > > > > >>> prohibited. If you have received this email in error, please > > > notify > > > > > >>> the sender and destroy/delete all copies of the transmittal. > > > > Thank you. > > > > > >>> > > > > > >>> > > > > > >>> > > > > > > > > ------------------------------------------------------------------------ > > > > > >>> > > > > > >>> _______________________________________________ > > > > > >>> Swift-user mailing list > > > > > >>> Swift-user at ci.uchicago.edu > > > > > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > >> _______________________________________________ > > > > > >> Swift-user mailing list > > > > > >> Swift-user at ci.uchicago.edu > > > > > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > _______________________________________________ > > > > > > Swift-user mailing list > > > > > > Swift-user at ci.uchicago.edu > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > > > > > > This email is intended only for the use of the individual or > > > entity to > > > > > which it is addressed and may contain information that is > > > privileged and > > > > > confidential. If the reader of this email message is not the intended > > > > > recipient, you are hereby notified that any dissemination, > > > distribution, > > > > > or copying of this communication is prohibited. If you have received > > > > > this email in error, please notify the sender and destroy/delete all > > > > > copies of the transmittal. Thank you. > > > > > > > > > > > > > > > > > > > > This email is intended only for the use of the individual or entity to > > > > which it is addressed and may contain information that is privileged and > > > > confidential. If the reader of this email message is not the intended > > > > recipient, you are hereby notified that any dissemination, distribution, > > > > or copying of this communication is prohibited. If you have received > > > > this email in error, please notify the sender and destroy/delete all > > > > copies of the transmittal. Thank you. > > > > > > > > > > > > > > > This email is intended only for the use of the individual or entity to > > > which it is addressed and may contain information that is privileged > > > and confidential. If the reader of this email message is not the > > > intended recipient, you are hereby notified that any dissemination, > > > distribution, or copying of this communication is prohibited. If you > > > have received this email in error, please notify the sender and > > > destroy/delete all copies of the transmittal. Thank you. 
> > > ------------------------------------------------------------------------ > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Fri May 1 15:28:43 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 01 May 2009 15:28:43 -0500 Subject: [Swift-devel] Re: coaster issue In-Reply-To: <49FB592F.30609@uchicago.edu> References: <49FB470B.9020305@uchicago.edu> <1241204865.14962.0.camel@localhost> <49FB4906.40702@uchicago.edu> <49FB4BCF.80605@uchicago.edu> <1241207063.16409.2.camel@localhost> <49FB541A.7000300@uchicago.edu> <1241208492.22385.0.camel@localhost> <49FB592F.30609@uchicago.edu> Message-ID: <1241209723.3291.2.camel@localhost> On Fri, 2009-05-01 at 15:18 -0500, Zhao Zhang wrote: > > >>> > >>> > >> I found "$TMP Location /nfs/osg-data" on > >> http://vors.grid.iu.edu/cgi-bin/index.cgi?region=0&VO=24&grid=1&dtype=0&res=387#SITELISTING, > >> > >> but it still doesn't work. Do you know any dir that I could try out? > >> > > > > What exactly did you try (i.e. paste relevant lines from sites.xml). > > > I used this one /nfs/osg-data/osgedu/benc/swift Well, the mighty benc doesn't allow you to write to his directory. Make your own! > > > org.globus.gram.GramException: The job failed when the job manager > attempted to run it > at Use the globusrun test. Always use the globusrun test! From zhaozhang at uchicago.edu Fri May 1 16:24:52 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 01 May 2009 16:24:52 -0500 Subject: [Swift-devel] Re: coaster issue In-Reply-To: <1241209723.3291.2.camel@localhost> References: <49FB470B.9020305@uchicago.edu> <1241204865.14962.0.camel@localhost> <49FB4906.40702@uchicago.edu> <49FB4BCF.80605@uchicago.edu> <1241207063.16409.2.camel@localhost> <49FB541A.7000300@uchicago.edu> <1241208492.22385.0.camel@localhost> <49FB592F.30609@uchicago.edu> <1241209723.3291.2.camel@localhost> Message-ID: <49FB68A4.4090707@uchicago.edu> Mihael Hategan wrote: > On Fri, 2009-05-01 at 15:18 -0500, Zhao Zhang wrote: > >>>>> >>>>> >>>>> >>>> I found "$TMP Location /nfs/osg-data" on >>>> http://vors.grid.iu.edu/cgi-bin/index.cgi?region=0&VO=24&grid=1&dtype=0&res=387#SITELISTING, >>>> >>>> but it still doesn't work. Do you know any dir that I could try out? >>>> >>>> >>> What exactly did you try (i.e. paste relevant lines from sites.xml). >>> >>> >> I used this one /nfs/osg-data/osgedu/benc/swift >> > > Well, the mighty benc doesn't allow you to write to his directory. Make > your own! > > >>> >>> >> org.globus.gram.GramException: The job failed when the job manager >> attempted to run it >> at >> > > Use the globusrun test. Always use the globusrun test! > In this way? [zzhang at tp-grid1 sites]$ globus-job-run osg-ce.grid.uj.ac.za:gt2:gt2:pbs /usr/bin/id GRAM Job submission failed because data transfer to the server failed (error code 10) I am not sure if it is the right format to specify the provider. 
zhao

From aespinosa at cs.uchicago.edu Fri May 1 17:08:14 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Fri, 1 May 2009 17:08:14 -0500
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <49FB68A4.4090707@uchicago.edu>
Message-ID: <20090501220814.GA10742@origin>

Hi Zhao,

No. It's:

globus-job-run osg-ce.grid.uc.ac.za/jobmanager-pbs /usr/bin/id

Then, for GridFTP service tests, check out uberftp. You can do ls and
all that nice directory-poking, compared to the file-transfer-only
globus-url-copy.
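For example, something like this (a sketch - the host and path are just
placeholders, and the exact uberftp invocation may vary between
versions):

  $ uberftp some-ce.example.org "ls /nfs/osg-data"
  $ globus-url-copy gsiftp://some-ce.example.org/nfs/osg-data/somefile file:///tmp/somefile

The first gets you a directory listing over GridFTP; the second can only
move files around.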
-Allan

On Fri, May 01, 2009 at 04:24:52PM -0500, Zhao Zhang wrote:
> I am not sure if it is the right format to specify the provider.

From zhaozhang at uchicago.edu Fri May 1 17:53:30 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 01 May 2009 17:53:30 -0500
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <49FB68A4.4090707@uchicago.edu>
Message-ID: <49FB7D6A.60706@uchicago.edu>

Thanks. Then I got this:

[zzhang at tp-grid1 sites]$ globus-job-run osg-ce.grid.uc.ac.za/jobmanager-pbs /usr/bin/id
GRAM Job submission failed because the connection to the server failed
(check host and port) (error code 12)

zhao

From hategan at mcs.anl.gov Fri May 1 18:50:22 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 01 May 2009 18:50:22 -0500
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <49FB7D6A.60706@uchicago.edu>
Message-ID: <1241221822.6218.1.camel@localhost>

On Fri, 2009-05-01 at 17:53 -0500, Zhao Zhang wrote:
> Thanks. Then I got this:
>
> [zzhang at tp-grid1 sites]$ globus-job-run
> osg-ce.grid.uc.ac.za/jobmanager-pbs /usr/bin/id

.uj. instead of .uc. maybe? Presumably it's not the University of
Cohannesburg.

> GRAM Job submission failed because the connection to the server failed
> (check host and port) (error code 12)

From wilde at mcs.anl.gov Sat May 2 10:16:13 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 02 May 2009 10:16:13 -0500
Subject: [Swift-devel] GridFTP small-file optimizations in Swift
In-Reply-To: <1241189397.9198.1.camel@localhost>
Message-ID: <49FC63BD.6080109@mcs.anl.gov>

This is an interesting, low-priority background discussion. I'm moving
it here to swift-devel for group benefit and comment.

Ian asked me:

"do you know if the GridFTP "lots of small files" optimizations have
helped Swift at all"?

Mihael said:

"There are two such optimizations I can think of: data channel re-use
and pipelining of commands"

Based on this I wonder:

1) how does swift-over-cog use these two features? (i.e., how much data
gets batched into one channel? All the files to/from a job? All the
files to/from a site for the duration of a workflow? Are channels cached
or just kept open once opened?)

2) how hard would it be to turn the feature off, so the CEDPS folks
could get a before vs. after measurement for a few workflows?

Like I said, just background discussion.

CEDPS has a review coming up; it would be great if we could run some
experiments on real workflows to compare the benefits of these
optimizations to the "before" case.
I'd be willing to try that, time permitting, if I could turn it on or
off. Suggestions welcome.

- Mike

On 5/1/09 9:49 AM, Mihael Hategan wrote:
> On Fri, 2009-05-01 at 08:02 -0500, Ian Foster wrote:
>> LOSF optimization refers to the support added for reducing per-file
>> startup costs by streaming one after the other--helpful for when
>> sending lots of small files.
>
> Right. There are two such optimizations I can think of: data channel
> re-use and pipelining of commands.
>
>> On Apr 30, 2009, at 7:10 PM, Mihael Hategan wrote:
>>> On Thu, 2009-04-30 at 18:58 -0500, Michael Wilde wrote:
>>>> Mike, do you know if the GridFTP "lots of small files"
>>>> optimizations have helped Swift at all?
>>> 1. "lots of small files" optimizations is ambiguous.
>>> 2. I don't think we've made any clear measurements with a real-world
>>> application of anything gridftp compared to anything else gridftp.
>>> 3. I think that whatever was done to make things faster helps, so if
>>> you want an opinion, then "yes".
>>>> I think Mihael has incorporated some aspect of that optimization
>>>> into the CoG data provider and Swift, but cc him here for the
>>>> right answer.
>>>> - Mike

From hategan at mcs.anl.gov Sat May 2 11:52:13 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 02 May 2009 11:52:13 -0500
Subject: [Swift-devel] Re: GridFTP small-file optimizations in Swift
In-Reply-To: <49FC63BD.6080109@mcs.anl.gov>
Message-ID: <1241283133.12053.4.camel@localhost>

On Sat, 2009-05-02 at 10:16 -0500, Michael Wilde wrote:
> 1) how does swift-over-cog use these two features? (i.e., how much data
> gets batched into one channel? All the files to/from a job? All the
> files to/from a site for the duration of a workflow? Are channels
> cached or just kept open once opened?)

Swift doesn't use pipelining. We discussed this at length on this
mailing list before. Data channels are re-used. Clients/connections are
cached based not on jobs but on site and time (i.e. they have a maximum
idle time).

> 2) how hard would it be to turn the feature off, so the CEDPS folks
> could get a before vs. after measurement for a few workflows?

There are no flags for that at this time, but I can add some if you
want.
From benc at hawaga.org.uk Mon May 4 01:42:53 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 06:42:53 +0000 (GMT)
Subject: [Swift-devel] Re: Regular Test of Coaster and Condor-g
In-Reply-To: <49FB1E73.2050108@uchicago.edu>

On Fri, 1 May 2009, Zhao Zhang wrote:

> Before we have any tools to do the tests, I will just run it manually
> and post the status.

Have a look at the NMI build and test system, Metronome. That is already
used to run the tests that do not require a credential. If I had figured
out a way to deal with credentials that I am comfortable with, then it
would also run site tests.

--

From benc at hawaga.org.uk Mon May 4 01:59:58 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 06:59:58 +0000 (GMT)
Subject: [Swift-devel] Rough Swift Test Plan
In-Reply-To: <49FA059F.1070309@uchicago.edu>

On Thu, 30 Apr 2009, Zhao Zhang wrote:

> Subject: Rough Swift Test Plan

Please do not call this the test plan for Swift - that suggests that
you are testing all of Swift. This is the test plan for some specific
parts of Swift, those you list in '2. Features'.

> 2. Features
> We care about two features right now: swift-over-condorG and
> swift-over-coaster-over-condorG.

For credential-based testing, I also care about plain gram2 execution
and plain gt4 execution. The local condor and PBS providers need
ongoing testing too.

> 3. Automation
> Mike mentioned an automatic test tool for many platforms called NMI
> Test&Build. Ben, can we use the same system to do the tests on real
> grid sites?

The main thing that has stopped me running tests there is getting a
credential there. So the tests that run on NMI are basically the tests
that do not require a credential. If you figure out how credentials
should work, then it is a small change to the NMI build configuration
to make all of the site tests run. This is the main (only?) thing that
has stopped credential-based testing on NMI.

You cannot test the local PBS and condor providers on NMI because you
need to run on a machine with PBS and/or condor installed locally.

> 4. Sanity Check
> Before we start a swift workflow test, is there a way that we could
> test if GT2 or GT4 are working there?

You can run globus commands and check if they succeed. But those will
need separate configuration, as they cannot read from sites.xml, so
that might not be worth the effort.
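For instance, a minimal manual sanity check could be as small as this
(a sketch - the gatekeeper and host names are placeholders):

  $ globusrun -a -r some-ce.example.org/jobmanager-pbs    # authentication-only test
  $ globus-job-run some-ce.example.org/jobmanager-pbs /bin/date
  $ globus-url-copy file:///etc/group gsiftp://some-ce.example.org/tmp/group.test

If all three succeed, both GRAM submission and GridFTP transfer work
for the credential in hand.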
--

From benc at hawaga.org.uk Mon May 4 02:08:40 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 07:08:40 +0000 (GMT)
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <1241209723.3291.2.camel@localhost>

On Fri, 1 May 2009, Mihael Hategan wrote:

> > I used this one /nfs/osg-data/osgedu/benc/swift
>
> Well, the mighty benc doesn't allow you to write to his directory.
> Make your own!

My intention when putting stuff in the SVN tests directory was to make
working directories chmod a+rwxt (like /tmp), but that appears not to
be the case for this one.

For OSG sites, I should probably follow the convention established in
swift-osg-ress-site-catalog of $OSG_DATA/voname/tmp/sitename/

--

From benc at hawaga.org.uk Mon May 4 02:33:17 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 07:33:17 +0000 (GMT)
Subject: [Swift-devel] Re: coaster issue

On Mon, 4 May 2009, Ben Clifford wrote:
> On Fri, 1 May 2009, Mihael Hategan wrote:
> > > I used this one /nfs/osg-data/osgedu/benc/swift
> >
> > Well, the mighty benc doesn't allow you to write to his directory.
> > Make your own!

Actually, that isn't the problem.

Apparently Zhao has his credential mapped to two VOs - engage and
osgedu. Nondeterministically, he will run as the osgedu or engage user
on an OSG site that supports both. The test directories are set up for
people being mapped into user osgedu.

In general, Swift gets into trouble on sites where you're mapped to
multiple users, and I think it is inadvisable to do that.

--

From benc at hawaga.org.uk Mon May 4 02:37:39 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 07:37:39 +0000 (GMT)
Subject: [Swift-devel] GridFTP small-file optimizations in Swift
In-Reply-To: <49FC63BD.6080109@mcs.anl.gov>

On Sat, 2 May 2009, Michael Wilde wrote:

> "do you know if the GridFTP "lots of small files" optimizations have
> helped Swift at all"?

The one time I paid attention to that for an application, the small
application files were still too small for the GridFTP "lots of small
files" optimizations (according to a GridFTP person); that was, I
think, what led to the creation of the coaster file transfer provider.

--

From benc at hawaga.org.uk Mon May 4 02:38:39 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 07:38:39 +0000 (GMT)
Subject: [Swift-devel] Rough Swift Test Plan

On Mon, 4 May 2009, Ben Clifford wrote:

> For credential-based testing, I also care about plain gram2 execution
> and plain gt4 execution.

Two other credential-based tests: tests using the coaster file transfer
provider, and tests of mappers using gridftp URLs instead of
submit-system files.
--

From benc at hawaga.org.uk Mon May 4 04:29:33 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 09:29:33 +0000 (GMT)
Subject: [Swift-devel] Re: Regular Test of Coaster and Condor-g
In-Reply-To: <49FB1E73.2050108@uchicago.edu>

On Fri, 1 May 2009, Zhao Zhang wrote:

> Ben, could you help me tell what the difference there is? I ran
> exactly the same thing.

You need to look at each site individually and see what the errors are
for each one. The sites test code does not keep logs for each site
(although you could modify it to do so), so you need to run a broken
site individually using the run-site test command, and then look at
what the cause of failure for that site is.

--

From benc at hawaga.org.uk Mon May 4 08:57:09 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 13:57:09 +0000 (GMT)
Subject: [Swift-devel] svn bounce messages

Swift commit messages to the dev.globus version of the commit list are
bouncing as of a day or so ago, according to a source who is listed as
an admin on that list but does not pretend to admin it. Is there anyone
here who owns that list? If so, it might be useful to fix.

--

From zhaozhang at uchicago.edu Mon May 4 09:01:54 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 04 May 2009 09:01:54 -0500
Subject: [Swift-devel] Re: coaster issue
Message-ID: <49FEF552.60804@uchicago.edu>

Hi, Ben

Then what shall I do to find out which is my default VO, and how can I
change among different VOs? Thanks.

zhao

Ben Clifford wrote:
> Apparently Zhao has his credential mapped to two VOs - engage and
> osgedu. Nondeterministically, he will run as the osgedu or engage
> user on an OSG site that supports both.

From benc at hawaga.org.uk Mon May 4 09:06:47 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 14:06:47 +0000 (GMT)
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <49FEF552.60804@uchicago.edu>

On Mon, 4 May 2009, Zhao Zhang wrote:

> Then what shall I do to find out which is my default VO, and how can I
> change among different VOs? Thanks.

You don't have a default VO. If you are mapped to multiple VOs you will
be non-deterministically mapped to a VO by many sites. On some sites,
you can use voms-proxy-init to pick a particular VO. That is often not
the case.
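For example (assuming your client is configured for an "osgedu" VOMS
server - VO names and availability vary by deployment):

  $ voms-proxy-init -voms osgedu
  $ voms-proxy-info -all

The first requests a proxy carrying osgedu attributes; the second shows
which VO attributes the current proxy actually carries.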
A common way of addressing this is having a different credential for
each VO. This is also a problem on TeraGrid systems when you get mapped
to community accounts in addition to your personal account.

--

From hategan at mcs.anl.gov Mon May 4 09:57:07 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 04 May 2009 09:57:07 -0500
Subject: [Swift-devel] Re: coaster issue
Message-ID: <1241449027.22870.0.camel@localhost>

On Mon, 2009-05-04 at 07:33 +0000, Ben Clifford wrote:
> Apparently Zhao has his credential mapped to two VOs - engage and
> osgedu. Nondeterministically, he will run as the osgedu or engage
> user on an OSG site that supports both. The test directories are set
> up for people being mapped into user osgedu.

So on that site he's not allowed to write to that directory maybe?

From benc at hawaga.org.uk Mon May 4 09:56:58 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 14:56:58 +0000 (GMT)
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <1241449027.22870.0.camel@localhost>

On Mon, 4 May 2009, Mihael Hategan wrote:

> So on that site he's not allowed to write to that directory maybe?

I looked at this with Mats a bit the other week. He was being mapped to
the engage user, not to the osgedu user, on that site. So the
osgedu-accessible directories are not accessible.

Really the site tests need you to be mapped into the OSGEDU VO as
default, and not to anything else.

--

From zhaozhang at uchicago.edu Mon May 4 10:04:01 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 04 May 2009 10:04:01 -0500
Subject: [Swift-devel] Re: coaster issue
Message-ID: <49FF03E1.3040607@uchicago.edu>

Hi, Ben

This means that if I am mapped to the osgedu VO, I cannot run anything
on the renci site defined in renci-engage-coaster.xml, right? In order
to run the test on renci-engage-coaster, I would have to apply for
another credential mapped to the engage VO. Correct me if I am wrong.
Thanks.

zhao

Ben Clifford wrote:
> I looked at this with Mats a bit the other week. He was being mapped
> to the engage user, not to the osgedu user, on that site. So the
> osgedu-accessible directories are not accessible.

From wilde at mcs.anl.gov Mon May 4 10:07:39 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 04 May 2009 10:07:39 -0500
Subject: [Swift-devel] Re: coaster issue
Message-ID: <49FF04BB.8050404@mcs.anl.gov>

My understanding is that there are only 2 sites that are intended to
support the OSGEDU VO (regardless of what the site info DB says at the
moment, which may or may not agree).

If that's correct, should we change site testing to use the OSG VO?

On 5/4/09 9:56 AM, Ben Clifford wrote:
> Really the site tests need you to be mapped into the OSGEDU VO as
> default, and not to anything else.

From rynge at renci.org Mon May 4 10:08:55 2009
From: rynge at renci.org (Mats Rynge)
Date: Mon, 04 May 2009 11:08:55 -0400
Subject: [Swift-devel] Re: coaster issue
Message-ID: <49FF0507.3080009@renci.org>

Ben Clifford wrote:
> Really the site tests need you to be mapped into the OSGEDU VO as
> default, and not to anything else.

I can disable him in the Engagement VO, and that should take care of
this issue. Let me know.

And when running as OSGEDU, you should still use $OSG_DATA/osgedu/tmp/
or something like that. Using $HOME will cause problems later on.
--
Mats Rynge
Renaissance Computing Institute

From rynge at renci.org Mon May 4 10:10:19 2009
From: rynge at renci.org (Mats Rynge)
Date: Mon, 04 May 2009 11:10:19 -0400
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <49FF03E1.3040607@uchicago.edu>
Message-ID: <49FF055B.1000600@renci.org>

Zhao Zhang wrote:
> Hi, Ben
>
> This means that if I am mapped to the osgedu VO, I cannot run anything
> on the renci site defined in renci-engage-coaster.xml, right?

The RENCI site supports osgedu, so it should work with your existing
credential.

--
Mats Rynge
Renaissance Computing Institute

From benc at hawaga.org.uk Mon May 4 10:13:26 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 15:13:26 +0000 (GMT)
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <49FF0507.3080009@renci.org>

On Mon, 4 May 2009, Mats Rynge wrote:

> I can disable him in the Engagement VO, and that should take care of
> this issue. Let me know.

Zhao should say yes or no.

> And when running as OSGEDU, you should still use $OSG_DATA/osgedu/tmp/
> or something like that. Using $HOME will cause problems later on.

Nothing in the sites repository should be using $HOME for OSG sites -
if they do, they need pointing out and fixing.

--

From zhaozhang at uchicago.edu Mon May 4 10:15:13 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 04 May 2009 10:15:13 -0500
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <49FF055B.1000600@renci.org>
Message-ID: <49FF0681.3050409@uchicago.edu>

Hi, Mats

But I still could not run it:

[zzhang at communicado sites]$ ./run-site coaster/renci-engage-coaster.xml
testing site configuration: coaster/renci-engage-coaster.xml
Removing files from previous runs
Running test 061-cattwo at Mon May 4 10:13:57 CDT 2009
Swift svn swift-r2896 cog-r2392

RunID: 20090504-1013-fvsq25zf
Progress:
Progress: Initializing site shared directory:1
Execution failed:
Could not initialize shared directory on renci-engage
Caused by:
org.globus.cog.abstraction.impl.file.FileResourceException:
Cannot create directory
/nfs/osg-data/osgedu/benc/swift/061-cattwo-20090504-1013-fvsq25zf
Caused by:
Server refused performing the request.
Custom message: Server refused creating directory (error code 1)
[Nested exception message: Custom message: Unexpected reply:
500-Command failed : globus_gridftp_server_file.c:globus_l_gfs_file_mkdir:554:
500-System error in mkdir: Permission denied
500-A system call failed: Permission denied
500 End.]
SWIFT RETURN CODE NON-ZERO - test 061-cattwo

and my sites.xml is:

<config>
  <pool handle="renci-engage">
    <gridftp url="gsiftp://belhaven-1.renci.org" />
    <execution provider="coaster" url="belhaven-1.renci.org"
               jobManager="gt2:gt2:condor" />
    <workdirectory>/nfs/osg-data/osgedu/benc/swift</workdirectory>
    <profile namespace="globus" key="internalHostname">192.168.1.11</profile>
  </pool>
</config>

zhao

Mats Rynge wrote:
> The RENCI site supports osgedu, so it should work with your existing
> credential.

From benc at hawaga.org.uk Mon May 4 10:16:05 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 15:16:05 +0000 (GMT)
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <49FF04BB.8050404@mcs.anl.gov>

On Mon, 4 May 2009, Michael Wilde wrote:

> My understanding is that there are only 2 sites that are intended to
> support the OSGEDU VO (regardless of what the site info DB says at
> the moment, which may or may not agree).

I think there is no clear consensus on that. Plenty of sites advertise
in ReSS that they support osgedu, and plenty accept such jobs.

> If that's correct, should we change site testing to use the OSG VO?

That might be better in the long term. I'd prefer not to mess with it
right now, though, as it is going to break a lot of stuff during
switchover and I'd rather people work on other stuff than spend day(s)
dealing with this now.

--

From zhaozhang at uchicago.edu Mon May 4 10:16:39 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 04 May 2009 10:16:39 -0500
Subject: [Swift-devel] Re: coaster issue
Message-ID: <49FF06D7.3000604@uchicago.edu>

Hi, Mats

Ben Clifford wrote:
> On Mon, 4 May 2009, Mats Rynge wrote:
> > I can disable him in the Engagement VO, and that should take care
> > of this issue. Let me know.
>
> Zhao should say yes or no.

OK, disable me in the Engagement VO, and let's see how it goes.
Thanks.

zhao

From wilde at mcs.anl.gov Mon May 4 10:32:10 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 04 May 2009 10:32:10 -0500
Subject: [Swift-devel] Re: coaster issue
Message-ID: <49FF0A7A.1010006@mcs.anl.gov>

Sounds good. If we have "plenty" of sites, that's good for now. OSG
policy on OSGEDU may change or get enforced at some point. We can
change at that point if needed.

On 5/4/09 10:16 AM, Ben Clifford wrote:
> I think there is no clear consensus on that. Plenty of sites advertise
> in ReSS that they support osgedu, and plenty accept such jobs.

From rynge at renci.org Mon May 4 10:51:18 2009
From: rynge at renci.org (Mats Rynge)
Date: Mon, 04 May 2009 11:51:18 -0400
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <49FF0681.3050409@uchicago.edu>
Message-ID: <49FF0EF6.9070502@renci.org>

Zhao Zhang wrote:
> Hi, Mats
>
> But I still could not run it.

Today you are mapped to the OSG VO. We need to disable the VOs you do
not want to run as, and make sure only one VO is left to use.

For your testing, I suggest you stay with the OSG VO, and leave the
Engagement and OSGEDU VOs.

--
Mats Rynge
Renaissance Computing Institute

From benc at hawaga.org.uk Mon May 4 10:53:23 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 15:53:23 +0000 (GMT)
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <49FF0EF6.9070502@renci.org>

On Mon, 4 May 2009, Mats Rynge wrote:

> For your testing, I suggest you stay with the OSG VO, and leave the
> Engagement and OSGEDU VOs.

See elsewhere in this thread. The swift sites are configured already
for osgedu, and I would prefer to not mess with that at this time.

--

From rynge at renci.org Mon May 4 10:59:11 2009
From: rynge at renci.org (Mats Rynge)
Date: Mon, 04 May 2009 11:59:11 -0400
Subject: [Swift-devel] Re: coaster issue
Message-ID: <49FF10CF.5070005@renci.org>

Ben Clifford wrote:
> See elsewhere in this thread. The swift sites are configured already
> for osgedu, and I would prefer to not mess with that at this time.

Ok, then we disable Engagement (just completed) and the OSG VO.
--
Mats Rynge
Renaissance Computing Institute

From zhaozhang at uchicago.edu Mon May 4 11:40:24 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 04 May 2009 11:40:24 -0500
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <49FF10CF.5070005@renci.org>
Message-ID: <49FF1A78.40202@uchicago.edu>

Hi, Mats

I got this:

[zzhang at communicado sites]$ globus-job-run belhaven-1.renci.org /usr/bin/id
uid=715(osg) gid=715(osg) groups=715(osg)

Does this mean I belong to the OSG VO right now?

zhao

Mats Rynge wrote:
> Ok, then we disable Engagement (just completed) and the OSG VO.

From rynge at renci.org Mon May 4 11:52:33 2009
From: rynge at renci.org (Mats Rynge)
Date: Mon, 04 May 2009 12:52:33 -0400
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <49FF1A78.40202@uchicago.edu>
Message-ID: <49FF1D51.4000403@renci.org>

Zhao Zhang wrote:
> [zzhang at communicado sites]$ globus-job-run belhaven-1.renci.org /usr/bin/id
> uid=715(osg) gid=715(osg) groups=715(osg)
>
> Does this mean I belong to the OSG VO right now?

Yes.

If you have your certificate loaded into your browser, go to the OSG
VOMRS server at:

https://vomrs.fnal.gov:8443/vomrs/osg/vomrs

And see if you can remove yourself.
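To check the resulting mappings from the command line, a loop like this
works (a sketch - the host list is illustrative):

  $ for ce in belhaven-1.renci.org osg-ce.grid.uj.ac.za; do
        echo -n "$ce: "; globus-job-run $ce /usr/bin/id
    done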
--
Mats Rynge
Renaissance Computing Institute

From zhaozhang at uchicago.edu Mon May 4 13:53:39 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 04 May 2009 13:53:39 -0500
Subject: [Swift-devel] Re: coaster issue
In-Reply-To: <49FF1D51.4000403@renci.org>
Message-ID: <49FF39B3.9000106@uchicago.edu>

Hi, All

I got the test to finish successfully by changing the workdirectory to
"/nfs/osg-data/osg/tmp".

zhao

Mats Rynge wrote:
> If you have your certificate loaded into your browser, go to the OSG
> VOMRS server at:
>
> https://vomrs.fnal.gov:8443/vomrs/osg/vomrs
>
> And see if you can remove yourself.

From zhaozhang at uchicago.edu Mon May 4 14:03:00 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 04 May 2009 14:03:00 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
Message-ID: <49FF3BE4.5040101@uchicago.edu>

Hi, Ben

The coaster test on uj-pbs-gram2.xml failed for the reason below. I am
also attaching the result from "globus-job-run" and the contents of
uj-pbs-gram2.xml. Any idea why this is happening? Thanks.

zhao

[zzhang at communicado sites]$ ./run-site coaster/uj-pbs-gram2.xml
testing site configuration: coaster/uj-pbs-gram2.xml
Removing files from previous runs
Running test 061-cattwo at Mon May 4 13:57:42 CDT 2009
Swift 0.9rc2 swift-r2860 cog-r2388

RunID: 20090504-1357-43geyd0c
Progress:
Progress: Stage in:1
Progress: Submitting:1
Progress: Submitting:1
Progress: Submitted:1
Progress: Active:1
Failed to transfer wrapper log from 061-cattwo-20090504-1357-43geyd0c/info/v on teraport
Progress: Stage in:1
Progress: Active:1
Progress: Failed but can retry:1
Failed to transfer wrapper log from 061-cattwo-20090504-1357-43geyd0c/info/y on teraport
Progress: Stage in:1
Progress: Active:1
Progress: Failed but can retry:1
Failed to transfer wrapper log from 061-cattwo-20090504-1357-43geyd0c/info/0 on teraport
Progress: Failed:1
Execution failed:
Exception in cat:
Arguments: [061-cattwo.1.in, 061-cattwo.2.in]
Host: teraport
Directory: 061-cattwo-20090504-1357-43geyd0c/jobs/0/cat-0k2l2caj
stderr.txt:
stdout.txt:
----
Caused by:
Failed to start worker:
null
null
org.globus.gram.GramException: The job failed when the job manager
attempted to run it
at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530)
at org.globus.gram.GramJob.setStatus(GramJob.java:184)
at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
at java.lang.Thread.run(Thread.java:595)

Cleaning up...
Shutting down service at https://152.106.18.251:34054
Got channel MetaChannel: 12128247 -> GSSSChannel-null(1)
- Done
SWIFT RETURN CODE NON-ZERO - test 061-cattwo

[zzhang at communicado sites]$ globus-job-run osg-ce.grid.uj.ac.za /usr/bin/id
uid=640(osgedu) gid=2000(osg) groups=2000(osg)

<config>
  <pool handle="teraport">
    <gridftp url="gsiftp://osg-ce.grid.uj.ac.za" />
    <execution provider="coaster" url="osg-ce.grid.uj.ac.za"
               jobManager="gt2:gt2:pbs" />
    <workdirectory>/nfs/home/benc/swifttest</workdirectory>
    <profile namespace="globus" key="workersPerNode">4</profile>
  </pool>
</config>

From hategan at mcs.anl.gov Mon May 4 14:25:13 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 04 May 2009 14:25:13 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <49FF3BE4.5040101@uchicago.edu>
Message-ID: <1241465113.28803.1.camel@localhost>

Can you retrieve the relevant gram log (it's on the remote site, in
your home directory, named gram*.log)?

On Mon, 2009-05-04 at 14:03 -0500, Zhao Zhang wrote:
> The coaster test on uj-pbs-gram2.xml failed for the reason below.
> Any idea why this is happening? Thanks.

From benc at hawaga.org.uk Mon May 4 14:29:14 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 19:29:14 +0000 (GMT)
Subject: [Swift-devel] Re: Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <49FF3BE4.5040101@uchicago.edu>

Not sure why that's happening, but I confirm that it also happens for
me using that site definition.
It didn't when I committed it.

--

From zhaozhang at uchicago.edu Mon May 4 14:29:57 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 04 May 2009 14:29:57 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <1241465113.28803.1.camel@localhost>
Message-ID: <49FF4235.20101@uchicago.edu>

Hi, Mihael

I could not log in to that machine with either ssh or gsissh:

[zzhang at communicado sites]$ ssh osg-ce.grid.uj.ac.za
ssh: connect to host osg-ce.grid.uj.ac.za port 22: Connection refused
[zzhang at communicado sites]$ gsissh osg-ce.grid.uj.ac.za
ssh: connect to host osg-ce.grid.uj.ac.za port 22: Connection refused

zhao

Mihael Hategan wrote:
> Can you retrieve the relevant gram log (it's on the remote site, in
> your home directory, named gram*.log)?
From hategan at mcs.anl.gov Mon May 4 14:35:59 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 04 May 2009 14:35:59 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <49FF4235.20101@uchicago.edu>
Message-ID: <1241465759.29191.1.camel@localhost>

On Mon, 2009-05-04 at 14:29 -0500, Zhao Zhang wrote:
> I could not log in to that machine with either ssh or gsissh.

Come on, try harder.

Use globusrun and globus-url-copy.

From benc at hawaga.org.uk Mon May 4 14:34:35 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 19:34:35 +0000 (GMT)
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <49FF4235.20101@uchicago.edu>

On Mon, 4 May 2009, Zhao Zhang wrote:

> I could not log in to that machine with either ssh or gsissh.

No-one apart from special people can ssh into that machine - in
general, most OSG machines will not let you log in interactively. You
should be able to use globus-job-run and globus-url-copy to list and
copy files from that machine to your own machine, though.
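For example, roughly like this (a sketch - the log name is a
placeholder, and whether "~" resolves inside a gsiftp URL depends on
the server setup):

  $ globus-job-run osg-ce.grid.uj.ac.za /bin/ls .
  $ globus-url-copy gsiftp://osg-ce.grid.uj.ac.za/~/gram_job_mgr_NNNNN.log \
        file:///tmp/gram_job_mgr_NNNNN.log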
--

From zhaozhang at uchicago.edu Mon May 4 14:46:46 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 04 May 2009 14:46:46 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <1241465759.29191.1.camel@localhost>
Message-ID: <49FF4626.7000000@uchicago.edu>

Ha, I got it.

It is on the CI network at:
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/gram_job_mgr_22265.log

zhao

Mihael Hategan wrote:
> Come on, try harder.
>
> Use globusrun and globus-url-copy.
From hategan at mcs.anl.gov Mon May 4 15:01:56 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 04 May 2009 15:01:56 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <49FF4626.7000000@uchicago.edu>
References: <49FF3BE4.5040101@uchicago.edu> <1241465113.28803.1.camel@localhost> <49FF4235.20101@uchicago.edu> <1241465759.29191.1.camel@localhost> <49FF4626.7000000@uchicago.edu>
Message-ID: <1241467316.30018.0.camel@localhost>

On Mon, 2009-05-04 at 14:46 -0500, Zhao Zhang wrote:
> Ha, I got it.
>
> It is at CI network:
> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/gram_job_mgr_22265.log

There you go :)

I'm not sure this is the right log though. This is the one from the job
that was used to start the coaster service, and it shows a different
error (155 - stageout failure).

From zhaozhang at uchicago.edu Mon May 4 15:02:44 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 04 May 2009 15:02:44 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <1241467316.30018.0.camel@localhost>
References: <49FF3BE4.5040101@uchicago.edu> <1241465113.28803.1.camel@localhost> <49FF4235.20101@uchicago.edu> <1241465759.29191.1.camel@localhost> <49FF4626.7000000@uchicago.edu> <1241467316.30018.0.camel@localhost>
Message-ID: <49FF49E4.6060707@uchicago.edu>

I also pulled out the coaster log in the same directory. Then what shall
we do with this error?

zhao

Mihael Hategan wrote:
> I'm not sure this is the right log though. This is the one from the job
> that was used to start the coaster service, and it shows a different
> error (155 - stageout failure).

From hategan at mcs.anl.gov Mon May 4 15:07:40 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 04 May 2009 15:07:40 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <49FF49E4.6060707@uchicago.edu>
References: <49FF3BE4.5040101@uchicago.edu> <1241465113.28803.1.camel@localhost> <49FF4235.20101@uchicago.edu> <1241465759.29191.1.camel@localhost> <49FF4626.7000000@uchicago.edu> <1241467316.30018.0.camel@localhost> <49FF49E4.6060707@uchicago.edu>
Message-ID: <1241467660.30401.0.camel@localhost>

On Mon, 2009-05-04 at 15:02 -0500, Zhao Zhang wrote:
> I also pulled out the coaster log in the same directory. Then what shall
> we do with this error?

Was that the only gram log? What I suspect is that there is another gram
log which is more relevant.
From zhaozhang at uchicago.edu Mon May 4 15:07:14 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 04 May 2009 15:07:14 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <1241467660.30401.0.camel@localhost>
References: <49FF3BE4.5040101@uchicago.edu> <1241465113.28803.1.camel@localhost> <49FF4235.20101@uchicago.edu> <1241465759.29191.1.camel@localhost> <49FF4626.7000000@uchicago.edu> <1241467316.30018.0.camel@localhost> <49FF49E4.6060707@uchicago.edu> <1241467660.30401.0.camel@localhost>
Message-ID: <49FF4AF2.6050400@uchicago.edu>

This one?

/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/gram_job_mgr_18272.log

zhao

Mihael Hategan wrote:
> Was that the only gram log? What I suspect is that there is another gram
> log which is more relevant.

From benc at hawaga.org.uk Mon May 4 15:08:03 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 4 May 2009 20:08:03 +0000 (GMT)
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <1241467660.30401.0.camel@localhost>
References: <49FF3BE4.5040101@uchicago.edu> <1241465113.28803.1.camel@localhost> <49FF4235.20101@uchicago.edu> <1241465759.29191.1.camel@localhost> <49FF4626.7000000@uchicago.edu> <1241467316.30018.0.camel@localhost> <49FF49E4.6060707@uchicago.edu> <1241467660.30401.0.camel@localhost>
Message-ID: 

On Mon, 4 May 2009, Mihael Hategan wrote:
> Was that the only gram log? What I suspect is that there is another gram
> log which is more relevant.

Lots of gram logs there:

ls gram_job_mgr_* | wc -l
797

From zhaozhang at uchicago.edu Mon May 4 15:09:36 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 04 May 2009 15:09:36 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: 
References: <49FF3BE4.5040101@uchicago.edu> <1241465113.28803.1.camel@localhost> <49FF4235.20101@uchicago.edu> <1241465759.29191.1.camel@localhost> <49FF4626.7000000@uchicago.edu> <1241467316.30018.0.camel@localhost> <49FF49E4.6060707@uchicago.edu> <1241467660.30401.0.camel@localhost>
Message-ID: <49FF4B80.3080107@uchicago.edu>

So, I picked the ones that have the same timestamp as the coaster log

-rw-r--r-- 1 osgedu osg 45193 May 4 21:24 gram_job_mgr_22265.log
-rw-r--r-- 1 osgedu osg   583 May 4 21:24 coaster-bootstrap-1416098320.log
-rw-r--r-- 1 osgedu osg 45200 May 4 21:23 gram_job_mgr_18272.log
-rw-r--r-- 1 osgedu osg   584 May 4 21:23 coaster-bootstrap-01096461671.log

zhao

Ben Clifford wrote:
> Lots of gram logs there:
>
> ls gram_job_mgr_* | wc -l
> 797
From hategan at mcs.anl.gov Mon May 4 15:25:32 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 04 May 2009 15:25:32 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <49FF4AF2.6050400@uchicago.edu>
References: <49FF3BE4.5040101@uchicago.edu> <1241465113.28803.1.camel@localhost> <49FF4235.20101@uchicago.edu> <1241465759.29191.1.camel@localhost> <49FF4626.7000000@uchicago.edu> <1241467316.30018.0.camel@localhost> <49FF49E4.6060707@uchicago.edu> <1241467660.30401.0.camel@localhost> <49FF4AF2.6050400@uchicago.edu>
Message-ID: <1241468732.30713.0.camel@localhost>

On Mon, 2009-05-04 at 15:07 -0500, Zhao Zhang wrote:
> This one?
>
> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/gram_job_mgr_18272.log

Nope. I suggest copying them all, and I'll look for the right one.

From hategan at mcs.anl.gov Mon May 4 15:28:12 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 04 May 2009 15:28:12 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <49FF4B80.3080107@uchicago.edu>
References: <49FF3BE4.5040101@uchicago.edu> <1241465113.28803.1.camel@localhost> <49FF4235.20101@uchicago.edu> <1241465759.29191.1.camel@localhost> <49FF4626.7000000@uchicago.edu> <1241467316.30018.0.camel@localhost> <49FF49E4.6060707@uchicago.edu> <1241467660.30401.0.camel@localhost> <49FF4B80.3080107@uchicago.edu>
Message-ID: <1241468892.30713.4.camel@localhost>

On Mon, 2009-05-04 at 15:09 -0500, Zhao Zhang wrote:
> So, I picked the ones that have the same timestamp as the coaster log
>
> -rw-r--r-- 1 osgedu osg 45193 May 4 21:24 gram_job_mgr_22265.log
> -rw-r--r-- 1 osgedu osg   583 May 4 21:24 coaster-bootstrap-1416098320.log
> -rw-r--r-- 1 osgedu osg 45200 May 4 21:23 gram_job_mgr_18272.log
> -rw-r--r-- 1 osgedu osg   584 May 4 21:23 coaster-bootstrap-01096461671.log

No good. The bootstrap log is the last non-gram log that gets modified,
and then follows the gram log for the job that runs the coaster service.
You need a log close to that time, but not the last one.
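In other words, sort by modification time and read just around the bootstrap log. A sketch of that selection step, under the same assumptions as above (run remotely via the fork jobmanager; the filenames follow the ones quoted in this thread):

    # newest first: the coaster-bootstrap-*.log marks the service job, so the
    # gram log of interest is the one dated around it, but not the newest one
    globus-job-run osg-ce.grid.uj.ac.za /bin/sh -c \
        'ls -lt gram_job_mgr_*.log coaster-bootstrap-*.log | head -20'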
From zhaozhang at uchicago.edu Mon May 4 15:57:00 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 04 May 2009 15:57:00 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <1241468732.30713.0.camel@localhost>
References: <49FF3BE4.5040101@uchicago.edu> <1241465113.28803.1.camel@localhost> <49FF4235.20101@uchicago.edu> <1241465759.29191.1.camel@localhost> <49FF4626.7000000@uchicago.edu> <1241467316.30018.0.camel@localhost> <49FF49E4.6060707@uchicago.edu> <1241467660.30401.0.camel@localhost> <49FF4AF2.6050400@uchicago.edu> <1241468732.30713.0.camel@localhost>
Message-ID: <49FF569C.90104@uchicago.edu>

Hey, here are all of them:
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/uj :-)

zhao

Mihael Hategan wrote:
> Nope. I suggest copying them all, and I'll look for the right one.

From hategan at mcs.anl.gov Mon May 4 16:05:46 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 04 May 2009 16:05:46 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <49FF569C.90104@uchicago.edu>
References: <49FF3BE4.5040101@uchicago.edu> <1241465113.28803.1.camel@localhost> <49FF4235.20101@uchicago.edu> <1241465759.29191.1.camel@localhost> <49FF4626.7000000@uchicago.edu> <1241467316.30018.0.camel@localhost> <49FF49E4.6060707@uchicago.edu> <1241467660.30401.0.camel@localhost> <49FF4AF2.6050400@uchicago.edu> <1241468732.30713.0.camel@localhost> <49FF569C.90104@uchicago.edu>
Message-ID: <1241471146.31918.1.camel@localhost>

gram_job_mgr_10273.log:
...
Mon May 4 20:59:33 2009 JM_SCRIPT: qsub returned
Mon May 4 20:59:33 2009 JM_SCRIPT: qsub stderr qsub: Job exceeds queue
resource limits MSG=cannot satisfy queue max walltime requirement
...

Find out what queues are on that machine (qstat -q), and change
appropriately. That or set the proper coasterWorkerMaxwalltime.

On Mon, 2009-05-04 at 15:57 -0500, Zhao Zhang wrote:
> Hey, here are all of them:
> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/uj :-)
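Rather than hunting by timestamp, one could also copy the logs back (as above) and search for the qsub complaint directly; a small sketch using the error strings Mihael quotes above, run locally in the directory holding the copied logs:

    # which logs saw a qsub error at all?
    grep -l 'JM_SCRIPT: qsub stderr' gram_job_mgr_*.log

    # and what did qsub actually say?
    grep -h 'cannot satisfy queue max walltime' gram_job_mgr_*.log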
From zhaozhang at uchicago.edu Mon May 4 16:32:23 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 04 May 2009 16:32:23 -0500
Subject: [Swift-devel] Another issue regarding uj-pbs-gram2.xml
In-Reply-To: <1241471146.31918.1.camel@localhost>
References: <49FF3BE4.5040101@uchicago.edu> <1241465113.28803.1.camel@localhost> <49FF4235.20101@uchicago.edu> <1241465759.29191.1.camel@localhost> <49FF4626.7000000@uchicago.edu> <1241467316.30018.0.camel@localhost> <49FF49E4.6060707@uchicago.edu> <1241467660.30401.0.camel@localhost> <49FF4AF2.6050400@uchicago.edu> <1241468732.30713.0.camel@localhost> <49FF569C.90104@uchicago.edu> <1241471146.31918.1.camel@localhost>
Message-ID: <49FF5EE7.70002@uchicago.edu>

So I got the following info:

[zzhang at communicado uj]$ globus-job-run osg-ce.grid.uj.ac.za /usr/bin/qstat -q

server: gridvm

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
medium             --   02:00:00    --      --    0   0  50   E R
atlas              --      --    02:00:00   --    0   0  --   E R
gilda              --   02:00:00    --      --    0   0  --   E R
batch              --      --    01:00:00   --    0   0  --   E R
small              --   00:20:00    --      --    0   0  50   E R
default            --      --       --      --    0   0  --   E R
long               --   12:00:00    --      --    0   0  20   E R
verylong           --   72:00:00    --      --    0   0  10   E R
                                                ----- -----
                                                    0     0

Then I used the default queue, and now the job is running. Thanks!

zhao

Mihael Hategan wrote:
> Find out what queues are on that machine (qstat -q), and change
> appropriately. That or set the proper coasterWorkerMaxwalltime.
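For reference, both of Mihael's suggested fixes would live in the site's pool entry in sites.xml. This is a hedged sketch only: "queue" is the standard GRAM RSL attribute, coasterWorkerMaxwalltime is the key Mihael names above, and the values here are illustrative, chosen to fit under a queue's walltime limit:

    <profile namespace="globus" key="queue">batch</profile>
    <profile namespace="globus" key="coasterWorkerMaxwalltime">00:50:00</profile>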
From benc at hawaga.org.uk Tue May 5 04:42:49 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 5 May 2009 09:42:49 +0000 (GMT)
Subject: [Swift-devel] towards a 1.0 release
Message-ID: 

Mike has indicated that he would like a 1.0 release towards the end of
August.

I would like to determine:

i) is there consensus on what a 1.0 release is? (I suspect not)

ii) is a 1.0 release achievable by the above date (which I think
influences i)

iii) if so, what should happen between now and then

My personal opinions on what should happen if we are to have a 1.0
release follow:

I think for an August-release 1.0, major work which is not already
substantially in progress should not be started. Concretely, I think
that means that Mihael continues to work on coasters and I continue my
provenance work, and nothing major happens between now and 1.0.

There should be an intermediate (0.95) release around the end of June.
There absolutely should not be a rush to cram buggy features into this
release at the last minute.

Between 0.95 and 1.0, there should be a two-month period of heightened
bugfixing, testing, and documentation. Development of new features can
take place, but they must not land in the trunk before 1.0 - they should
be managed out-of-trunk in whatever way the developer in question
prefers. Development of such new features should not be allowed to
seriously detract from the 1.0 release process.

Any deadlines which are likely to require serious amounts of developer
time between now and 1.0 should be declared now; and surprise removal of
developers to work on non-1.0 work which could have been declared at
this stage but was not should not be permitted. For me, the only such
deadline is the provenance challenge workshop in mid-June.

From zhaozhang at uchicago.edu Tue May 5 13:07:41 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 05 May 2009 13:07:41 -0500
Subject: [Swift-devel] condor-G issue
Message-ID: <4A00806D.8080806@uchicago.edu>

Hi, All

I am running tests on grid sites with the condor-G provider, but failed
on AGLT2.xml. I am attaching the sites.xml and the standard output.
Again, as we did yesterday, all gram logs from that site can be found at
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/condor-g-log
Thanks.

zhao

[zzhang at communicado condor-g]$ cat AGLT2.xml
/atlas/data08/OSG/DATA/osg/tmp/AGLT2
grid
gt2 gate02.grid.umich.edu/jobmanager-Condor

[zzhang at communicado condor-g]$ globus-job-run gate02.grid.umich.edu /usr/bin/id
uid=789087(osg) gid=55682(osg) groups=55682(osg)
[zzhang at communicado condor-g]$ cd ..
[zzhang at communicado sites]$ ./run-site condor-g/AGLT2.xml
testing site configuration: condor-g/AGLT2.xml
Removing files from previous runs
Running test 061-cattwo at Tue May 5 12:45:43 CDT 2009
Swift svn swift-r2896 cog-r2392

RunID: 20090505-1245-czyer1rc
Progress:
Progress: Stage in:1
Progress: Submitting:1
Failed to transfer wrapper log from 061-cattwo-20090505-1245-czyer1rc/info/m on AGLT2
Progress: Stage in:1
Failed to transfer wrapper log from 061-cattwo-20090505-1245-czyer1rc/info/o on AGLT2
Progress: Stage in:1
Progress: Failed but can retry:1
Failed to transfer wrapper log from 061-cattwo-20090505-1245-czyer1rc/info/q on AGLT2
Progress: Failed:1
Execution failed:
    Exception in cat:
    Arguments: [061-cattwo.1.in, 061-cattwo.2.in]
    Host: AGLT2
    Directory: 061-cattwo-20090505-1245-czyer1rc/jobs/q/cat-qygamdaj
    stderr.txt:
    stdout.txt:
----
Caused by:
    Cannot submit job: Could not submit job (condor_submit reported an exit code of 1). no error output
SWIFT RETURN CODE NON-ZERO - test 061-cattwo

From zhaozhang at uchicago.edu Tue May 5 14:16:24 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 05 May 2009 14:16:24 -0500
Subject: [Swift-devel] Re: condor-G issue
In-Reply-To: <4A00806D.8080806@uchicago.edu>
References: <4A00806D.8080806@uchicago.edu>
Message-ID: <4A009088.7020704@uchicago.edu>

I got it; it is because of the Condor version on the CI network.

zhao

Zhao Zhang wrote:
> Hi, All
>
> I am running tests on grid sites with the condor-G provider, but failed
> on AGLT2.xml. I am attaching the sites.xml and the standard output.
> [...]
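For anyone chasing the same symptom: Zhao's conclusion points at the submit-side installation, and a quick, hedged sanity check might be the following (both are standard Condor tools; output is omitted since it is machine-specific):

    # on the submit host, confirm which Condor the provider is driving
    condor_version
    which condor_submit condor_q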
From benc at hawaga.org.uk Wed May 6 03:08:03 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 6 May 2009 08:08:03 +0000 (GMT)
Subject: [Swift-devel] swift with input and output sandboxes instead of a shared site filesystem
In-Reply-To: 
References: 
Message-ID: 

On Wed, 22 Apr 2009, Ben Clifford wrote:
> (to test if
> you bothered reading this, if you paste me the random string H14n$=N:t)Z
> you get a free beer)

Two weeks having passed, this offer is now closed.

From bugzilla-daemon at mcs.anl.gov Wed May 6 04:31:18 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 6 May 2009 04:31:18 -0500 (CDT)
Subject: [Swift-devel] [Bug 203] New: Jobs skipped in a run due to restart appear in status ticker as "Initializing" which is confusing
Message-ID: 

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=203

           Summary: Jobs skipped in a run due to restart appear in status
                    ticker as "Initializing" which is confusing
           Product: Swift
           Version: unspecified
          Platform: PC
        OS/Version: Mac OS
            Status: ASSIGNED
          Severity: normal
          Priority: P2
         Component: General
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk

Jobs skipped in a run due to restart appear in status ticker as
"Initializing" which is confusing. They should appear either as completed
or something like "completed in previous run".

From bugzilla-daemon at mcs.anl.gov Wed May 6 04:39:35 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 6 May 2009 04:39:35 -0500 (CDT)
Subject: [Swift-devel] [Bug 204] New: log-processing vs restarts
Message-ID: 

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=204

           Summary: log-processing vs restarts
           Product: Swift
           Version: unspecified
          Platform: PC
        OS/Version: Mac OS
            Status: ASSIGNED
          Severity: enhancement
          Priority: P2
         Component: Log processing and plotting
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk

Log processing of restarted runs only gives information about stuff that
happened in the restarted run, not over the whole multi-run composite run
(whatever that is called?). This is sometimes undesirable. It might be
useful to be able to (automatically?) get the information from all log
files associated with a composite run.

From bugzilla-daemon at mcs.anl.gov Wed May 6 04:49:55 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 6 May 2009 04:49:55 -0500 (CDT)
Subject: [Swift-devel] [Bug 205] New: move swift/osg tests away from OSGEDU vo
Message-ID: 

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=205

           Summary: move swift/osg tests away from OSGEDU vo
           Product: Swift
           Version: unspecified
          Platform: PC
        OS/Version: Mac OS
            Status: ASSIGNED
          Severity: normal
          Priority: P2
         Component: General
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk

OSG tests are designed to be run by people in the OSGEDU VO. There is
probably a better OSG VO. Tests should move there.
From bugzilla-daemon at mcs.anl.gov Wed May 6 04:53:25 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 6 May 2009 04:53:25 -0500 (CDT)
Subject: [Swift-devel] [Bug 206] New: make log processing produce SVG (and other formats?)
Message-ID: 

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=206

           Summary: make log processing produce SVG (and other formats?)
           Product: Swift
           Version: unspecified
          Platform: PC
        OS/Version: Mac OS
            Status: ASSIGNED
          Severity: enhancement
          Priority: P2
         Component: Log processing and plotting
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk
                CC: aespinosa at cs.uchicago.edu

Other graphical output formats than the default PNG can contain more
information, though are less widely supported. Log processing should be
able to optionally produce these formats.

From zhaozhang at uchicago.edu Wed May 6 11:09:03 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Wed, 06 May 2009 11:09:03 -0500
Subject: [Swift-devel] Test with condor-g provider
Message-ID: <4A01B61F.8010602@uchicago.edu>

Hi, All

Between yesterday and today I have run the condor-g test on ~10 OSG
sites, but none of them returned successfully. Here is a test case:
CIT_CMS_T2.xml

sites.xml definition:

[zzhang at tp-grid1 condor-g]$ cat CIT_CMS_T2.xml
/raid2/osg-data/osg/tmp/CIT_CMS_T2
grid
gt2 cit-gatekeeper.ultralight.org/jobmanager-Condor

[zzhang at tp-grid1 condor-g]$ globus-job-run cit-gatekeeper.ultralight.org /usr/bin/id
uid=536(osg) gid=536(osg) groups=536(osg)

[zzhang at tp-grid1 sites]$ ./run-site condor-g/CIT_CMS_T2.xml
testing site configuration: condor-g/CIT_CMS_T2.xml
Removing files from previous runs
Running test 061-cattwo at Wed May 6 11:01:49 CDT 2009
Swift svn swift-r2896 cog-r2392

RunID: 20090506-1101-gtx48v68
Progress:
Progress: Stage in:1
Progress: Submitting:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Submitted:1

Then it hangs there forever. From this output, can we tell that the job
has already been put into the Condor-G queue, and that the only problem
is that the job never goes through the queue? Thanks.

zhao

From benc at hawaga.org.uk Wed May 6 11:11:52 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 6 May 2009 16:11:52 +0000 (GMT)
Subject: [Swift-devel] Test with condor-g provider
In-Reply-To: <4A01B61F.8010602@uchicago.edu>
References: <4A01B61F.8010602@uchicago.edu>
Message-ID: 

On Wed, 6 May 2009, Zhao Zhang wrote:
> Then it hangs there forever. From this output, can we tell that the job
> has already been put into the Condor-G queue, and that the only problem
> is that the job never goes through the queue?

You can use the condor_q command to get information about jobs in the
condor queue on your submitting machine.
From benc at hawaga.org.uk Wed May 6 11:17:04 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 6 May 2009 16:17:04 +0000 (GMT)
Subject: [Swift-devel] Test with condor-g provider
In-Reply-To: 
References: <4A01B61F.8010602@uchicago.edu>
Message-ID: 

... like this:

$ condor_q 166 -long | grep HoldReason
HoldReason = "Globus error 93: the gatekeeper failed to find the requested service"
HoldReasonCode = 2
HoldReasonSubCode = 93

That error is usually because you have specified the jobmanager
incorrectly.

Try testing with the job manager as actually specified, eg

globus-job-run cit-gatekeeper.ultralight.org/jobmanager-Condor

You should see that you get the same error code.

If you change Condor to condor in that command line, then it should work:

$ globus-job-run cit-gatekeeper.ultralight.org/jobmanager-condor /bin/hostname
cithep231.ultralight.org

From zhaozhang at uchicago.edu Wed May 6 11:29:04 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Wed, 06 May 2009 11:29:04 -0500
Subject: [Swift-devel] Test with condor-g provider
In-Reply-To: 
References: <4A01B61F.8010602@uchicago.edu>
Message-ID: <4A01BAD0.5090307@uchicago.edu>

Thanks, Ben

Ben Clifford wrote:
> $ condor_q 166 -long | grep HoldReason
> HoldReason = "Globus error 93: the gatekeeper failed to find the requested
> service"
> HoldReasonCode = 2
> HoldReasonSubCode = 93

I tried this

[zzhang at tp-grid1 ~]$ globus-job-run cit-gatekeeper.ultralight.org /share/apps/condor/bin/condor_q 166 -long > condor_q

But I only get the following message

-- Submitter: cithep169.ultralight.org : <10.255.255.216:44050> : cithep169.ultralight.org

> That error is usually because you have specified the jobmanager
> incorrectly.
>
> Try testing with the job manager as actually specified, eg
> globus-job-run cit-gatekeeper.ultralight.org/jobmanager-Condor

I found this:

[zzhang at tp-grid1 sites]$ globus-job-run cit-gatekeeper.ultralight.org/jobmanager-Condor /usr/bin/id
GRAM Job submission failed because the gatekeeper failed to find the requested service (error code 93)

> You should see that you get the same error code.
>
> If you change Condor to condor in that command line, then it should work:
>
> $ globus-job-run cit-gatekeeper.ultralight.org/jobmanager-condor /bin/hostname
> cithep231.ultralight.org

Yep, this is working

[zzhang at tp-grid1 sites]$ globus-job-run cit-gatekeeper.ultralight.org/jobmanager-condor /usr/bin/id
uid=536(osg) gid=536(osg) groups=536(osg)

Then I made the change in the sites.xml definition, and jobs are running
through on this site. Cool.

zhao

From benc at hawaga.org.uk Wed May 6 11:32:16 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 6 May 2009 16:32:16 +0000 (GMT)
Subject: [Swift-devel] Test with condor-g provider
In-Reply-To: <4A01BAD0.5090307@uchicago.edu>
References: <4A01B61F.8010602@uchicago.edu> <4A01BAD0.5090307@uchicago.edu>
Message-ID: 

On Wed, 6 May 2009, Zhao Zhang wrote:
> I tried this
> [zzhang at tp-grid1 ~]$ globus-job-run cit-gatekeeper.ultralight.org
> /share/apps/condor/bin/condor_q 166 -long > condor_q
>
> But I only get the following message
> -- Submitter: cithep169.ultralight.org : <10.255.255.216:44050> :
> cithep169.ultralight.org

You need to run it on the submitting machine - the place that you are
running the Swift client.

In Condor-G mode, Swift submits jobs to the Condor installation on the
submitting machine, and Condor-G submits them from there to the remote
machine.

In your case, the jobs are getting stuck in the local condor queue.
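Putting Ben's replies together, the local-queue triage might look like this on the submit host; a sketch, with <cluster> standing in for the cluster id that condor_q prints (condor_q -analyze is a standard option, though not one used in this thread):

    condor_q                                  # is the job Idle (I) or Held (H)?
    condor_q -analyze <cluster>               # why an idle job is not matching
    condor_q <cluster> -long | grep -i hold   # HoldReason for a held job
    condor_rm <cluster>                       # clean up once diagnosed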
From zhaozhang at uchicago.edu Wed May 6 11:37:33 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Wed, 06 May 2009 11:37:33 -0500
Subject: [Swift-devel] Test with condor-g provider
In-Reply-To: 
References: <4A01B61F.8010602@uchicago.edu> <4A01BAD0.5090307@uchicago.edu>
Message-ID: <4A01BCCD.8020404@uchicago.edu>

Cool! This is really good to know. Another question is, how could a user
like me have told that this was the issue?

zhao

Ben Clifford wrote:
> You need to run it on the submitting machine - the place that you are
> running the Swift client.
>
> In Condor-G mode, Swift submits jobs to the Condor installation on the
> submitting machine, and Condor-G submits them from there to the remote
> machine.
>
> In your case, the jobs are getting stuck in the local condor queue.

From benc at hawaga.org.uk Wed May 6 11:39:35 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 6 May 2009 16:39:35 +0000 (GMT)
Subject: [Swift-devel] Test with condor-g provider
In-Reply-To: <4A01BCCD.8020404@uchicago.edu>
References: <4A01B61F.8010602@uchicago.edu> <4A01BAD0.5090307@uchicago.edu> <4A01BCCD.8020404@uchicago.edu>
Message-ID: 

On Wed, 6 May 2009, Zhao Zhang wrote:
> Cool! This is really good to know. Another question is, how could a user
> like me have told that this was the issue?

You have to learn how Condor-G works...

I think probably the condor provider should be changed so that instead of
jobs going on hold, they fail when things like this happen. I think that
would make the code behave more like other execution systems.

From rynge at renci.org Wed May 6 12:39:35 2009
From: rynge at renci.org (Mats Rynge)
Date: Wed, 06 May 2009 13:39:35 -0400
Subject: [Swift-devel] Test with condor-g provider
In-Reply-To: 
References: <4A01B61F.8010602@uchicago.edu> <4A01BAD0.5090307@uchicago.edu> <4A01BCCD.8020404@uchicago.edu>
Message-ID: <4A01CB57.6050405@renci.org>

Ben Clifford wrote:
> I think probably the condor provider should be changed so that instead of
> jobs going on hold, they fail when things like this happen. I think that
> would make the code behave more like other execution systems.

One idea would be to have swift detect held jobs and that would be
"failure". You can use periodic_hold for that. Example:

# GlobusStatus==16 is suspended
# GlobusStatus==32 is submitting
# JobStatus==1 is pending
# JobStatus==2 is running
periodic_hold = ( (GlobusStatus==16) || \
    ( (GlobusStatus==32) && (CurrentHosts==1) && \
      ((CurrentTime - EnteredCurrentStatus) > (20*60)) ) || \
    ( (JobStatus==1) && \
      ((CurrentTime - EnteredCurrentStatus) > (1*24*60)) ) || \
    ( (JobStatus==2) && \
      ((CurrentTime - EnteredCurrentStatus) > (10*24*60)) ) )
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

Or, you can use a similar expression with periodic_remove, but then you
have to figure out afterwards why a job exited the queue.
-- Mats Rynge Renaissance Computing Institute From benc at hawaga.org.uk Wed May 6 13:41:40 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 6 May 2009 18:41:40 +0000 (GMT) Subject: [Swift-devel] Test with condor-g provider In-Reply-To: <4A01CB57.6050405@renci.org> References: <4A01B61F.8010602@uchicago.edu> <4A01BAD0.5090307@uchicago.edu> <4A01BCCD.8020404@uchicago.edu> <4A01CB57.6050405@renci.org> Message-ID: On Wed, 6 May 2009, Mats Rynge wrote: > One idea would be to have swift detect held jobs and that would be "failure". > You can use periodic_hold for that. Example: The condor provider already pokes round in the queue to see which jobs are in completed states. I think adding H as one of those states would be pretty straightforward - regard H as "failed, reason is in HoldReason, now you need to condor_rm to tidy up". -- From yizhu at cs.uchicago.edu Wed May 6 19:24:10 2009 From: yizhu at cs.uchicago.edu (yizhu) Date: Wed, 06 May 2009 19:24:10 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> Message-ID: <4A022A2A.6070907@cs.uchicago.edu> Hi all I tried running swift-0.8 over Nimbus Cloud (PBS+NFS), and configured sites.xml and tc.data accordingly.[1] When i tried to run "$swift first.swift", it stuck on "Submitting:1" phase[2], (keep repeated showing "Progress:Submitting:1" and never return). Then I ssh to pbs_server to check the server log[3] and found that the job has been enqueued, ran, and successfully dequeued. I also check the queue status[4] when this job is running and found that the output_path is "/dev/null" somewhat i don't expected. ( The working directory of swift is "/home/jobrun". I think the problem might be the pbs_server failed to return the output to the correct path (btw. what the output_path suppose to be, the same work_directory of swift?), or anyone has a better idea? Many Thanks. -Yi -------------------------------------------------- [1]----sites.xml /home/jobrun ----tc.data nb_basecluster echo /bin/echo INSTALLED INTEL32::LINUX null nb_basecluster cat /bin/cat INSTALLED INTEL32::LINUX null nb_basecluster ls /bin/ls INSTALLED INTEL32::LINUX null nb_basecluster grep /bin/grep INSTALLED INTEL32::LINUX null nb_basecluster sort /bin/sort INSTALLED INTEL32::LINUX null nb_basecluster paste /bin/paste INSTALLED INTEL32::LINUX null --------------------------------------------------------- [2] yizhu at ubuntu:~/swift-0.8/examples/swift$ swift -d first.swift Recompilation suppressed. Using sites file: /home/yizhu/swift-0.8/bin/../etc/sites.xml Using tc.data: /home/yizhu/swift-0.8/bin/../etc/tc.data Setting resources to: {nb_basecluster=nb_basecluster} Swift 0.8 swift-r2448 cog-r2261 Swift 0.8 swift-r2448 cog-r2261 RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090506-1912-zqd8t5hg RunID: 20090506-1912-zqd8t5hg closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 type string value=Hello, world! 
dataset=unnamed SwiftScript value (closed) ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 path=$ VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 VALUE=Hello, world! closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 type string value=Hello, world! dataset=unnamed SwiftScript value (closed) ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 path=$ VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 VALUE=Hello, world! NEW id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 Found mapped data org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 type messagefile with no value at dataset=outfile (not closed).$ NEW id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 Progress: PROCEDURE thread=0 name=greeting PARAM thread=0 direction=output variable=t provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 type string value=hello.txt dataset=unnamed SwiftScript value (closed) ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 path=$ VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 VALUE=hello.txt closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 type string value=hello.txt dataset=unnamed SwiftScript value (closed) ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 path=$ VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 VALUE=hello.txt START thread=0 tr=echo Sorted: [nb_basecluster:0.000(1.000):0/1 overload: 0] Rand: 0.6583597597672994, sum: 1.0 Next contact: nb_basecluster:0.000(1.000):0/1 overload: 0 Progress: Initializing site shared directory:1 START host=nb_basecluster - Initializing shared directory multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, -0.01) Old score: 0.000, new score: -0.010 No global submit throttle set. Using default (100) Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status to Submitting Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status to Submitted Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status to Active Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status to Completed multiplyScore(nb_basecluster:-0.010(0.994):1/1 overload: 0, 0.01) Old score: -0.010, new score: 0.000 multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.1) Old score: 0.000, new score: 0.100 Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) Completed. Waiting: 0, Running: 0. 
Heap size: 75M, Heap free: 63M, Max heap: 720M multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, -0.2) Old score: 0.100, new score: -0.100 Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status to Submitting Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status to Submitted Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status to Active Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status to Completed multiplyScore(nb_basecluster:-0.100(0.943):1/1 overload: 0, 0.2) Old score: -0.100, new score: 0.100 multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, 0.1) Old score: 0.100, new score: 0.200 Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free: 60M, Max heap: 720M multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, -0.2) Old score: 0.200, new score: 0.000 Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status to Submitting Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status to Submitted Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status to Active Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status to Completed multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.2) Old score: 0.000, new score: 0.200 multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, 0.1) Old score: 0.200, new score: 0.300 Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, -0.01) Old score: 0.300, new score: 0.290 Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status to Submitting Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status to Submitted Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status to Active Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status to Completed multiplyScore(nb_basecluster:0.290(1.185):1/1 overload: 0, 0.01) Old score: 0.290, new score: 0.300 multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, 0.1) Old score: 0.300, new score: 0.400 Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, -0.01) Old score: 0.400, new score: 0.390 Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status to Submitting Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status to Submitted Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status to Active Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status to Completed multiplyScore(nb_basecluster:0.390(1.256):1/1 overload: 0, 0.01) Old score: 0.390, new score: 0.400 multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, 0.1) Old score: 0.400, new score: 0.500 Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) Completed. Waiting: 0, Running: 0. 
Heap size: 75M, Heap free: 59M, Max heap: 720M multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, -0.01) Old score: 0.500, new score: 0.490 Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status to Submitting Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status to Submitted Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status to Active Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status to Completed multiplyScore(nb_basecluster:0.490(1.332):1/1 overload: 0, 0.01) Old score: 0.490, new score: 0.500 multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, 0.1) Old score: 0.500, new score: 0.600 Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M END host=nb_basecluster - Done initializing shared directory THREAD_ASSOCIATION jobid=echo-kfecpfaj thread=0-1 host=nb_basecluster replicationGroup=jfecpfaj Progress: Stage in:1 START jobid=echo-kfecpfaj host=nb_basecluster - Initializing directory structure START path= dir=first-20090506-1912-zqd8t5hg/shared - Creating directory structure multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, -0.01) Old score: 0.600, new score: 0.590 Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status to Submitting Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status to Submitted Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status to Active Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status to Completed multiplyScore(nb_basecluster:0.590(1.411):1/1 overload: 0, 0.01) Old score: 0.590, new score: 0.600 multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, 0.1) Old score: 0.600, new score: 0.700 Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) Completed. Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M END jobid=echo-kfecpfaj - Done initializing directory structure START jobid=echo-kfecpfaj - Staging in files END jobid=echo-kfecpfaj - Staging in finished JOB_START jobid=echo-kfecpfaj tr=echo arguments=[Hello, world!] tmpdir=first-20090506-1912-zqd8t5hg/jobs/k/echo-kfecpfaj host=nb_basecluster jobid=echo-kfecpfaj task=Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) multiplyScore(nb_basecluster:0.700(1.503):1/1 overload: 0, -0.2) Old score: 0.700, new score: 0.500 Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) setting status to Submitting Submitting task: Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) 1241655180260 1241655180623 1241655181975 From wilde at mcs.anl.gov Wed May 6 19:51:54 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 06 May 2009 19:51:54 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A022A2A.6070907@cs.uchicago.edu> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> Message-ID: <4A0230AA.6040906@mcs.anl.gov> Yi, I assume you are testing from communicado to the tp-x001 virtual machine? 
Does that vm have GT2 GRAM running? If so, I would first verify that you can do basic globus-job-run and globus-url-copy to the vm, and then use a GT2 setting in sites.xml to talk to the vm from Swift. - Mike ps. I have not been following the list for the past week, so I need to review it and see what your plan was, and what Ben and others recommended. I'll try to catch up on that soon, and then we should meet at the CI and discuss. On 5/6/09 7:24 PM, yizhu wrote: > Hi all > > I tried running swift-0.8 over Nimbus Cloud (PBS+NFS), and configured > sites.xml and tc.data accordingly.[1] > > When i tried to run "$swift first.swift", it stuck on "Submitting:1" > phase[2], (keep repeated showing "Progress:Submitting:1" and never > return). > > Then I ssh to pbs_server to check the server log[3] and found that the > job has been enqueued, ran, and successfully dequeued. I also check the > queue status[4] when this job is running and found that the output_path > is "/dev/null" somewhat i don't expected. ( The working directory of > swift is "/home/jobrun". > > I think the problem might be the pbs_server failed to return the output > to the correct path (btw. what the output_path suppose to be, the same > work_directory of swift?), or anyone has a better idea? > > > Many Thanks. > > > -Yi > > > > -------------------------------------------------- > [1]----sites.xml > > > > url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService" > jobManager="PBS" provider="gt4" /> > /home/jobrun > > > ----tc.data > > nb_basecluster echo /bin/echo INSTALLED > INTEL32::LINUX null > nb_basecluster cat /bin/cat INSTALLED > INTEL32::LINUX null > nb_basecluster ls /bin/ls INSTALLED > INTEL32::LINUX null > nb_basecluster grep /bin/grep INSTALLED > INTEL32::LINUX null > nb_basecluster sort /bin/sort INSTALLED > INTEL32::LINUX null > nb_basecluster paste /bin/paste INSTALLED > INTEL32::LINUX null > > --------------------------------------------------------- > [2] > yizhu at ubuntu:~/swift-0.8/examples/swift$ swift -d first.swift > Recompilation suppressed. > Using sites file: /home/yizhu/swift-0.8/bin/../etc/sites.xml > Using tc.data: /home/yizhu/swift-0.8/bin/../etc/tc.data > Setting resources to: {nb_basecluster=nb_basecluster} > Swift 0.8 swift-r2448 cog-r2261 > > Swift 0.8 swift-r2448 cog-r2261 > > RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090506-1912-zqd8t5hg > RunID: 20090506-1912-zqd8t5hg > closed org.griphyn.vdl.mapping.RootDataNode identifier > tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 > type string value=Hello, world! dataset=unnamed SwiftScript value (closed) > ROOTPATH > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 > path=$ > VALUE > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 > VALUE=Hello, world! > closed org.griphyn.vdl.mapping.RootDataNode identifier > tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 > type string value=Hello, world! dataset=unnamed SwiftScript value (closed) > ROOTPATH > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 > path=$ > VALUE > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 > VALUE=Hello, world! 
> [... duplicate Swift debug log and qstat output trimmed; the same
> material is quoted in full in the message that follows ...]
>
> tp-x001
torque #
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From yizhu at cs.uchicago.edu  Wed May  6 22:10:38 2009
From: yizhu at cs.uchicago.edu (yizhu)
Date: Wed, 06 May 2009 22:10:38 -0500
Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster)
In-Reply-To: <4A022A2A.6070907@cs.uchicago.edu>
References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost>
	<4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov>
	<49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu>
	<49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu>
	<20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu>
	<20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch>
	<20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu>
	<20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu>
Message-ID: <4A02512E.8060809@cs.uchicago.edu>

Problem solved.

I migrated all of my work from my laptop to the ci.uchicago.edu server,
and now everything works.

-Yi

yizhu wrote:
> Hi all
>
> I tried running swift-0.8 over a Nimbus cloud (PBS+NFS), and configured
> sites.xml and tc.data accordingly.[1]
>
> When I tried to run "$swift first.swift", it got stuck in the
> "Submitting:1" phase[2] (it kept printing "Progress: Submitting:1" and
> never returned).
>
> Then I ssh'd to the pbs_server to check the server log[3] and found
> that the job had been enqueued, run, and successfully dequeued. I also
> checked the queue status[4] while the job was running and found that
> the Output_Path is "/dev/null", which I did not expect. (The working
> directory of Swift is "/home/jobrun".)
>
> I think the problem might be that the pbs_server failed to return the
> output to the correct path (by the way, what is the Output_Path
> supposed to be: the same as the work directory of Swift?), or does
> anyone have a better idea?
>
> Many Thanks.
>
> -Yi
>
> --------------------------------------------------
> [1]----sites.xml
>
> <config>
>   <pool handle="nb_basecluster">
>     <execution
> url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService"
>     jobManager="PBS" provider="gt4" />
>     <workdirectory>/home/jobrun</workdirectory>
>   </pool>
> </config>
>
> ----tc.data
>
> nb_basecluster  echo   /bin/echo   INSTALLED  INTEL32::LINUX  null
> nb_basecluster  cat    /bin/cat    INSTALLED  INTEL32::LINUX  null
> nb_basecluster  ls     /bin/ls     INSTALLED  INTEL32::LINUX  null
> nb_basecluster  grep   /bin/grep   INSTALLED  INTEL32::LINUX  null
> nb_basecluster  sort   /bin/sort   INSTALLED  INTEL32::LINUX  null
> nb_basecluster  paste  /bin/paste  INSTALLED  INTEL32::LINUX  null
>
> ---------------------------------------------------------
> [2]
> yizhu at ubuntu:~/swift-0.8/examples/swift$ swift -d first.swift
> Recompilation suppressed.
> Using sites file: /home/yizhu/swift-0.8/bin/../etc/sites.xml
> Using tc.data: /home/yizhu/swift-0.8/bin/../etc/tc.data
> Setting resources to: {nb_basecluster=nb_basecluster}
> Swift 0.8 swift-r2448 cog-r2261
>
> Swift 0.8 swift-r2448 cog-r2261
>
> RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090506-1912-zqd8t5hg
> RunID: 20090506-1912-zqd8t5hg
> closed org.griphyn.vdl.mapping.RootDataNode identifier
> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001
> type string value=Hello, world!
dataset=unnamed SwiftScript value (closed) > ROOTPATH > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 > path=$ > VALUE > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 > VALUE=Hello, world! > closed org.griphyn.vdl.mapping.RootDataNode identifier > tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 > type string value=Hello, world! dataset=unnamed SwiftScript value (closed) > ROOTPATH > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 > path=$ > VALUE > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 > VALUE=Hello, world! > NEW > id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 > > Found mapped data org.griphyn.vdl.mapping.RootDataNode identifier > tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 > type messagefile with no value at dataset=outfile (not closed).$ > NEW > id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 > > Progress: > PROCEDURE thread=0 name=greeting > PARAM thread=0 direction=output variable=t > provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 > > closed org.griphyn.vdl.mapping.RootDataNode identifier > tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 > type string value=hello.txt dataset=unnamed SwiftScript value (closed) > ROOTPATH > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 > path=$ > VALUE > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 > VALUE=hello.txt > closed org.griphyn.vdl.mapping.RootDataNode identifier > tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 > type string value=hello.txt dataset=unnamed SwiftScript value (closed) > ROOTPATH > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 > path=$ > VALUE > dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 > VALUE=hello.txt > START thread=0 tr=echo > Sorted: [nb_basecluster:0.000(1.000):0/1 overload: 0] > Rand: 0.6583597597672994, sum: 1.0 > Next contact: nb_basecluster:0.000(1.000):0/1 overload: 0 > Progress: Initializing site shared directory:1 > START host=nb_basecluster - Initializing shared directory > multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, -0.01) > Old score: 0.000, new score: -0.010 > No global submit throttle set. Using default (100) > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status > to Submitting > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status > to Submitted > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status > to Active > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting status > to Completed > multiplyScore(nb_basecluster:-0.010(0.994):1/1 overload: 0, 0.01) > Old score: -0.010, new score: 0.000 > multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.1) > Old score: 0.000, new score: 0.100 > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) Completed. > Waiting: 0, Running: 0. 
Heap size: 75M, Heap free: 63M, Max heap: 720M > multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, -0.2) > Old score: 0.100, new score: -0.100 > Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status > to Submitting > Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status > to Submitted > Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status > to Active > Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting status > to Completed > multiplyScore(nb_basecluster:-0.100(0.943):1/1 overload: 0, 0.2) > Old score: -0.100, new score: 0.100 > multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, 0.1) > Old score: 0.100, new score: 0.200 > Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) Completed. > Waiting: 0, Running: 0. Heap size: 75M, Heap free: 60M, Max heap: 720M > multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, -0.2) > Old score: 0.200, new score: 0.000 > Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status > to Submitting > Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status > to Submitted > Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status > to Active > Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting status > to Completed > multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.2) > Old score: 0.000, new score: 0.200 > multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, 0.1) > Old score: 0.200, new score: 0.300 > Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) Completed. > Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M > multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, -0.01) > Old score: 0.300, new score: 0.290 > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status > to Submitting > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status > to Submitted > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status > to Active > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting status > to Completed > multiplyScore(nb_basecluster:0.290(1.185):1/1 overload: 0, 0.01) > Old score: 0.290, new score: 0.300 > multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, 0.1) > Old score: 0.300, new score: 0.400 > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) Completed. > Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M > multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, -0.01) > Old score: 0.400, new score: 0.390 > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status > to Submitting > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status > to Submitted > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status > to Active > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting status > to Completed > multiplyScore(nb_basecluster:0.390(1.256):1/1 overload: 0, 0.01) > Old score: 0.390, new score: 0.400 > multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, 0.1) > Old score: 0.400, new score: 0.500 > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) Completed. > Waiting: 0, Running: 0. 
Heap size: 75M, Heap free: 59M, Max heap: 720M > multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, -0.01) > Old score: 0.500, new score: 0.490 > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status > to Submitting > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status > to Submitted > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status > to Active > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting status > to Completed > multiplyScore(nb_basecluster:0.490(1.332):1/1 overload: 0, 0.01) > Old score: 0.490, new score: 0.500 > multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, 0.1) > Old score: 0.500, new score: 0.600 > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) Completed. > Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M > END host=nb_basecluster - Done initializing shared directory > THREAD_ASSOCIATION jobid=echo-kfecpfaj thread=0-1 host=nb_basecluster > replicationGroup=jfecpfaj > Progress: Stage in:1 > START jobid=echo-kfecpfaj host=nb_basecluster - Initializing directory > structure > START path= dir=first-20090506-1912-zqd8t5hg/shared - Creating directory > structure > multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, -0.01) > Old score: 0.600, new score: 0.590 > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status > to Submitting > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status > to Submitted > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status > to Active > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting status > to Completed > multiplyScore(nb_basecluster:0.590(1.411):1/1 overload: 0, 0.01) > Old score: 0.590, new score: 0.600 > multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, 0.1) > Old score: 0.600, new score: 0.700 > Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) Completed. > Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M > END jobid=echo-kfecpfaj - Done initializing directory structure > START jobid=echo-kfecpfaj - Staging in files > END jobid=echo-kfecpfaj - Staging in finished > JOB_START jobid=echo-kfecpfaj tr=echo arguments=[Hello, world!] 
> tmpdir=first-20090506-1912-zqd8t5hg/jobs/k/echo-kfecpfaj > host=nb_basecluster > jobid=echo-kfecpfaj task=Task(type=JOB_SUBMISSION, > identity=urn:0-1-1241655174950) > multiplyScore(nb_basecluster:0.700(1.503):1/1 overload: 0, -0.2) > Old score: 0.700, new score: 0.500 > Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) setting status > to Submitting > Submitting task: Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) > 1241655180260 > 1241655180623 > 1241655181975 Task submitted: Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) > Progress: Submitting:1 > > Progress: Submitting:1 > > Progress: Submitting:1 > Progress: Submitting:1 > Progress: Submitting:1 > Progress: Submitting:1 > ^C > yizhu at ubuntu:~/swift-0.8/examples/swift$ > > -------------------------------------------------------------- > [3] see attachment > ------------------------------------------------------------- > [4] > tp-x001 torque # qstat -f > Job Id: 3.tp-x001.ci.uchicago.edu > Job_Name = STDIN > Job_Owner = jobrun at tp-x001.ci.uchicago.edu > job_state = R > queue = batch > server = tp-x001.ci.uchicago.edu > Checkpoint = u > ctime = Wed May 6 19:22:10 2009 > Error_Path = tp-x001.ci.uchicago.edu:/dev/null > exec_host = tp-x002/0 > Hold_Types = n > Join_Path = n > Keep_Files = n > Mail_Points = n > mtime = Wed May 6 19:22:10 2009 > Output_Path = tp-x001.ci.uchicago.edu:/dev/null > Priority = 0 > qtime = Wed May 6 19:22:10 2009 > Rerunable = True > Resource_List.neednodes = 1 > Resource_List.nodect = 1 > Resource_List.nodes = 1 > Shell_Path_List = /bin/sh > substate = 40 > Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun, > PBS_O_PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sb > in:/opt/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=tp-x001.ci.uchicago.edu, > PBS_O_HOST=tp-x001.ci.uchicago.edu, > PBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54, > PBS_O_QUEUE=batch > euser = jobrun > egroup = users > hashname = 3.tp-x001.c > queue_rank = 2 > queue_type = E > comment = Job started on Wed May 06 at 19:22 > etime = Wed May 6 19:22:10 2009 > > tp-x001 torque # > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Wed May 6 22:48:44 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 06 May 2009 22:48:44 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A02512E.8060809@cs.uchicago.edu> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> Message-ID: <4A025A1C.4030505@mcs.anl.gov> Very good! Now, what kind of tests can you do next? Can you exercise the cluster with an interesting workflow? How large of a cluster can you assemble in a Nimbus workspace ? Can you aggregate VM's from a few different physical clusters into one Nimbus workspace? What's the largest cluster you can assemble with Nimbus? 
Can we try putting Matlab on a Nimbus VM and then running it at an
interesting scale?

Do we have any free allocations of EC2 to enable testing of this at Amazon?

- Mike

On 5/6/09 10:10 PM, yizhu wrote:
> Problem solved.
>
> I migrated all of my work from my laptop to the ci.uchicago.edu server,
> and now everything works.
>
> -Yi
>
> yizhu wrote:
>> [... original report, configuration, and debug log quoted in full
>> above ...]

From benc at hawaga.org.uk  Thu May  7 02:21:20 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 7 May 2009 07:21:20 +0000 (GMT)
Subject: [Swift-devel] Re: Swift-issues (PBS+NFS Cluster)
In-Reply-To: <4A022A2A.6070907@cs.uchicago.edu>
References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost>
	<4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov>
	<49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu>
	<49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu>
	<20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu>
	<20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch>
	<20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu>
	<20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu>
Message-ID: 

> <execution
> url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService"
> jobManager="PBS" provider="gt4" />

I was expecting you to be running directly from Swift to PBS, not using
GRAM in between.

Please check that you can run GRAM4 jobs using globusrun-ws, something
like:

globusrun-ws -submit -F tp-x001.ci.uchicago.edu -Ft PBS -c /bin/hostname

This command should run through several stages and return you to the
command prompt. Please paste the output of that here.

--

From benc at hawaga.org.uk  Thu May  7 02:28:14 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 7 May 2009 07:28:14 +0000 (GMT)
Subject: [Swift-devel] Re: Swift-issues (PBS+NFS Cluster)
In-Reply-To: 
References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost>
	<4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov>
	<49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu>
	<49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu>
	<20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu>
	<20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch>
	<20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu>
	<20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu>
Message-ID: 

On Thu, 7 May 2009, Ben Clifford wrote:

> I was expecting you to be running directly from Swift to PBS, not using
> GRAM in between.

Separately from running with GRAM, you can submit from Swift directly to
PBS from any of your cluster nodes on which you can use PBS's qsub/qstat
commands. That will remove one layer of complexity from your setup,
although it will require you to run Swift from within your cluster rather
than from some other host.

--
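[For reference, a direct-to-PBS site definition of the sort described
above might look like the sketch below. It is illustrative only and not
from the thread: it assumes the plain "pbs" execution provider and local
gridftp syntax of Swift 0.8's sites.xml, and it reuses the handle and
work directory from the earlier messages.]

<config>
  <pool handle="nb_basecluster">
    <execution provider="pbs" url="none" />
    <gridftp url="local://localhost" />
    <workdirectory>/home/jobrun</workdirectory>
  </pool>
</config>

[With a definition like this, swift must be invoked on a host where qsub
and qstat already work, as noted above.]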
From benc at hawaga.org.uk  Thu May  7 02:35:32 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 7 May 2009 07:35:32 +0000 (GMT)
Subject: [Swift-devel] old tests
Message-ID: 

There are old tests linked from the Swift downloads page. They appear to
have been broken since 2008-09:

http://www.ci.uchicago.edu/swift/tests/tests-2008-09-11.html

If no one is actively using them, I propose they go away.

--

From wilde at mcs.anl.gov  Thu May  7 10:07:22 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 07 May 2009 10:07:22 -0500
Subject: [Swift-devel] towards a 1.0 release
In-Reply-To: 
References: 
Message-ID: <4A02F92A.6010104@mcs.anl.gov>

Some preliminary comments:

- A 1.0 focus on quality, as you state below, is good.

- The features I'm most eager to see in 1.0 relate to better site/tc
configuration, error messages, and status monitoring/reporting.

- I'm interested in provenance features that are useful to users, with
the challenge as a by-product, not an end goal.

These are all to be defined. If working on the challenge precludes these,
I will ask that we drop out of that, again.

Mike

On 5/5/09 4:42 AM, Ben Clifford wrote:
> Mike has indicated that he would like a 1.0 release towards the end of
> August.
>
> I would like to determine:
>
> i) is there consensus on what a 1.0 release is? (I suspect not)
> ii) is a 1.0 release achievable by the above date (which I think
> influences i)
> iii) if so, what should happen between now and then
>
> My personal opinions on what should happen if we are to have a 1.0
> release follow:
>
> I think that for an August 1.0 release, major work which is not already
> substantially in progress should not be started. Concretely, I think
> that means that Mihael continues to work on coasters and I continue my
> provenance work, and nothing major happens between now and 1.0.
>
> There should be an intermediate (0.95) release around the end of June.
> There absolutely should not be a rush to cram buggy features into this
> release at the last minute.
>
> Between 0.95 and 1.0, there should be a two-month period of heightened
> bugfixing, testing, and documentation.
>
> Development of new features can take place, but they must not land in
> the trunk before 1.0 - they should be managed out-of-trunk in whatever
> way the developer in question prefers. Development of such new features
> should not be allowed to seriously detract from the 1.0 release process.
>
> Any deadlines which are likely to require serious amounts of developer
> time between now and 1.0 should be declared now; surprise removal of
> developers to work on non-1.0 work which could have been declared at
> this stage but wasn't should not be permitted. For me, the only such
> deadline is the provenance challenge workshop in mid-June.

From benc at hawaga.org.uk  Thu May  7 10:15:09 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 7 May 2009 15:15:09 +0000 (GMT)
Subject: [Swift-devel] towards a 1.0 release
In-Reply-To: <4A02F92A.6010104@mcs.anl.gov>
References: <4A02F92A.6010104@mcs.anl.gov>
Message-ID: 

On Thu, 7 May 2009, Michael Wilde wrote:

> These are all to be defined. If working on the challenge precludes
> these, I will ask that we drop out of that, again.

Work on the challenge necessarily ends, for the most part, with the
challenge workshop in mid-June. At the moment, I think it's very well
aligned with making a more useful database. I'm slowly committing pieces
of that work into trunk as I tidy them up.
--

From yizhu at cs.uchicago.edu  Thu May  7 11:39:40 2009
From: yizhu at cs.uchicago.edu (Yi Zhu)
Date: Thu, 07 May 2009 11:39:40 -0500
Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster)
In-Reply-To: <4A025A1C.4030505@mcs.anl.gov>
References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost>
	<4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov>
	<49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu>
	<49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu>
	<20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu>
	<20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch>
	<20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu>
	<20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu>
	<4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov>
Message-ID: <4A030ECC.2030303@cs.uchicago.edu>

Michael Wilde wrote:
> Very good!
>
> Now, what kind of tests can you do next?

Next, I will try to get Swift running on Amazon EC2.

> Can you exercise the cluster with an interesting workflow?

Yes. Is there any complex sample or tool I can use (rather than
first.swift) to test Swift performance? Is there any benchmark available
I can compare with?

> How large of a cluster can you assemble in a Nimbus workspace?

Since the VM image I use to test Swift is based on an NFS shared file
system, the performance may not be satisfactory if we have a large-scale
cluster. After I get Swift running on Amazon EC2, I will try to make a
dedicated VM image using GPFS or any other shared file system you
recommend.

> Can you aggregate VM's from a few different physical clusters into one
> Nimbus workspace?

I don't think so. Tim may comment on it.

> What's the largest cluster you can assemble with Nimbus?

I am not quite sure; I will do some tests on it soon. Since it is an
EC2-like cloud, it should easily be configured as a cluster with
hundreds of nodes. Tim may comment on it.

> Can we try putting Matlab on a Nimbus VM and then running it at an
> interesting scale?

Yes, I think so.

> Do we have any free allocations of EC2 to enable testing of this at
> Amazon?

Yes, Tim already made an AMI on Amazon EC2; my next step is to get Swift
running on it. I will try to get this stuff to work by tomorrow.

> - Mike
>
> On 5/6/09 10:10 PM, yizhu wrote:
>> [... earlier messages and debug log quoted in full above ...]
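[On the question of a more complex sample than first.swift: one cheap way
to exercise scheduling and staging before a real benchmark is chosen is
to fan the same app out with a foreach loop. The sketch below is
illustrative only and untested here; it assumes the Swift 0.8 tutorial's
messagefile type and echo app, and the mapper parameters and loop bound
are placeholders.]

type messagefile;

(messagefile t) greeting(string s) {
    app {
        echo s stdout=@filename(t);
    }
}

messagefile outfile[] <simple_mapper; prefix="hello.", suffix=".txt">;

foreach i in [1:100] {
    outfile[i] = greeting(@strcat("message ", i));
}

[Each iteration becomes an independent job, which gives the scheduler and
the PBS queue something more interesting to chew on than a single echo.]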
From zhaozhang at uchicago.edu  Thu May  7 11:52:47 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 07 May 2009 11:52:47 -0500
Subject: [Swift-devel] Condor-G test on UJ-OSG site
Message-ID: <4A0311DF.9050003@uchicago.edu>

Hi,

I have a Condor-G test hanging on the UJ-OSG site. Does the following
message mean that there is no Condor running on the UJ-OSG site?

zhao

sites.xml definition
[zzhang at tp-grid1 sites]$ cat condor-g/UJ-OSG.xml
<config>
  <pool handle="UJ-OSG">
    <execution provider="condor" url="none" />
    <workdirectory>/nfs/data/osg_store/osg/tmp/UJ-OSG</workdirectory>
    <profile namespace="globus" key="jobType">grid</profile>
    <profile namespace="globus" key="gridResource">gt2 osg-ce.grid.uj.ac.za/jobmanager-condor</profile>
  </pool>
</config>

Globus-Job-Run:
[zzhang at tp-grid1 sites]$ globus-job-run osg-ce.grid.uj.ac.za /usr/bin/id
uid=640(osgedu) gid=2000(osg) groups=2000(osg)

Condor-Hold-Reason
[zzhang at tp-grid1 ~]$ condor_q 166 -long | grep Hold
PeriodicHold = FALSE
OnExitHold = FALSE
HoldReasonCode = 2
HoldReasonSubCode = 93
NumSystemHolds = 1
HoldReason = UNDEFINED

Stdout:
[zzhang at tp-grid1 sites]$ ./run-site condor-g/UJ-OSG.xml
testing site configuration: condor-g/UJ-OSG.xml
Removing files from previous runs
Running test 061-cattwo at Thu May 7 11:37:09 CDT 2009
Swift svn swift-r2896 cog-r2392

RunID: 20090507-1137-n9b4lkje
Progress:
Progress:  Stage in:1
Progress:  Submitting:1
Progress:  Submitted:1
Progress:  Submitted:1
Progress:  Submitted:1
Progress:  Submitted:1
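[A note on the probe above, offered as a suggestion rather than taken
from the thread: globus-job-run without an explicit jobmanager usually
hits the site's default jobmanager, typically fork, so the successful
/usr/bin/id run does not by itself show that Condor is accepting jobs.
Probing the Condor jobmanager named in gridResource would be more
telling:

globus-job-run osg-ce.grid.uj.ac.za/jobmanager-condor /usr/bin/id

If that hangs or fails while the default probe succeeds, the problem is
on the Condor side of the gatekeeper rather than with GRAM as a whole.]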
How many CPU hours (or VM hours) do you have available in EC2 through Nimbus? Can we use some of these? >> How large of a cluster can you assemble in a Nimbus workspace? > > Since the vm-image I use to test 'swift' is based on an NFS shared file > system, the performance may not be satisfactory if we have a > large-scale cluster. After I get Swift running on Amazon EC2, I will > try to make a dedicated vm-image using GPFS or any other shared file > system you recommend. I suggest we also look at doing runs that depend little on the shared filesystem. There is a Swift option to put the per-job working dir on a local disk. Further work can reduce shared disk usage even more (but that's "research" work). With OOPS, there is very little input data (all of the large input data is part of the app install at the moment). Output data is modest, about 1MB per job, or so. Allan and Zhao are experimenting with ways to batch that up and transfer it back efficiently. That is worth pursuing as a longer term goal. How well / how fast does gridftp work, moving data from a Nimbus VM (at CI, EC2, and elsewhere) back to a CI filesystem? That would be good to test. I wonder if the Swift engine could pull the data back from a GridFTP server running on the worker node? All of the above is for longer-term research. >> Can you aggregate VM's from a few different physical clusters into one >> Nimbus workspace? > > I don't think so. Tim may comment on it. > > >> What's the largest cluster you can assemble with Nimbus? > > I am not quite sure; I will do some tests on it soon. Since it is an > EC2-like cloud, it should easily be configured as a cluster with > hundreds of nodes. Tim may comment on it. > > >> Can we try putting Matlab on a Nimbus VM and then running it at an >> interesting scale? > > Yes, I think so. I will need to locate the pointers we were trying to post on how to get started with Swift and Matlab. >> Do we have any free allocations of EC2 to enable testing of this at >> Amazon? > > Yes, Tim already made an AMI on Amazon EC2; my next step is to make 'swift' > run on it. I will try to get this stuff to work by tomorrow. Awesome. - Mike > > >> >> - Mike >> >> >> >> >> >> >> >> On 5/6/09 10:10 PM, yizhu wrote: >>> problem solved. >>> >>> I migrated all of my work from my laptop to the ci.uchicago.edu server >>> and then everything works. >>> >>> >>> -Yi >>> yizhu wrote: >>>> Hi all >>>> >>>> I tried running swift-0.8 over a Nimbus Cloud (PBS+NFS), and >>>> configured sites.xml and tc.data accordingly.[1] >>>> >>>> When I tried to run "$swift first.swift", it got stuck in the >>>> "Submitting:1" phase[2] (it kept showing >>>> "Progress:Submitting:1" and never returned). >>>> >>>> Then I sshed to the pbs_server to check the server log[3] and found that >>>> the job had been enqueued, run, and successfully dequeued. I also >>>> checked the queue status[4] while this job was running and found that >>>> the output_path is "/dev/null", which I didn't expect. (The >>>> working directory of swift is "/home/jobrun".) >>>> >>>> I think the problem might be that the pbs_server failed to return the >>>> output to the correct path (btw, what is the output_path supposed to be, >>>> the same work_directory as swift?), or does anyone have a better idea? >>>> >>>> >>>> Many Thanks.
>>>> >>>> >>>> -Yi >>>> >>>> >>>> >>>> -------------------------------------------------- >>>> [1]----sites.xml >>>> >>>> >>>> >>>> >>> url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService" >>>> jobManager="PBS" provider="gt4" /> >>>> /home/jobrun >>>> >>>> >>>> ----tc.data >>>> >>>> nb_basecluster echo /bin/echo INSTALLED >>>> INTEL32::LINUX null >>>> nb_basecluster cat /bin/cat INSTALLED >>>> INTEL32::LINUX null >>>> nb_basecluster ls /bin/ls INSTALLED >>>> INTEL32::LINUX null >>>> nb_basecluster grep /bin/grep INSTALLED >>>> INTEL32::LINUX null >>>> nb_basecluster sort /bin/sort INSTALLED >>>> INTEL32::LINUX null >>>> nb_basecluster paste /bin/paste INSTALLED >>>> INTEL32::LINUX null >>>> >>>> --------------------------------------------------------- >>>> [2] >>>> yizhu at ubuntu:~/swift-0.8/examples/swift$ swift -d first.swift >>>> Recompilation suppressed. >>>> Using sites file: /home/yizhu/swift-0.8/bin/../etc/sites.xml >>>> Using tc.data: /home/yizhu/swift-0.8/bin/../etc/tc.data >>>> Setting resources to: {nb_basecluster=nb_basecluster} >>>> Swift 0.8 swift-r2448 cog-r2261 >>>> >>>> Swift 0.8 swift-r2448 cog-r2261 >>>> >>>> RUNID id=tag:benc at ci.uchicago.edu,2007:swift:run:20090506-1912-zqd8t5hg >>>> RunID: 20090506-1912-zqd8t5hg >>>> closed org.griphyn.vdl.mapping.RootDataNode identifier >>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 >>>> type string value=Hello, world! dataset=unnamed SwiftScript value >>>> (closed) >>>> ROOTPATH >>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 >>>> path=$ >>>> VALUE >>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 >>>> VALUE=Hello, world! >>>> closed org.griphyn.vdl.mapping.RootDataNode identifier >>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 >>>> type string value=Hello, world! dataset=unnamed SwiftScript value >>>> (closed) >>>> ROOTPATH >>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 >>>> path=$ >>>> VALUE >>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 >>>> VALUE=Hello, world! 
>>>> NEW >>>> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000001 >>>> >>>> Found mapped data org.griphyn.vdl.mapping.RootDataNode identifier >>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 >>>> type messagefile with no value at dataset=outfile (not closed).$ >>>> NEW >>>> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 >>>> >>>> Progress: >>>> PROCEDURE thread=0 name=greeting >>>> PARAM thread=0 direction=output variable=t >>>> provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000002 >>>> >>>> closed org.griphyn.vdl.mapping.RootDataNode identifier >>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 >>>> type string value=hello.txt dataset=unnamed SwiftScript value (closed) >>>> ROOTPATH >>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 >>>> path=$ >>>> VALUE >>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 >>>> VALUE=hello.txt >>>> closed org.griphyn.vdl.mapping.RootDataNode identifier >>>> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 >>>> type string value=hello.txt dataset=unnamed SwiftScript value (closed) >>>> ROOTPATH >>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 >>>> path=$ >>>> VALUE >>>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090506-1912-lxp69uu2:720000000003 >>>> VALUE=hello.txt >>>> START thread=0 tr=echo >>>> Sorted: [nb_basecluster:0.000(1.000):0/1 overload: 0] >>>> Rand: 0.6583597597672994, sum: 1.0 >>>> Next contact: nb_basecluster:0.000(1.000):0/1 overload: 0 >>>> Progress: Initializing site shared directory:1 >>>> START host=nb_basecluster - Initializing shared directory >>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, -0.01) >>>> Old score: 0.000, new score: -0.010 >>>> No global submit throttle set. Using default (100) >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting >>>> status to Submitting >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting >>>> status to Submitted >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting >>>> status to Active >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) setting >>>> status to Completed >>>> multiplyScore(nb_basecluster:-0.010(0.994):1/1 overload: 0, 0.01) >>>> Old score: -0.010, new score: 0.000 >>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.1) >>>> Old score: 0.000, new score: 0.100 >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174935) Completed. >>>> Waiting: 0, Running: 0. 
Heap size: 75M, Heap free: 63M, Max heap: 720M >>>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, -0.2) >>>> Old score: 0.100, new score: -0.100 >>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting >>>> status to Submitting >>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting >>>> status to Submitted >>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting >>>> status to Active >>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) setting >>>> status to Completed >>>> multiplyScore(nb_basecluster:-0.100(0.943):1/1 overload: 0, 0.2) >>>> Old score: -0.100, new score: 0.100 >>>> multiplyScore(nb_basecluster:0.100(1.060):1/1 overload: 0, 0.1) >>>> Old score: 0.100, new score: 0.200 >>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174938) Completed. >>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 60M, Max heap: 720M >>>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, -0.2) >>>> Old score: 0.200, new score: 0.000 >>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting >>>> status to Submitting >>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting >>>> status to Submitted >>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting >>>> status to Active >>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) setting >>>> status to Completed >>>> multiplyScore(nb_basecluster:0.000(1.000):1/1 overload: 0, 0.2) >>>> Old score: 0.000, new score: 0.200 >>>> multiplyScore(nb_basecluster:0.200(1.124):1/1 overload: 0, 0.1) >>>> Old score: 0.200, new score: 0.300 >>>> Task(type=FILE_TRANSFER, identity=urn:0-1-1241655174940) Completed. >>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M >>>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, -0.01) >>>> Old score: 0.300, new score: 0.290 >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting >>>> status to Submitting >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting >>>> status to Submitted >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting >>>> status to Active >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) setting >>>> status to Completed >>>> multiplyScore(nb_basecluster:0.290(1.185):1/1 overload: 0, 0.01) >>>> Old score: 0.290, new score: 0.300 >>>> multiplyScore(nb_basecluster:0.300(1.192):1/1 overload: 0, 0.1) >>>> Old score: 0.300, new score: 0.400 >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174942) Completed. >>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M >>>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, -0.01) >>>> Old score: 0.400, new score: 0.390 >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting >>>> status to Submitting >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting >>>> status to Submitted >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting >>>> status to Active >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) setting >>>> status to Completed >>>> multiplyScore(nb_basecluster:0.390(1.256):1/1 overload: 0, 0.01) >>>> Old score: 0.390, new score: 0.400 >>>> multiplyScore(nb_basecluster:0.400(1.264):1/1 overload: 0, 0.1) >>>> Old score: 0.400, new score: 0.500 >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174944) Completed. >>>> Waiting: 0, Running: 0. 
Heap size: 75M, Heap free: 59M, Max heap: 720M >>>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, -0.01) >>>> Old score: 0.500, new score: 0.490 >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting >>>> status to Submitting >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting >>>> status to Submitted >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting >>>> status to Active >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) setting >>>> status to Completed >>>> multiplyScore(nb_basecluster:0.490(1.332):1/1 overload: 0, 0.01) >>>> Old score: 0.490, new score: 0.500 >>>> multiplyScore(nb_basecluster:0.500(1.339):1/1 overload: 0, 0.1) >>>> Old score: 0.500, new score: 0.600 >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174946) Completed. >>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M >>>> END host=nb_basecluster - Done initializing shared directory >>>> THREAD_ASSOCIATION jobid=echo-kfecpfaj thread=0-1 >>>> host=nb_basecluster replicationGroup=jfecpfaj >>>> Progress: Stage in:1 >>>> START jobid=echo-kfecpfaj host=nb_basecluster - Initializing >>>> directory structure >>>> START path= dir=first-20090506-1912-zqd8t5hg/shared - Creating >>>> directory structure >>>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, -0.01) >>>> Old score: 0.600, new score: 0.590 >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting >>>> status to Submitting >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting >>>> status to Submitted >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting >>>> status to Active >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) setting >>>> status to Completed >>>> multiplyScore(nb_basecluster:0.590(1.411):1/1 overload: 0, 0.01) >>>> Old score: 0.590, new score: 0.600 >>>> multiplyScore(nb_basecluster:0.600(1.419):1/1 overload: 0, 0.1) >>>> Old score: 0.600, new score: 0.700 >>>> Task(type=FILE_OPERATION, identity=urn:0-1-1241655174948) Completed. >>>> Waiting: 0, Running: 0. Heap size: 75M, Heap free: 59M, Max heap: 720M >>>> END jobid=echo-kfecpfaj - Done initializing directory structure >>>> START jobid=echo-kfecpfaj - Staging in files >>>> END jobid=echo-kfecpfaj - Staging in finished >>>> JOB_START jobid=echo-kfecpfaj tr=echo arguments=[Hello, world!] 
>>>> tmpdir=first-20090506-1912-zqd8t5hg/jobs/k/echo-kfecpfaj >>>> host=nb_basecluster >>>> jobid=echo-kfecpfaj task=Task(type=JOB_SUBMISSION, >>>> identity=urn:0-1-1241655174950) >>>> multiplyScore(nb_basecluster:0.700(1.503):1/1 overload: 0, -0.2) >>>> Old score: 0.700, new score: 0.500 >>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) setting >>>> status to Submitting >>>> Submitting task: Task(type=JOB_SUBMISSION, >>>> identity=urn:0-1-1241655174950) >>>> 1241655180260 >>>> 1241655180623 >>>> 1241655181975>>> Task submitted: Task(type=JOB_SUBMISSION, >>>> identity=urn:0-1-1241655174950) >>>> Progress: Submitting:1 >>>> >>>> Progress: Submitting:1 >>>> >>>> Progress: Submitting:1 >>>> Progress: Submitting:1 >>>> Progress: Submitting:1 >>>> Progress: Submitting:1 >>>> ^C >>>> yizhu at ubuntu:~/swift-0.8/examples/swift$ >>>> >>>> -------------------------------------------------------------- >>>> [3] see attachment >>>> ------------------------------------------------------------- >>>> [4] >>>> tp-x001 torque # qstat -f >>>> Job Id: 3.tp-x001.ci.uchicago.edu >>>> Job_Name = STDIN >>>> Job_Owner = jobrun at tp-x001.ci.uchicago.edu >>>> job_state = R >>>> queue = batch >>>> server = tp-x001.ci.uchicago.edu >>>> Checkpoint = u >>>> ctime = Wed May 6 19:22:10 2009 >>>> Error_Path = tp-x001.ci.uchicago.edu:/dev/null >>>> exec_host = tp-x002/0 >>>> Hold_Types = n >>>> Join_Path = n >>>> Keep_Files = n >>>> Mail_Points = n >>>> mtime = Wed May 6 19:22:10 2009 >>>> Output_Path = tp-x001.ci.uchicago.edu:/dev/null >>>> Priority = 0 >>>> qtime = Wed May 6 19:22:10 2009 >>>> Rerunable = True >>>> Resource_List.neednodes = 1 >>>> Resource_List.nodect = 1 >>>> Resource_List.nodes = 1 >>>> Shell_Path_List = /bin/sh >>>> substate = 40 >>>> Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun, >>>> >>>> PBS_O_PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sb >>>> >>>> in:/opt/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=tp-x001.ci.uchicago.edu, >>>> PBS_O_HOST=tp-x001.ci.uchicago.edu, >>>> PBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54, >>>> PBS_O_QUEUE=batch >>>> euser = jobrun >>>> egroup = users >>>> hashname = 3.tp-x001.c >>>> queue_rank = 2 >>>> queue_type = E >>>> comment = Job started on Wed May 06 at 19:22 >>>> etime = Wed May 6 19:22:10 2009 >>>> >>>> tp-x001 torque # >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > From benc at hawaga.org.uk Thu May 7 11:53:22 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 7 May 2009 16:53:22 +0000 (GMT) Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A030ECC.2030303@cs.uchicago.edu> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> 
<4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> Message-ID: On Thu, 7 May 2009, Yi Zhu wrote: > Yes. Is there any complex sample/tool I can use (rather than first.swift) to > test Swift performance? Is there any benchmark available I can compare with? For fairly raw throughput, you can look at the test case 066-many.swift, in the Swift SVN at trunk/tests/language-behaviour/066-many.swift If you change your default sites file to use your cluster, you can type: ./run 066-many in that directory. This test is unlikely to be able to saturate your cluster as the actual executions are very short - the jobs create a 0-byte file each and exit. Also have a look at 0651-several-delay in that same directory. This runs 1m30s delay jobs. By default 10 jobs are run, but it should be apparent from the source how to make it run much larger numbers. --
From hategan at mcs.anl.gov Thu May 7 12:03:39 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 07 May 2009 12:03:39 -0500 Subject: [Swift-devel] Condor-G test on UJ-OSG site In-Reply-To: <4A0311DF.9050003@uchicago.edu> References: <4A0311DF.9050003@uchicago.edu> Message-ID: <1241715819.5713.3.camel@localhost> On Thu, 2009-05-07 at 11:52 -0500, Zhao Zhang wrote: > Hi, > > I got a Condor-G test hanging on the UJ-OSG site. > Does the following message mean that there is no condor running on the > UJ-OSG site? > > zhao > > sites.xml definition > [zzhang at tp-grid1 sites]$ cat condor-g/UJ-OSG.xml > > > > > > /nfs/data/osg_store/osg/tmp/UJ-OSG > grid > gt2 > osg-ce.grid.uj.ac.za/jobmanager-condor > > > > Globus-Job-Run: > [zzhang at tp-grid1 sites]$ globus-job-run osg-ce.grid.uj.ac.za /usr/bin/id > uid=640(osgedu) gid=2000(osg) groups=2000(osg) > > Condor-Hold-Reason > [zzhang at tp-grid1 ~]$ condor_q 166 -long | grep Hold > PeriodicHold = FALSE > OnExitHold = FALSE > HoldReasonCode = 2 That's Condor's way of saying "gram reported an error" (http://www.cs.wisc.edu/condor/manual/v6.8/2_5Submitting_Job.html#2162) > HoldReasonSubCode = 93 And that's the gram error which says "the gatekeeper failed to find the requested service" (http://homepages.nesc.ac.uk/~gcw/NGS/GRAM_error_codes.html) So I think the problem there might be "jobmanager-condor" which should probably be "jobmanager-pbs" or otherwise something is broken with the site.
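A quick way to check that theory from the submit host is to probe the candidate jobmanagers directly with the same Globus clients used above. The jobmanager name goes at the end of the GRAM contact string; leaving it off (as in the globus-job-run test above) hits the gatekeeper's default service, usually jobmanager-fork, which is why the plain test passed while the Condor-G job went on hold. A sketch, assuming a valid grid proxy:

  # authentication-only probe of a specific gatekeeper service
  globusrun -a -r osg-ce.grid.uj.ac.za/jobmanager-pbs

  # run a trivial job through each candidate jobmanager
  globus-job-run osg-ce.grid.uj.ac.za/jobmanager-pbs /usr/bin/id
  globus-job-run osg-ce.grid.uj.ac.za/jobmanager-condor /usr/bin/id

If the jobmanager-condor contact fails with GRAM error 93 ("the gatekeeper failed to find the requested service") while jobmanager-pbs works, the jobmanager named in the sites file is the problem.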
From benc at hawaga.org.uk Thu May 7 12:01:59 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 7 May 2009 17:01:59 +0000 (GMT) Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A030ECC.2030303@cs.uchicago.edu> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch>
If you try this, an interesting comparison would be to compare with a separate virtual cluster running on each physical cluster, using Swift's multisite features to spread work between those multiple virtual clusters. This might affect things in at least the following ways: In the everything-is-a-single-cluster case, the shared posix file system (NFS or GPFS) will perhaps encounter delays due to high latency between the pieces of the cluster. This may slow things down in the same way as I think happens for MPI-over-WAN. In the multi-site case, each posix shared filesystem will be closer (though how much closer?) to its workers. But swift will have to transfer data to each one, and between each one. --
From benc at hawaga.org.uk Thu May 7 12:04:00 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 7 May 2009 17:04:00 +0000 (GMT) Subject: [Swift-devel] Condor-G test on UJ-OSG site In-Reply-To: <1241715819.5713.3.camel@localhost> References: <4A0311DF.9050003@uchicago.edu> <1241715819.5713.3.camel@localhost> Message-ID: On Thu, 7 May 2009, Mihael Hategan wrote: > So I think the problem there might be "jobmanager-condor" which should > probably be "jobmanager-pbs" or otherwise something is broken with the > site. Yes, it's PBS. Zhao, are you basing your sites file on the output of swift-osg-ress-site-catalog? If so, you should see the jobmanager correctly populated for each site there: url="osg-ce.grid.uj.ac.za/jobmanager-pbs" major="2" /> and should base your Condor-G sites file on that. --
From foster at anl.gov Thu May 7 12:10:10 2009 From: foster at anl.gov (foster at anl.gov) Date: Thu, 7 May 2009 12:10:10 -0500 (CDT) Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A03122E.60403@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <4A03122E.60403@mcs.anl.gov> Message-ID: I'd suggest we want some microbenchmarks too. Then we want to get running on Amazon and evaluate S3. Ian -- from mobile
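The kind of throughput microbenchmark Ian asks for, and that 066-many implements, has roughly the following shape in SwiftScript. This is a sketch rather than the actual test source from the Swift SVN, and it assumes a touch entry in tc.data along the lines of the entries shown earlier in this thread (e.g. "nb_basecluster touch /bin/touch INSTALLED INTEL32::LINUX null"):

  type file;

  // each job creates an empty output file and exits, so a run
  // measures dispatch and staging overhead rather than app time
  app (file o) touch() {
      touch @o;
  }

  file outs[] <simple_mapper; prefix="many.", suffix=".out">;

  foreach i in [0:999] {
      outs[i] = touch();
  }

Scaling the [0:999] range up or down changes the offered load; the interesting numbers are how many jobs per second get dispatched and where the Progress counts saturate.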
From benc at hawaga.org.uk Thu May 7 12:17:52 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 7 May 2009 17:17:52 +0000 (GMT) Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A03122E.60403@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov>
<49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <4A03122E.60403@mcs.anl.gov> Message-ID: On Thu, 7 May 2009, Michael Wilde wrote: > I wonder if the Swift engine could pull the data back from a GridFTP server > running on the worker node? There's nothing virtual-machine specific here, I think. From the perspective of the cio work: If the site shared filesystem is some distributed fs (which is what I think the collective I/O stuff mostly is) then a CIO-aware gridftp client could choose one of several gridftp servers that have access to the same cio file system, and select a preferable one to contact. That could fit into Swift at the cog file transfer layer, as an alternative provider to gridftp. It would preserve Swift's view of sites as fairly monolithic entities, with intra-site selection activity made by lower levels in the stack (as happens with pbs or falkon or coasters). That hierarchy feels more scalable to me - it works ok in the job submission scenario with worker nodes being allocated by LRMs, for example. --
From wilde at mcs.anl.gov Thu May 7 12:18:46 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 07 May 2009 12:18:46 -0500 Subject: [Swift-devel] Condor-G test on UJ-OSG site In-Reply-To: References: <4A0311DF.9050003@uchicago.edu> <1241715819.5713.3.camel@localhost> Message-ID: <4A0317F6.80708@mcs.anl.gov> And we should parameterize swift-osg-ress-site-catalog to generate the exact site catalog needed for Condor-G and other variations. On 5/7/09 12:04 PM, Ben Clifford wrote: > On Thu, 7 May 2009, Mihael Hategan wrote: > >> So I think the problem there might be "jobmanager-condor" which should >> probably be "jobmanager-pbs" or otherwise something is broken with the >> site. > > Yes, it's PBS. > > Zhao, are you basing your sites file on the output of > swift-osg-ress-site-catalog? > > If so, you should see the jobmanager correctly populated for each site > there: > > url="osg-ce.grid.uj.ac.za/jobmanager-pbs" major="2" /> > > and should base your Condor-G sites file on that. >
From bugzilla-daemon at mcs.anl.gov Thu May 7 12:25:31 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 7 May 2009 12:25:31 -0500 (CDT) Subject: [Swift-devel] [Bug 207] New: swift-osg-ress-site-catalog should have parameter to generate condor-g style site files Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=207 Summary: swift-osg-ress-site-catalog should have parameter to generate condor-g style site files Product: Swift Version: unspecified Platform: PC OS/Version: Mac OS Status: NEW Severity: enhancement Priority: P2 Component: General AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk swift-osg-ress-site-catalog should have parameter to generate condor-g style site files. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter.
From zhaozhang at uchicago.edu Thu May 7 12:26:33 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 07 May 2009 12:26:33 -0500 Subject: [Swift-devel] Condor-G test on UJ-OSG site In-Reply-To: References: <4A0311DF.9050003@uchicago.edu> <1241715819.5713.3.camel@localhost> Message-ID: <4A0319C9.3030803@uchicago.edu> Hi, Ben Ben Clifford wrote: > On Thu, 7 May 2009, Mihael Hategan wrote: > > >> So I think the problem there might be "jobmanager-condor" which should >> probably be "jobmanager-pbs" or otherwise something is broken with the >> site. >> > > Yes, it's PBS. > > Zhao, are you basing your sites file on the output of > swift-osg-ress-site-catalog? > I am basing mine on swift-osg-ress-site-catalog in most cases. I am still confused about the role of each component in the sites file. I need some time to understand them. GridFtp, I know it. Job-Manager, we have several ways to declare it: gt2 iogw1.hpc.ufl.edu/jobmanager-condor Execution Provider, I am confused about the roles of Job-Manager and Execution-Provider: say we have pbs as the job-manager; then I am assuming Swift will submit jobs through GRAM to PBS, and PBS will request resources and run the jobs. In this case, why do we need an Execution-Provider? What does it do? zhao > If so, you should see the jobmanager correctly populated for each site > there: > > url="osg-ce.grid.uj.ac.za/jobmanager-pbs" major="2" /> > > and should base your Condor-G sites file on that. > >
From benc at hawaga.org.uk Thu May 7 12:30:47 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 7 May 2009 17:30:47 +0000 (GMT) Subject: [Swift-devel] Condor-G test on UJ-OSG site In-Reply-To: <4A0319C9.3030803@uchicago.edu> References: <4A0311DF.9050003@uchicago.edu> <1241715819.5713.3.camel@localhost> <4A0319C9.3030803@uchicago.edu> Message-ID: On Thu, 7 May 2009, Zhao Zhang wrote: > Job-Manager, we have several ways to declare it: > major="2" /> means about the same as pointing an execution element with provider="gt2" at the same GRAM contact (see the sites.xml sketch after this message). So then your choice is either provider="gt2" or provider="condor", but not both. When you use provider="gt2" you are using direct GRAM submission from Swift. When you use provider="condor" you are submitting to a condor running on the local system, and by using the gridResource profile you are indicating that the local condor should submit to some remote GRAM system. --
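Side by side, the two styles Ben describes look roughly like this. The list archiver stripped most of the XML out of the messages above, so the element and attribute names below are reconstructed from the surviving fragments (the workdirectory path, the "grid" jobType, and the "gt2 host/jobmanager" gridResource string in Zhao's UJ-OSG.xml listing); treat this as a sketch rather than verbatim Swift 0.8 syntax, and the gsiftp URL in particular is an assumption.

Direct GRAM submission from Swift (provider="gt2"):

  <pool handle="UJ-OSG">
    <gridftp url="gsiftp://osg-ce.grid.uj.ac.za"/>
    <execution provider="gt2" url="osg-ce.grid.uj.ac.za/jobmanager-pbs"/>
    <workdirectory>/nfs/data/osg_store/osg/tmp/UJ-OSG</workdirectory>
  </pool>

Submission through a local Condor-G, which forwards to the remote GRAM (provider="condor" plus the gridResource profile):

  <pool handle="UJ-OSG">
    <gridftp url="gsiftp://osg-ce.grid.uj.ac.za"/>
    <execution provider="condor"/>
    <profile namespace="globus" key="jobType">grid</profile>
    <profile namespace="globus" key="gridResource">gt2 osg-ce.grid.uj.ac.za/jobmanager-pbs</profile>
    <workdirectory>/nfs/data/osg_store/osg/tmp/UJ-OSG</workdirectory>
  </pool>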
From wilde at mcs.anl.gov Thu May 7 12:33:07 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 07 May 2009 12:33:07 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <4A03122E.60403@mcs.anl.gov> Message-ID: <4A031B53.9060501@mcs.anl.gov> That all sounds very good. On 5/7/09 12:17 PM, Ben Clifford wrote: > On Thu, 7 May 2009, Michael Wilde wrote: >> I wonder if the Swift engine could pull the data back from a GridFTP server >> running on the worker node? > There's nothing virtual-machine specific here, I think. > > From the perspective of the cio work: > > If the site shared filesystem is some distributed fs (which is what I > think the collective I/O stuff mostly is) then a CIO-aware gridftp client > could choose one of several gridftp servers that have access to the same > cio file system, and select a preferable one to contact. > > That could fit into Swift at the cog file transfer layer, as an > alternative provider to gridftp. > > It would preserve Swift's view of sites as fairly monolithic entities, > with intra-site selection activity made by lower levels in the stack (as > happens with pbs or falkon or coasters). > > That hierarchy feels more scalable to me - it works ok in the job > submission scenario with worker nodes being allocated by LRMs, for > example. >
From iraicu at cs.uchicago.edu Thu May 7 12:34:34 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 07 May 2009 12:34:34 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A030ECC.2030303@cs.uchicago.edu> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> Message-ID: <4A031BAA.1060807@cs.uchicago.edu> Hi Yi, Back in late 2007, Catalin, Tim, Borja and I started a side project whose goal was to evaluate how well Swift could run on EC2. We got quite far, up to the point of running a complex application (MolDyn) at small scales of some 10 molecules (100s ~ 1000s of jobs) on about 10 EC2 instances. Our environment was Swift+Falkon+Workspace_Service+EC2, and our virtual cluster had NFS mounted to enable Swift to operate unmodified. At the time, a shared file system was a requirement for Swift. Also, the Workspace Service is now Nimbus. Unfortunately, Catalin ended up finding a job, and we never really finished the project. We never got to the point of comparing the real app, MolDyn, between say the TeraGrid and EC2, and we never got to run real stress tests for throughputs, etc. In case you find anything useful, here are the wikis we maintained during our work related to Swift and EC2: http://dev.globus.org/wiki/Incubator/Falkon/EC2. This is really exciting work, good luck! Ioan
>>>> tmpdir=first-20090506-1912-zqd8t5hg/jobs/k/echo-kfecpfaj >>>> host=nb_basecluster >>>> jobid=echo-kfecpfaj task=Task(type=JOB_SUBMISSION, >>>> identity=urn:0-1-1241655174950) >>>> multiplyScore(nb_basecluster:0.700(1.503):1/1 overload: 0, -0.2) >>>> Old score: 0.700, new score: 0.500 >>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1241655174950) setting >>>> status to Submitting >>>> Submitting task: Task(type=JOB_SUBMISSION, >>>> identity=urn:0-1-1241655174950) >>>> 1241655180260 >>>> 1241655180623 >>>> 1241655181975>>> Task submitted: Task(type=JOB_SUBMISSION, >>>> identity=urn:0-1-1241655174950) >>>> Progress: Submitting:1 >>>> >>>> Progress: Submitting:1 >>>> >>>> Progress: Submitting:1 >>>> Progress: Submitting:1 >>>> Progress: Submitting:1 >>>> Progress: Submitting:1 >>>> ^C >>>> yizhu at ubuntu:~/swift-0.8/examples/swift$ >>>> >>>> -------------------------------------------------------------- >>>> [3] see attachment >>>> ------------------------------------------------------------- >>>> [4] >>>> tp-x001 torque # qstat -f >>>> Job Id: 3.tp-x001.ci.uchicago.edu >>>> Job_Name = STDIN >>>> Job_Owner = jobrun at tp-x001.ci.uchicago.edu >>>> job_state = R >>>> queue = batch >>>> server = tp-x001.ci.uchicago.edu >>>> Checkpoint = u >>>> ctime = Wed May 6 19:22:10 2009 >>>> Error_Path = tp-x001.ci.uchicago.edu:/dev/null >>>> exec_host = tp-x002/0 >>>> Hold_Types = n >>>> Join_Path = n >>>> Keep_Files = n >>>> Mail_Points = n >>>> mtime = Wed May 6 19:22:10 2009 >>>> Output_Path = tp-x001.ci.uchicago.edu:/dev/null >>>> Priority = 0 >>>> qtime = Wed May 6 19:22:10 2009 >>>> Rerunable = True >>>> Resource_List.neednodes = 1 >>>> Resource_List.nodect = 1 >>>> Resource_List.nodes = 1 >>>> Shell_Path_List = /bin/sh >>>> substate = 40 >>>> Variable_List = PBS_O_HOME=/home/jobrun,PBS_O_LOGNAME=jobrun, >>>> >>>> PBS_O_PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sb >>>> >>>> in:/opt/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=tp-x001.ci.uchicago.edu, >>>> PBS_O_HOST=tp-x001.ci.uchicago.edu, >>>> PBS_O_WORKDIR=/home/jobrun/first-20090506-1922-xaandi54, >>>> PBS_O_QUEUE=batch >>>> euser = jobrun >>>> egroup = users >>>> hashname = 3.tp-x001.c >>>> queue_rank = 2 >>>> queue_type = E >>>> comment = Job started on Wed May 06 at 19:22 >>>> etime = Wed May 6 19:22:10 2009 >>>> >>>> tp-x001 torque # >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From wilde at mcs.anl.gov Thu May 7 13:08:55 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 07 May 2009 13:08:55 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <20090507125435.62306f23@sietch> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> Message-ID: <4A0323B7.8080206@mcs.anl.gov> On 5/7/09 12:54 PM, Tim Freeman wrote: > On Thu, 07 May 2009 11:39:40 -0500 > Yi Zhu wrote: > >> Michael Wilde wrote: >>> Very good! >>> >>> Now, what kind of tests can you do next? >> Next, I will try to get swift running on Amazon EC2. >> >>> Can you exercise the cluster with an interesting workflow? >> Yes. Is there any complex sample/tool I can use (rather than >> first.swift) to test swift performance? Is there any benchmark available >> I can compare with? >> >>> How large of a cluster can you assemble in a Nimbus workspace ? >> Since the vm-image I use to test 'swift' is based on an NFS shared file >> system, the performance may not be satisfactory if we have a >> large-scale cluster. After I get swift running on Amazon EC2, I will >> try to make a dedicated vm-image using GPFS or any other shared file >> system you recommend. >> >> >>> Can you aggregate VM's from a few different physical clusters into one >>> Nimbus workspace? >> I don't think so. Tim may comment on it. > > There is some work going on right now making auto-configuration easier to do > over multiple clusters (that is possible now, it's just very 'manual' and > non-ideal unlike with one physical cluster). You wouldn't really want to do NFS > across a WAN, though. Indeed. Now that I think this through more clearly, one workspace == one cluster == one Swift "site", so we could aggregate the resources of multiple workspaces through Swift to execute a multi-site workflow. - Mike >> >>> What's the largest cluster you can assemble with Nimbus? >> I am not quite sure; I will do some tests on it soon. Since it is an >> EC2-like cloud, it should easily be configured as a cluster with >> hundreds of nodes. Tim may comment on it. > > I've heard of EC2 deployments in the 1000s at once; it is up to your EC2 account > limitations (they seem pretty efficient about making your quota ever-higher). > The Nimbus installation at Teraport maxes out at 16; there are other 'science > clouds' but I don't know their node numbers. EC2 is the place where you will > be able to really test scaling something. > > Tim
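A multi-site run of the kind Mike describes needs nothing more than one pool entry per workspace in sites.xml. A minimal sketch, assuming two hypothetical Nimbus head nodes; the second handle, URL, and paths are illustrative, not real deployments:

  <config>
    <pool handle="nimbus_site_a">
      <execution url="https://tp-x001.ci.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService" jobManager="PBS" provider="gt4" />
      <workdirectory>/home/jobrun</workdirectory>
    </pool>
    <pool handle="nimbus_site_b">
      <execution url="https://head-node.cloud-b.example.edu:8443/wsrf/services/ManagedJobFactoryService" jobManager="PBS" provider="gt4" />
      <workdirectory>/home/jobrun</workdirectory>
    </pool>
  </config>

Swift's site selection will then spread jobs across both workspaces, provided tc.data lists each application under both handles.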
From tfreeman at mcs.anl.gov Thu May 7 12:54:35 2009 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Thu, 7 May 2009 12:54:35 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A030ECC.2030303@cs.uchicago.edu> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> Message-ID: <20090507125435.62306f23@sietch> On Thu, 07 May 2009 11:39:40 -0500 Yi Zhu wrote: > Michael Wilde wrote: > > Very good! > > > > Now, what kind of tests can you do next? > > Next, I will try to get swift running on Amazon EC2. > > > Can you exercise the cluster with an interesting workflow? > > Yes. Is there any complex sample/tool I can use (rather than > first.swift) to test swift performance? Is there any benchmark available > I can compare with? > > > How large of a cluster can you assemble in a Nimbus workspace ? > > Since the vm-image I use to test 'swift' is based on an NFS shared file > system, the performance may not be satisfactory if we have a > large-scale cluster. After I get swift running on Amazon EC2, I will > try to make a dedicated vm-image using GPFS or any other shared file > system you recommend. > > > > Can you aggregate VM's from a few different physical clusters into one > > Nimbus workspace? > > I don't think so. Tim may comment on it. There is some work going on right now making auto-configuration easier to do over multiple clusters (that is possible now, it's just very 'manual' and non-ideal unlike with one physical cluster). You wouldn't really want to do NFS across a WAN, though. > > > > What's the largest cluster you can assemble with Nimbus? > > I am not quite sure; I will do some tests on it soon. Since it is an > EC2-like cloud, it should easily be configured as a cluster with > hundreds of nodes. Tim may comment on it. I've heard of EC2 deployments in the 1000s at once; it is up to your EC2 account limitations (they seem pretty efficient about making your quota ever-higher). The Nimbus installation at Teraport maxes out at 16; there are other 'science clouds' but I don't know their node numbers. EC2 is the place where you will be able to really test scaling something. Tim From zhaozhang at uchicago.edu Thu May 7 13:29:12 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 07 May 2009 13:29:12 -0500 Subject: [Swift-devel] Condor-G test on UJ-OSG site In-Reply-To: References: <4A0311DF.9050003@uchicago.edu> <1241715819.5713.3.camel@localhost> <4A0319C9.3030803@uchicago.edu> Message-ID: <4A032878.1080404@uchicago.edu> Hi, Ben I have another question regarding the definition of sites.xml:

[zzhang at tp-grid1 condor-g_back]$ cat CIT_CMS_T2.xml
1   <config>
2     <pool handle="CIT_CMS_T2">
3       ...
4       ...
5       <execution provider="condor" />
6       <workdirectory>/raid2/osg-data/osg/tmp/CIT_CMS_T2</workdirectory>
7       <profile namespace="globus" key="jobType">grid</profile>
8       <profile namespace="globus" key="gridResource">gt2 cit-gatekeeper.ultralight.org/jobmanager-condor</profile>
9     </pool>
10  </config>

In the above case, we are specifying condor twice, Line 5 and Line 8. Is Line 5 the one where we are specifying Condor-G? So Condor-G is an execution provider, right? Then what does Line 8 mean? Specifying the location of the job manager? zhao Ben Clifford wrote: > On Thu, 7 May 2009, Zhao Zhang wrote: > >> Job-Manager, we have several ways to declare it: >> <jobmanager ... major="2" /> >> means about the same as: >> <execution provider="gt2" ... /> > > So then your choice is either <execution provider="gt2" ... /> or <execution provider="condor" ... /> but not both. > > When you use provider="gt2" you are using direct GRAM submission from > Swift. > > When you use provider="condor" you are submitting to a condor running on > the local system, and by using the gridResource profile you are indicating > that the local condor should submit to some remote GRAM system. > >
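To make Ben's distinction concrete, here is a minimal sketch of the two pool styles he contrasts; the gatekeeper contact string is taken from the CIT_CMS_T2 listing above, and the rest of each entry is illustrative:

Direct GRAM2 submission from Swift:

  <execution provider="gt2" url="cit-gatekeeper.ultralight.org/jobmanager-condor" />

Submission through a Condor-G running on the local machine:

  <execution provider="condor" />
  <profile namespace="globus" key="jobType">grid</profile>
  <profile namespace="globus" key="gridResource">gt2 cit-gatekeeper.ultralight.org/jobmanager-condor</profile>

In the first form Swift contacts the remote gatekeeper itself; in the second, jobs are queued with the local Condor, whose gridmanager then forwards them to the same gatekeeper.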
From wilde at mcs.anl.gov Thu May 7 14:40:32 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 07 May 2009 14:40:32 -0500 Subject: [Swift-devel] [Fwd: Re: adem user manual 0.2] Message-ID: <4A033930.8030504@mcs.anl.gov> Zhao, I think this is the latest ADEM document. Zhengxiong, if there is something more recent, can you send it to us? We'd like to start testing it a bit more. Thanks, Mike -------- Original Message -------- Subject: Re: adem user manual 0.2 Date: Mon, 16 Feb 2009 23:28:14 -0600 From: Zhengxiong Hou To: Michael Wilde References: <49999702.3000701 at mcs.anl.gov> <4999A5CD.20109 at mcs.anl.gov> Hi Mike, The attached is the updated adem user manual 0.2. Please download the adem-osg.tar.gz and run it again. Thanks much, Regards! Zhengxiong -------------- next part -------------- A non-text attachment was scrubbed... Name: User Manual-ADEM for OSG-0.2.doc Type: application/msword Size: 625152 bytes Desc: not available URL: From wilde at mcs.anl.gov Thu May 7 14:47:36 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 07 May 2009 14:47:36 -0500 Subject: [Swift-devel] What is best OSG site test suite ? Message-ID: <4A033AD8.40209@mcs.anl.gov> Hi All, Many of you have developed, or use, various test suites to probe the sites of an OSG VO to see if basic authentication, job execution, and data transfer work for a given user (cert). Can you let us know, on swift-devel, what test suite you use, and what you suggest that the Swift team use, as part of a Swift verification test suite that reports what sites Swift does (and does not) work on? In other words, once a user probes OSG to create a Swift sites file, it would be useful to test if the basic services are working for the test user, before testing at a higher level through Swift. Thanks, Mike From keahey at mcs.anl.gov Thu May 7 18:03:08 2009 From: keahey at mcs.anl.gov (Kate Keahey) Date: Thu, 07 May 2009 18:03:08 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <20090507125435.62306f23@sietch> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> Message-ID: <4A0368AC.2030003@mcs.anl.gov> Science Clouds: Purdue has just 4-6 nodes on a regular basis (but have been known to extend as need arises), same for Masaryk; U of Florida has 32.
Tim Freeman wrote: > On Thu, 07 May 2009 11:39:40 -0500 > Yi Zhu wrote: > >> Michael Wilde wrote: >>> Very good! >>> >>> Now, what kind of tests can you do next? >> Next, I will try to let swift running on Amazon EC2. >> >>> Can you exercise the cluster with an interesting workflow? >> Yes, Is there any complex sample/tools i can use (rahter than >> first.swift) to test swift performance? Is there any benchmark available >> i can compare with? >> >>> How large of a cluster can you assemble in a Nimbus workspace ? >> Since the vm-image i use to test 'swift' is based on NFS shared file >> system, the performance may not be satisfiable if the we have a large >> scale of cluster. After I got the swift running on Amazon EC2, I will >> try to make a dedicate vm-image by using GPFS or any other shared file >> system you recommended. >> >> >>> Can you aggregate VM's from a few different physical clusters into one >>> Nimbus workspace? >> I don't think so. Tim may make commit on it. > > There is some work going on right now making auto-configuration easier to do > over multiple clusters (that is possible now, it's just very 'manual' and > non-ideal unlike with one physical cluster). You wouldn't really want to do NFS > across a WAN, though. > >> >>> What's the largest cluster you can assemble with Nimbus? >> I am not quite sure,I will do some test onto it soon. since it is a >> EC2-like cloud, it should easily be configured as a cluster with >> hundreds of nodes. Tim may make commit on it. > > I've heard of EC2 deployments in the 1000s at once, it's up to your EC2 account > limitations (they seem pretty efficient with making your quota ever-higher). > Nimbus installation at Teraport maxes out at 16, there are other 'science > clouds' but I don't know their node numbers. EC2 is the place where you will > be able to really test scaling something. > > Tim -- Kate Keahey, Mathematics & CS Division, Argonne National Laboratory Computation Institute, University of Chicago From aespinosa at cs.uchicago.edu Fri May 8 00:17:57 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Fri, 8 May 2009 00:17:57 -0500 Subject: [Swift-devel] Using multiple "optional" providers Message-ID: <20090508051757.GA24615@origin> I tried including provider-deef and provider-ln through $ant -Dwith-provider-deef -Dwith-provider-ln redist but only provider-deef is being built (or provider-ln if it was the one defined first). So I set it to arbitrary values $ant -Dwith-provider-deef=1 -Dwith-provider-ln=1 redist and it works! Works for other combinations of these optional providers like wonky and condor -Allan From tiejing at gmail.com Fri May 8 00:43:00 2009 From: tiejing at gmail.com (Jing Tie) Date: Fri, 8 May 2009 00:43:00 -0500 Subject: [Swift-devel] Re: What is best OSG site test suite ? In-Reply-To: <4A033AD8.40209@mcs.anl.gov> References: <4A033AD8.40209@mcs.anl.gov> Message-ID: Hi Mike, I think others might have better tools, but I can list three here as possible options. The first one is "osg-vo-test" developed by Chris Green. It lists probing results for all the sites in a certain VO (e.g. http://www.ci.uchicago.edu/~jtie/osg-vo-test/osg_summary/2008727-19:20:43.html). I attached his announcement about this tool in the end of the email. The second one "test_osg_ce" is used by Zhengxiong, Xi and me a lot. It's a simple package that I developed based on site_verify.pl. It runs the script (check authentication, globus-job-run, gridftp, user directory...) 
for each site in a user configuration file, and prints out the good ones. I have attached it in the email. The last one is RSV, developed by the GOC. It runs similar scripts as site_verify.pl by system administrators, and reports to a centralized server periodically. You can find it on MyOSG (http://myosg.grid.iu.edu/about). I think it won't be hard to set up an independent version for users. Hope it helps, Jing ------------------------------------------------------ Email from Chris Green: Hi, I am happy to announce the first general release of an extensible VO-centric site testing kit, "osg-vo-test". To explain why you might be interested in this package, I quote from the overview on the package's home TWiki page: ------------------------------------------------------------------------ This package is an attempt to allow /application owners/ (by which I mean people responsible for running an application on OSG) to characterize OSG sites from the point of view of being able to run your application. Questions can be asked of each site in multiple ways, for instance:

* Command line, eg: ping my-ce.my-domain
* Fork job, eg: globus-job-run my-ce.my-domain /usr/bin/printenv
* Batch job via CondorG.
* ReSS, the *Re*source *S*election *S*ystem.
* VORS, the *VO* *R*esource *S*elector.

The results are presented primarily in the form of an HTML table with results columns (possibly multiple columns per test), with a link to more detailed information for the test. In addition, the summary results are available in .CSV format for machine readability; a true XML format may be forthcoming if there is enough demand. The application owner can write new test modules inheriting from the old; for more details, see Making your own module. In addition, existing tests are highly configurable and allow for the addition of new results columns with minimal effort; for a quick example, see the Getting Started section; and also the detail section. ------------------------------------------------------------------------ It is extremely straightforward, for example, to test basic authorization at a site from the point of view of your own voms-proxy; and other tests are showcased in example control scripts or provided as standalone test modules. The example summary page shows a wide range of tests of which this extensible system is capable. Please visit the package's home TWiki page, download the source, use it, and give feedback. The aim is to make it easy for application owners to put together their own suite or suites of tests to analyze sites across the OSG from the perspective of the needs of their own application(s) without having to re-invent the wheel to interrogate all the different sources of information about OSG sites. On Thu, May 7, 2009 at 2:47 PM, Michael Wilde wrote: > Hi All, > > Many of you have developed, or use, various test suites to probe the sites > of an OSG VO to see if basic authentication, job execution, and data > transfer work for a given user (cert). > > Can you let us know, on swift-devel, what test suite you use, and what you > suggest that the Swift team use, as part of a Swift verification test suite > that reports what sites Swift does (and does not) work on? > > In other words, once a user probes OSG to create a Swift sites file, it > would be useful to test if the basic services are working for the test user, > before testing at a higher level through Swift. > > Thanks, > > Mike > > -------------- next part -------------- A non-text attachment was scrubbed... Name: test_osg_ce.zip Type: application/zip Size: 30520 bytes Desc: not available URL:
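The checks these kits automate are easy to spot-check by hand with the standard Globus clients the announcement mentions. A minimal sketch; sites.txt (a plain list of gatekeeper hostnames) is hypothetical:

  for ce in $(cat sites.txt); do
    echo "== $ce =="
    # bail out early if there is no valid proxy at all
    grid-proxy-info -exists || { echo "no valid proxy"; break; }
    # authentication + fork job execution
    globus-job-run $ce/jobmanager-fork /bin/date || echo "$ce: fork job FAILED"
    # gridftp write access
    globus-url-copy file:///etc/hosts gsiftp://$ce/tmp/probe.$$ || echo "$ce: gridftp FAILED"
  done

A site that passes all three checks is a reasonable candidate for a Swift sites.xml entry; anything that fails here will certainly fail under Swift too.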
From benc at hawaga.org.uk Fri May 8 01:56:54 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 May 2009 06:56:54 +0000 (GMT) Subject: [Swift-devel] Condor-G test on UJ-OSG site In-Reply-To: <4A032878.1080404@uchicago.edu> References: <4A0311DF.9050003@uchicago.edu> <1241715819.5713.3.camel@localhost> <4A0319C9.3030803@uchicago.edu> <4A032878.1080404@uchicago.edu> Message-ID: On Thu, 7 May 2009, Zhao Zhang wrote: > In the above case, we are specifying condor twice, Line 5 and Line 8. > Is Line 5 the one where we are specifying Condor-G? So Condor-G is an execution > provider, right? > > Then what does Line 8 mean? Specifying the location of the job manager? Think about the case for the UJ site rather than the site you pasted. Condor is specified once, and PBS is specified once. Condor runs on the local system to manage jobs that Swift submits. Condor on the local system then uses GRAM to submit to the remote system. On the remote system, PBS manages the jobs. In the example you pasted, there are two Condor installations. One locally (that is the one referred to in the <execution> element) and one remotely that replaces PBS. -- From benc at hawaga.org.uk Fri May 8 02:14:18 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 May 2009 07:14:18 +0000 (GMT) Subject: [Swift-devel] [provenance-challenge] The Third Provenance Challenge Workshop - Call for Participation (fwd) Message-ID: I'm going to this and intend to present Swift provenance work there. ---------- Forwarded message ---------- Date: Thu, 7 May 2009 10:47:15 -0700 From: Paul Groth Reply-To: provenance-challenge at ipaw.info To: provenance-challenge at ipaw.info Subject: [provenance-challenge] The Third Provenance Challenge Workshop - Call for Participation Please distribute to those that might be interested - Paul ---------------------------------------------- The Third Provenance Challenge Workshop - Call for Participation Amsterdam June 10-11, 2009 Data products are increasingly being produced by the composition of services and data supplied by multiple parties using a variety of data analysis, management, and collection technologies. This approach is particularly evident in e-Science, where scientists combine sensor data and shared Web-accessible databases using a variety of local and remote data analysis routines to produce experimental results. In such environments, provenance (also referred to as audit trail, lineage, and pedigree) plays a critical role as it enables users to understand, verify, reproduce, and ascertain the quality of data products. An important challenge in the context of these compositional applications is how to integrate the provenance data produced by different systems to be able to construct the full provenance of complex data products. To that end, a common data model for provenance is needed to help ease the integration of provenance data across the heterogeneous environments used for running such applications. To work towards interoperability across a variety of provenance systems, since March 2nd, 14 teams from across the world have participated in the Third Provenance Challenge. During this challenge, teams have exchanged provenance data between their provenance systems using a common data model, the Open Provenance Model (OPM). They developed OPM serializations in both RDF and XML, ran provenance queries over the exchanged data, and began to create common tools for use with OPM.
The Third Provenance Challenge Workshop will provide a forum for teams to present their results and exchange ideas about OPM as an interoperability layer. Additionally, the workshop will provide the community an opportunity to discuss the future of interoperability between provenance systems and how to continue and expand the provenance community. Registration Info To see the state of the art in provenance systems and be part of the conversation, attend the Third Provenance Challenge Workshop held at Science Park Amsterdam June 10-11. Registration is free. To register please send e-mail with your name, address and affiliation to pgroth at isi.edu with Register PC3 in the subject field. (Your name will be added to the participant list on the challenge wiki.) - More information about the challenge can be found at http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge - Event details (including location, hotel info and schedule) can be found at http://twiki.ipaw.info/bin/view/Challenge/LocalDetailsPC3 Contact: For details or questions, contact Paul Groth (pgroth at isi.edu). Sponsors: Microsoft The Virtual Laboratory for e-Science Organizers: Paul Groth, ISI / University of Southern California Yogesh Simmhan, Microsoft Research Luc Moreau, University of Southampton Local Organizers: Adam Belloum, University of Amsterdam Zhiming Zhao, University of Amsterdam From benc at hawaga.org.uk Fri May 8 02:28:12 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 May 2009 07:28:12 +0000 (GMT) Subject: [Swift-devel] Using multiple "optional" providers In-Reply-To: <20090508051757.GA24615@origin> References: <20090508051757.GA24615@origin> Message-ID: Slightly strange things happen when turning options on and off and running redist. For example, if provider-ln is already built, it appears in your dist/ directory even if you don't enable it in the present build. I think the same will happen for deef. You might be encountering effects like that. On Fri, 8 May 2009, Allan Espinosa wrote: > I tried including provider-deef and provider-ln through > > $ant -Dwith-provider-deef -Dwith-provider-ln redist > > but only provider-deef is being built (or provider-ln if it was the one > defined first). So I set it to arbitrary values > > $ant -Dwith-provider-deef=1 -Dwith-provider-ln=1 redist > > and it works! > > Works for other combinations of these optional providers like wonky and > condor > > -Allan > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Fri May 8 06:07:27 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 May 2009 11:07:27 +0000 (GMT) Subject: [Swift-devel] What is best OSG site test suite ? In-Reply-To: <4A033AD8.40209@mcs.anl.gov> References: <4A033AD8.40209@mcs.anl.gov> Message-ID: I don't have any experience using it, but gt4.2 contained some certificate diagnosis tools that might be worth investigating. -- From wilde at mcs.anl.gov Fri May 8 06:21:01 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 08 May 2009 06:21:01 -0500 Subject: [Swift-devel] Re: What is best OSG site test suite ? In-Reply-To: References: <4A033AD8.40209@mcs.anl.gov> Message-ID: <4A04159D.50901@mcs.anl.gov> Jing, Ben, thanks very much for the tips. Jing, was site_verify.pl the older test suite that was done by Jens as part of VDS? Also, how would you compare test_osg_ce with osg-vo-test? Mats, do you have anything for such site testing?
Zhao, I suggest you start with test_osg_ce. - Mike On 5/8/09 12:43 AM, Jing Tie wrote: > Hi Mike, > > I think others might have better tools, but I can list three here as > possible options. > > The first one is "osg-vo-test" developed by Chris Green. It lists > probing results for all the sites in a certain VO (e.g. > http://www.ci.uchicago.edu/~jtie/osg-vo-test/osg_summary/2008727-19:20:43.html). > I attached his announcement about this tool in the end of the email. > > The second one "test_osg_ce" is used by Zhengxiong, Xi and me a lot. > It's a simple package that I developed based on site_verify.pl. It > runs the script (check authentication, globus-job-run, gridftp, user > directory...) for each site in a user configuration file, and print > out the good ones. I have attached it in the email. > > The last one is RSV developed by GOC. It runs similar scripts as > site_verify.pl by system administrators, and reports to a centralized > server periodically. You can find it on MyOSG > (http://myosg.grid.iu.edu/about). I think it won't be hard to setup an > independent version for user. > > Hope it helps, > Jing > > > ------------------------------------------------------ > Email from Chris Green: > > Hi, > > I am happy to announce the first general release of an extensible > VO-centric site testing kit, "osg-vo-test". > > To explain why you might be interested in this package, I quote from the > overview > > on the package's home TWiki page > : > > ------------------------------------------------------------------------ > > This package is an attempt to allow /application owners/ (by which I > mean people responsible for running an application on OSG) to > characterize OSG sites from the point of view of being able to run your > application. Questions can be asked of each site in multiple ways, for > instance: > > * Command line, eg: > > ping my-ce.my-domain > > * Fork job, eg: > > globus-job-run my-ce.my-domain /usr/bin/printenv > > * Batch job via CondorG. > > * ReSS, the *Re*source *S*election *S*ystem. > > * VORS, the *VO* *R*esource *S*elector. > > The results are presented primarily in the form of an HTML table > > with results columns (possibly multiple columns per test), with a link > to more detailed information for the test. > > In addition, the summary results are available in .CSV format > > for machine readability; a true XML format may be forthcoming if there > is enough demand. > > The application owner can write new test modules inheriting from the > old; for more details, see Making your own module > . > In addition, existing tests are highly configurable and allow for the > addition of new results columns with minimal effort; for a quick > example, see the Getting Started > > section; and also the detail > > section. > > ------------------------------------------------------------------------ > > It is extremely straightforward, for example, to test basic > authorization at a site from the point of view of your own voms-proxy; > and other tests are showcased in example control scripts; or provided as > standalone test modules. The example summary page > > shows a wide range of tests of which this extensible system is capable. > > Please, visit the package's home TWiki page > , > download > > the source; use > ; > and give feedback . 
The aim > is to make it easy for application owners to put together their own > suite or suites of tests to analyze sites across the OSG from the > perspective of the needs of their own application(s) without having to > re-invent the wheel to interrogate all the different sources of > information about OSG sites. > > > On Thu, May 7, 2009 at 2:47 PM, Michael Wilde wrote: >> Hi All, >> >> Many of you have developed, or use, various test suites to probe the sites >> of an OSG VO to see if basic authentication, job execution, and data >> transfer work for a given user (cert). >> >> Can you let us know, on swift-devel, what test suite you use, and what you >> suggest that the Swift team use, as part of a Swift verification test suite >> that reports what sites Swift does (and does not) work on? >> >> In other words, once a user probes OSG to create a Swift sites file, it >> would be useful to test if the basic services are working for the test user, >> before testing at a higher level through Swift. >> >> Thanks, >> >> Mike >> >> From benc at hawaga.org.uk Fri May 8 06:49:56 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 May 2009 11:49:56 +0000 (GMT) Subject: [Swift-devel] Re: What is best OSG site test suite ? In-Reply-To: <4A04159D.50901@mcs.anl.gov> References: <4A033AD8.40209@mcs.anl.gov> <4A04159D.50901@mcs.anl.gov> Message-ID: On Fri, 8 May 2009, Michael Wilde wrote: > Mats, do you have anything for such site testing? OSG Engage does site testing for anything that advertises support for that VO and publishes the results. You can get the results for that by using the --engage-verified option to swift-osg-ress-site-catalog, which will generate a sites file for sites that have passed the most recent test. This does not address your requirement to test > for a given user (cert). as it uses some fixed credential for testing, not yours (or another users). -- From aespinosa at cs.uchicago.edu Fri May 8 08:01:47 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Fri, 8 May 2009 08:01:47 -0500 Subject: [Swift-devel] Using multiple "optional" providers In-Reply-To: References: <20090508051757.GA24615@origin> Message-ID: <20090508130147.GA1579@origin> $ ant -Dwith-provider-ln redist ... build build build ... $ ls dist/swift-svn/lib/cog-provider-ln* dist/swift-svn/lib/cog-provider-deef* ls: dist/swift-svn/lib/cog-provider-deef*: No such file or directory dist/swift-svn/lib/cog-provider-ln-2.2.jar $ ant -Dwith-provider-deef redist ... build build build ... $ ls dist/swift-svn/lib/cog-provider-ln* dist/swift-svn/lib/cog-provider-deef* ls: dist/swift-svn/lib/cog-provider-ln*: No such file or directory dist/swift-svn/lib/cog-provider-deef-1.1.jar Looks like my version of ant is behaving properly. -Allan On Fri, May 08, 2009 at 07:28:12AM +0000, Ben Clifford wrote: > > Slightly strange things happen when turning options on and off and running > redist. For example, if provider-ln is already built, it appears in your > dist/ directory even if you don't enable it in the present build. I think > the same will happen for deef. You might be encountering effects like > that. > > On Fri, 8 May 2009, Allan Espinosa wrote: > > > I tried including provider-deef and provider-ln through > > > > $ant -Dwith-provider-deef -Dwith-provider-ln redist > > > > but only provider-deef is being built (or provider-ln if it was the one > > defined first). So I set it to arbitrary values > > > > $ant -Dwith-provider-deef=1 -Dwith-provider-ln=1 redist > > > > and it works! 
> > > > Works for other combinations of these optional providers like wonky and > > condor > > > > -Allan > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From qinz at ihpc.a-star.edu.sg Fri May 8 10:40:43 2009 From: qinz at ihpc.a-star.edu.sg (Qin Zheng) Date: Fri, 8 May 2009 23:40:43 +0800 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> , Message-ID: Dear all, I just completed a draft on the runtime adaptation work. Now start running some experiments. Your comments would be appreciated. Ben, please feel free to forward it to Jon if it's relevant. Sincerely, Qin Zheng ________________________________________ From: Qin Zheng Sent: Wednesday, April 08, 2009 4:03 PM To: Ben Clifford Cc: Ian Foster; swift-devel Subject: RE: [Swift-devel] Re: replication vs site score Dear Ben, Thanks for your detailed reply and it helps me understand scheduling in Swift better. I wrote from a researcher perspective and I understand that for development, there are much more practical issues and are more challenging. I agree with you that scheduling a task after its parents completes is cost effective. It is the best "time" given all the updated info on the completion times of its parents. Also, it makes DAG submission easy (without dependency description) and minimizes the number of job instances in queues. The concern is that at this time, the task still needs to be submitted in queue and wait. This may not be sufficient for workflows with deadlines, where certain delivery guarantee in response time is necessary. The same applies for other remaining tasks in the workflow. I felt besides offline planning, runtime adaptation is necessary considering task duration variation (overrun) and faults. But the number of updates should be kept minimum and only for the very near future as the workflow proceeds. I am writing a paper on this and hopefully I could share it with you guys in a few weeks. This implies that the Swift code could be submitted a little bit more eagerly with a short-sighted look ahead. Yes, your points on the differences are valid and the replica in my case is used for FT while in Swift it could enable a task to run earlier (by submitting a replica at a short queue). You mentioned about queue time and can you share more on it, for example its accuracy and also the change to have some other job prioritization for coasters? I will be on star cruise to Malaysia in a few hours :). If I can not access email there, I will reply to you guys on Friday when I return to Singapore. Qin Zheng -----Original Message----- From: Ben Clifford [mailto:benc at hawaga.org.uk] Sent: Tuesday, April 07, 2009 7:31 PM To: Qin Zheng Cc: Ian Foster; swift-devel Subject: RE: [Swift-devel] Re: replication vs site score Hi. Most/all of the work that we've done with Swift works with fairly opportunistic use of resources - we submit work into job queues on one or more sites, where those job queues are shared with many other users, and where the runtimes for both our jobs and other users jobs are not well defined ahead of time. So whilst we use the word 'scheduling' sometimes in Swift, its more a case of "what do we think is the best site to queue a job on right now?" rather than making an execution plan that we think will be valid for a long period of time. 
Our replication mechanism sounds fairly similar to your pre-scheduled backups, but I think there are these important differences: * we don't launch a replica until we think there is a reasonable chance that the replica will run instead of the original (based on queue time) * as soon as one of the jobs *starts* running, we cancel all the others. from what I understand, you do that when one of the jobs *ends* successfully. We do have one situation where we have some pre-allocation of resources, and that is when coasters are being used. These use the above opportunistic queuing methods to acquire a worker node for a long period of time, and then runs Swift level jobs in there, at present on a first-come first-serve basis. Its likely that we'll change that to have some other job prioritisation, but still pre-scheduling the jobs. Where Swift would have trouble working with an ahead-of-time planner/scheduler is that the module that generates file transfer and execution tasks from high level SwiftScripts does not submit a dependent task for scheduling and execution until its predecessors have been successfully executed. What the scheduler sees is a stream, over time, of file transfer and execution tasks that are safe to run immediately. It might be easy, or it might be hard, to make the Swift code submit more eagerly, with description of task dependencies, which would allow you to plug in a pre-planner underneath. On Tue, 7 Apr 2009, Qin Zheng wrote: > Prof Foster, thanks for introducing me to the team. > > My research interest is on scheduling workflows (DAGs). Ben, we decided > not to use resubmission in the consideration that a DAG cannot be > completed when any of its tasks fails, which each time would trigger the > resubmission\retry of the DAG. Instead, we use fault tolerance by > pre-scheduling replica (backup) for each task (see enclosure for > details). The objective is to guarantee that this DAG can be completed > (in a preplanned manner with fast failover to the backup upon failure) > before its deadline. > > Currently I am also working on workflow scheduling under uncertainties > of task running times. This work includes priorities tasks based on the > impact of the variation of its running time on the overall response time > and offline planning for high-priority tasks as well as runtime > adaptation for all tasks once up-to-date information is available. > > I am looking forward to talking to you guys and knowing your research! > > Regards, > Qin Zheng > ________________________________ > From: Ian Foster [mailto:foster at anl.gov] > Sent: Monday, April 06, 2009 10:46 PM > To: Ben Clifford > Cc: swift-devel; Qin Zheng > Subject: Re: [Swift-devel] Re: replication vs site score > > Ben: > > You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing jobs that enable new jobs to run. Those ideas seem relevant here. > > I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who has been working on the scheduling of replicant jobs. His interest is in doing this for jobs that have failed, while I think your interest is in scheduling for jobs that may have failed--a somewhat different thing. But there may be a connection. > > Ian. > > > On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote: > > > even more rambling... 
in the context of a scheduler that is doing things > like prioritising jobs based on more than the order that Swift happened to > submit them (hopefully I will have a student for this in the summer), I > think a replicant job should be pushed toward later execution rather than > earlier execution to reduce the number of replicant jobs in the system at > any one time. > > This is because I suspect (though I have gathered no numerical evidence) > that given the choice between submitting a fresh job and a replicant job > (making up terminology here too... mmm), it is almost always better to > submit the fresh job. Either we end up submitting the replicant job > eventually (in which case we are no worse off than if we submitted the > replicant first and then a fresh job); or by delaying the replicant job we > give that replicant's original a chance to start running and thus do not > discard our precious time-and-load-dollars that we have already spent on > queueing that replicant's original. > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > ________________________________ > This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you. > This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you. -------------- next part -------------- A non-text attachment was scrubbed... Name: Adaptation_Qin.pdf Type: application/pdf Size: 103038 bytes Desc: Adaptation_Qin.pdf URL:
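For anyone wiring this up: in later Swift releases the replication behaviour described in this thread is switched on from swift.properties. A minimal sketch; the property names below are quoted from memory of the Swift user guide and should be checked against your release before use:

  # enable replication of jobs that sit too long in remote queues (off by default)
  replication.enabled=true
  # submit a replica only after the original has been queued this long (seconds)
  replication.min.queue.time=60
  # never keep more than this many copies of one job in flight
  replication.limit=3

With these set, Swift submits a replica only once the original has waited past the threshold, and cancels the remaining copies as soon as one of them starts running, which matches the behaviour Ben describes above.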
From tiejing at gmail.com Fri May 8 11:34:17 2009 From: tiejing at gmail.com (Jing Tie) Date: Fri, 8 May 2009 11:34:17 -0500 Subject: [Swift-devel] Re: What is best OSG site test suite ? In-Reply-To: <4A04159D.50901@mcs.anl.gov> References: <4A033AD8.40209@mcs.anl.gov> <4A04159D.50901@mcs.anl.gov> Message-ID: On Fri, May 8, 2009 at 6:21 AM, Michael Wilde wrote: > Jing, Ben, thanks very much for the tips. > > Jing, was site_verify.pl the older test suite that was done by Jens as part of > VDS? I think the site_verify scanner is part of OSG MIS, and the author is Craig Prescott. VORS (http://scan.grid.iu.edu/) runs the script in the MIS VO and shows the testing results. But VORS has a problem: if a site doesn't support the MIS VO, the site cannot be seen in VORS. So the GOC is replacing VORS with RSV. > > Also, how would you compare test_osg_ce with osg-vo-test? osg-vo-test contains one more thing than test_osg_ce: resource selection info such as free slots and CPU type. But it selects sites from the dynamic VORS, which has the problem described above, while test_osg_ce selects from a static sites list. RSV started recently, so I am not familiar with it. But I think RSV contains the best list of sites in a chosen VO. Best, Jing > > Mats, do you have anything for such site testing? > > Zhao, I suggest you start with test_osg_ce. > > - Mike > > On 5/8/09 12:43 AM, Jing Tie wrote: >> Hi Mike, >> >> I think others might have better tools, but I can list three here as >> possible options. >> >> The first one is "osg-vo-test" developed by Chris Green. It lists >> probing results for all the sites in a certain VO (e.g. >> http://www.ci.uchicago.edu/~jtie/osg-vo-test/osg_summary/2008727-19:20:43.html). >> I attached his announcement about this tool in the end of the email. >> >> The second one "test_osg_ce" is used by Zhengxiong, Xi and me a lot. >> It's a simple package that I developed based on site_verify.pl. It >> runs the script (check authentication, globus-job-run, gridftp, user >> directory...) for each site in a user configuration file, and prints >> out the good ones. I have attached it in the email. >> >> The last one is RSV, developed by the GOC. It runs similar scripts as >> site_verify.pl by system administrators, and reports to a centralized >> server periodically. You can find it on MyOSG >> (http://myosg.grid.iu.edu/about). I think it won't be hard to set up an >> independent version for users. >> >> Hope it helps, >> Jing >> >> >> ------------------------------------------------------ >> Email from Chris Green: >> >> Hi, >> >> I am happy to announce the first general release of an extensible >> VO-centric site testing kit, "osg-vo-test". >> >> To explain why you might be interested in this package, I quote from the >> overview on the package's home TWiki page: >> >> ------------------------------------------------------------------------ >> >> This package is an attempt to allow /application owners/ (by which I >> mean people responsible for running an application on OSG) to >> characterize OSG sites from the point of view of being able to run your >> application. Questions can be asked of each site in multiple ways, for >> instance: >> >> * Command line, eg: ping my-ce.my-domain >> >> * Fork job, eg: globus-job-run my-ce.my-domain /usr/bin/printenv >> >> * Batch job via CondorG. >> >> * ReSS, the *Re*source *S*election *S*ystem. >> >> * VORS, the *VO* *R*esource *S*elector. >> >> The results are presented primarily in the form of an HTML table >> with results columns (possibly multiple columns per test), with a link >> to more detailed information for the test. >> >> In addition, the summary results are available in .CSV format >> for machine readability; a true XML format may be forthcoming if there >> is enough demand. >> >> The application owner can write new test modules inheriting from the >> old; for more details, see Making your own module. >> In addition, existing tests are highly configurable and allow for the >> addition of new results columns with minimal effort; for a quick >> example, see the Getting Started section; and also the detail section. >> >> ------------------------------------------------------------------------ >> >> It is extremely straightforward, for example, to test basic >> authorization at a site from the point of view of your own voms-proxy; >> and other tests are showcased in example control scripts or provided as >> standalone test modules. The example summary page >> shows a wide range of tests of which this extensible system is capable. >> >> Please visit the package's home TWiki page, download the source, >> use it, and give feedback. The aim >> is to make it easy for application owners to put together their own >> suite or suites of tests to analyze sites across the OSG from the >> perspective of the needs of their own application(s) without having to >> re-invent the wheel to interrogate all the different sources of >> information about OSG sites.
>> >> >> On Thu, May 7, 2009 at 2:47 PM, Michael Wilde wrote: >>> >>> Hi All, >>> >>> Many of you have developed, or use, various test suites to probe the >>> sites >>> of an OSG VO to see if basic authentication, job execution, and data >>> transfer work for a given user (cert). >>> >>> Can you let us know, on swift-devel, what test suite you use, and what >>> you >>> suggest that the Swift team use, as part of a Swift verification test >>> suite >>> that reports what sites Swift does (and does not) work on? >>> >>> In other words, once a user probes OSG to create a Swift sites file, it >>> would be useful to test if the basic services are working for the test >>> user, >>> before testing at a higher level through Swift. >>> >>> Thanks, >>> >>> Mike >>> >>> > From zhaozhang at uchicago.edu Fri May 8 11:47:10 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 08 May 2009 11:47:10 -0500 Subject: [Swift-devel] Re: What is best OSG site test suite ? In-Reply-To: References: <4A033AD8.40209@mcs.anl.gov> <4A04159D.50901@mcs.anl.gov> Message-ID: <4A04620E.3000803@uchicago.edu> Hi, Jing I tried the test_osg_ce tool out. I made two runs there, /home/zzhang/test_osg_ce/200958-10:39:37 is the latter one. For this test, no site showed in ce.pass. I am trying to figure out the error for each site, could you help me point out the reason for error? I mean how to interpret the result file. Also, I am attaching part of the ce.fail file of the first site. best wishes zhao ----------------------- AGLT2: gate01.aglt2.org ----------------------- Checking for a running gatekeeper: 1; port 2119 1 Checking remote host uptime: 1 11:39:50 up 4 days, 12:51, 1 user, load average: 1.18, 1.21, 1.22 1 Checking for GLOBUS_LOCATION: /opt/OSG100/globus 1 Checking expiration date of remote host certificate: Mar 24 17:17:48 2010 GMT 1 Checking for gatekeeper configuration file: 1 /opt/OSG100/globus/etc/globus-gatekeeper.conf 1 Checking users in grid-mapfile, if none must be using Prima: usatlas1,usatlas3,usatlas4,usatlas5 1 Checking for remote globus-sh-tools-vars.sh: 1 1 Checking configured grid services: PASS jobmanager,jobmanager-condor,jobmanager-fork,jobmanager-managedfork 1 Checking for OSG osg-attributes.conf: 1 1 Checking scheduler types associated with remote jobmanagers: 1 jobmanager is of type managedfork jobmanager-condor is of type condor jobmanager-fork is of type managedfork jobmanager-managedfork is of type managedfork 1 Checking for paths to binaries of remote schedulers: 1 Path to condor binaries is /opt/condor/bin Path to managedfork binaries is . 
1 Checking remote scheduler status: 1 condor : 14 jobs running, 0 jobs idle/pending 1 Checking if Globus is deployed from the VDT: 1; version 1.10.1j 1 Checking for OSG version: 1; version 1.0.0 1 Checking for OSG grid3-user-vo-map.txt: 1 usatlas users: usatlas1,usatlas2,usatlas3,usatlas4 gridex users: gridex ops users: ops mis users: mis osg users: osg 1 Checking for OSG site name: AGLT2 1 Checking for OSG $GRID3 definition: /opt/OSG100 1 Checking for OSG $OSG_GRID definition: /afs/atlas.umich.edu/OSGWN 1 Checking for OSG $APP definition: /atlas/data08/OSG/APP 1 Checking for OSG $DATA definition: /atlas/data08/OSG/DATA 1 Checking for OSG $TMP definition: /atlas/data08/OSG/DATA 1 Checking for OSG $WNTMP definition: /tmp 1 Checking for OSG $APP available space: 2032.963 GB 1 Checking for OSG $DATA available space: 2032.963 GB 1 Checking for OSG $TMP available space: 2032.963 GB 1 Checking for OSG additional site-specific variable definitions: 1 ATLAS_APP prod /atlas/data08/OSG/APP/atlas_app ATLAS_DATA prod /atlas/data08/OSG/DATA/atlas_data ... ... 1 Checking for OSG execution jobmanager(s): gate01.aglt2.org/jobmanager-condor 1 Checking for OSG utility jobmanager(s): gate01.aglt2.org/jobmanager 1 Checking for OSG sponsoring VO: usatlas:80 local:20 1 Checking for OSG policy expression: NONE 1 Checking for OSG setup.sh: 1 1 Checking for OSG $Monalisa_HOME definition: /opt/osg-ce-1.0.0-r2/MonaLisa 1 Checking for MonALISA configuration: 0 Can't obtain ml_env 0 Checking for a running MonALISA: 0 MonALISA does not appear to be running 1 Checking for a running GANGLIA gmond daemon: 1 (pid 8095 ...) /usr/sbin/gmond name = "swap_free" owner = "University of Michigan" url = "https://hep.pa.msu.edu/twiki/bin/view/AGLT2" 1 Checking for a running GANGLIA gmetad daemon: 0 gmetad does not appear to be running 1 Checking for a running gsiftp server: 1; port 2811 1 Jing Tie wrote: > On Fri, May 8, 2009 at 6:21 AM, Michael Wilde wrote: > >> Jing, Ben, thanks very much for the tips. >> >> Jing, was site_verify.pr the older test suite was done by Jens as part of >> VDS? >> > > I think site_verify scanner is a part of OSG MIS, and the author is > Craig Prescott. > > VORS (http://scan.grid.iu.edu/) runs the script in MIS VO and shows > the testing results. But VORS has problems since if a site doesn't > support MIS VO, the site cannot be seen on the VORS. So GOC is > replacing VORS with RSV. > > >> Also, how would you compare test_osg_ce with osg-vo-test? >> > > osg-vo-test contains one more thing than test_osg_ce: resource > selection info such as free slot and cpu type. But it selects sites > from dynamic VORS which has the problem described above, while > test_osg_ce selects from a static sites list. > > RSV starts recently, so I am not familiar with it. But I think RSV > contains the best list of sites in a chosen VO. > > Best, > Jing > > >> Mats, do you have anything for such site testing? >> >> Zhao, I suggest you start with test_osg_ce. >> >> - Mike >> >> On 5/8/09 12:43 AM, Jing Tie wrote: >> >>> Hi Mike, >>> >>> I think others might have better tools, but I can list three here as >>> possible options. >>> >>> The first one is "osg-vo-test" developed by Chris Green. It lists >>> probing results for all the sites in a certain VO (e.g. >>> >>> http://www.ci.uchicago.edu/~jtie/osg-vo-test/osg_summary/2008727-19:20:43.html). >>> I attached his announcement about this tool in the end of the email. >>> >>> The second one "test_osg_ce" is used by Zhengxiong, Xi and me a lot. 
>>> It's a simple package that I developed based on site_verify.pl. It >>> runs the script (check authentication, globus-job-run, gridftp, user >>> directory...) for each site in a user configuration file, and print >>> out the good ones. I have attached it in the email. >>> >>> The last one is RSV developed by GOC. It runs similar scripts as >>> site_verify.pl by system administrators, and reports to a centralized >>> server periodically. You can find it on MyOSG >>> (http://myosg.grid.iu.edu/about). I think it won't be hard to setup an >>> independent version for user. >>> >>> Hope it helps, >>> Jing >>> >>> >>> ------------------------------------------------------ >>> Email from Chris Green: >>> >>> Hi, >>> >>> I am happy to announce the first general release of an extensible >>> VO-centric site testing kit, "osg-vo-test". >>> >>> To explain why you might be interested in this package, I quote from the >>> overview >>> >>> >>> on the package's home TWiki page >>> : >>> >>> ------------------------------------------------------------------------ >>> >>> This package is an attempt to allow /application owners/ (by which I >>> mean people responsible for running an application on OSG) to >>> characterize OSG sites from the point of view of being able to run your >>> application. Questions can be asked of each site in multiple ways, for >>> instance: >>> >>> * Command line, eg: >>> >>> ping my-ce.my-domain >>> >>> * Fork job, eg: >>> >>> globus-job-run my-ce.my-domain /usr/bin/printenv >>> >>> * Batch job via CondorG. >>> >>> * ReSS, the *Re*source *S*election *S*ystem. >>> >>> * VORS, the *VO* *R*esource *S*elector. >>> >>> The results are presented primarily in the form of an HTML table >>> >>> >>> with results columns (possibly multiple columns per test), with a link >>> to more detailed information for the test. >>> >>> In addition, the summary results are available in .CSV format >>> >>> >>> for machine readability; a true XML format may be forthcoming if there >>> is enough demand. >>> >>> The application owner can write new test modules inheriting from the >>> old; for more details, see Making your own module >>> >>> . >>> In addition, existing tests are highly configurable and allow for the >>> addition of new results columns with minimal effort; for a quick >>> example, see the Getting Started >>> >>> >>> section; and also the detail >>> >>> >>> section. >>> >>> ------------------------------------------------------------------------ >>> >>> It is extremely straightforward, for example, to test basic >>> authorization at a site from the point of view of your own voms-proxy; >>> and other tests are showcased in example control scripts; or provided as >>> standalone test modules. The example summary page >>> >>> >>> shows a wide range of tests of which this extensible system is capable. >>> >>> Please, visit the package's home TWiki page >>> , >>> download >>> >>> >>> the source; use >>> >>> ; >>> and give feedback . The aim >>> is to make it easy for application owners to put together their own >>> suite or suites of tests to analyze sites across the OSG from the >>> perspective of the needs of their own application(s) without having to >>> re-invent the wheel to interrogate all the different sources of >>> information about OSG sites. 
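To make concrete what such per-site probes boil down to, a minimal loop in the same spirit might look like the sketch below. This is not test_osg_ce or osg-vo-test themselves; only the ce.pass/ce.fail naming is borrowed from the discussion above, and the one-CE-per-line site-list format is an assumption.

#!/bin/sh
# Hypothetical per-CE probe in the spirit of test_osg_ce: an
# authentication-only check followed by a trivial fork job, appending
# each CE to ce.pass or ce.fail. Real suites such as site_verify.pl
# check far more than this (gridftp, OSG variables, jobmanagers, ...).
while read ce; do
    if globusrun -a -r "$ce" >/dev/null 2>&1 &&
       globus-job-run "$ce/jobmanager-fork" /bin/true >/dev/null 2>&1
    then
        echo "$ce" >> ce.pass
    else
        echo "$ce" >> ce.fail
    fi
done < sites.list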
>>> >>> >>> On Thu, May 7, 2009 at 2:47 PM, Michael Wilde wrote: >>> >>>> Hi All, >>>> >>>> Many of you have developed, or use, various test suites to probe the >>>> sites >>>> of an OSG VO to see if basic authentication, job execution, and data >>>> transfer work for a given user (cert). >>>> >>>> Can you let us know, on swift-devel, what test suite you use, and what >>>> you >>>> suggest that the Swift team use, as part of a Swift verification test >>>> suite >>>> that reports what sites Swift does (and does not) work on? >>>> >>>> In other words, once a user probes OSG to create a Swift sites file, it >>>> would be useful to test if the basic services are working for the test >>>> user, >>>> before testing at a higher level through Swift. >>>> >>>> Thanks, >>>> >>>> Mike >>>> >>>> >>>> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From zhaozhang at uchicago.edu Fri May 8 13:54:14 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 08 May 2009 13:54:14 -0500 Subject: [Swift-devel] Condor-G test on UJ-OSG site In-Reply-To: References: <4A0311DF.9050003@uchicago.edu> <1241715819.5713.3.camel@localhost> <4A0319C9.3030803@uchicago.edu> <4A032878.1080404@uchicago.edu> Message-ID: <4A047FD6.1050909@uchicago.edu> Hi, Ben Ben Clifford wrote: > On Thu, 7 May 2009, Zhao Zhang wrote: > > >> In the above case, we are specifying condor twice, Line 5 and Line 8. >> Is it Line 5 the one we are specifying condor-G ? So Condor-G is a execution >> provider, right? >> >> Then what does Line 8 mean? Specifying the location of job-manager? >> > > Think about the case for the UJ site rather than the site you pasted. > > Condor is specified once, and PBS is specified once. > > Condor runs on the local system to manage jobs that Swift submits. Condor > the local system then uses GRAM to submit to the remote system. On the > remote system, PBS manages the jobs. > > In the example you pasted, there are two Condor installations. One locally > (that is the one referred to in the element) and one remotely > that replaces PBS. > So, for the remote job manger, I could use pbs, condor, fork or SGE which ever is available on the remote site, right? zhao From rynge at renci.org Fri May 8 15:48:58 2009 From: rynge at renci.org (Mats Rynge) Date: Fri, 08 May 2009 16:48:58 -0400 Subject: [Swift-devel] Re: What is best OSG site test suite ? In-Reply-To: <4A033AD8.40209@mcs.anl.gov> References: <4A033AD8.40209@mcs.anl.gov> Message-ID: <4A049ABA.2080600@renci.org> Michael Wilde wrote: > > Many of you have developed, or use, various test suites to probe the > sites of an OSG VO to see if basic authentication, job execution, and > data transfer work for a given user (cert). Other people have replied with the individual user test suites. But personally, I believe that users should not have to do the testing. That should be done at the VO level. Engagement VO runs tests and advertises the result. For swift, the easiest way to get the information is to use: swift-osg-ress-site-catalog --engage-verified Compare this with: swift-osg-ress-site-catalog --vo=engage The first only returns sites which passed verification last night. The latter returns all sites advertising support for Engagement. I recommend that Engagement users only use the first version, and that the site catalog is updated at least daily. 
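As a sketch, a session along those lines might look like this (the redirection assumes the tool writes the generated catalog to standard output; the script and tc file names are placeholders):

# Sites file restricted to Engagement sites that passed last night's
# verification, then a Swift run against it:
swift-osg-ress-site-catalog --engage-verified > osg-sites.xml
swift -sites.file osg-sites.xml -tc.file tc.data myscript.swift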
-- Mats Rynge Renaissance Computing Institute From aespinosa at cs.uchicago.edu Fri May 8 15:46:40 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Fri, 8 May 2009 15:46:40 -0500 Subject: [Swift-devel] SWIFT_JOBDIR_PATH env not passed on remote sites Message-ID: <50b07b4b0905081346k647306bds6c9d86b2e22f8a52@mail.gmail.com> Hi, SWIFT_JOBDIR_PATH only works in local executions. it looks like the environment variable is not being passed for remote job submissions: sites.xml: fast /home/aespinosa/workflows/jobdirpath/sitedir /home/aespinosa/workflows/localnode resulting wrapper log: Progress 2009-05-08 15:33:31.258501000-0500 LOG_START _____________________________________________________________________________ Wrapper _____________________________________________________________________________ Job directory mode is: link on shared filesystem DIR=jobs/v/cat-vbmpqiaj EXEC=/bin/cat STDIN= STDOUT=061-cattwo.out STDERR=stderr.txt DIRS= INF=061-cattwo.1.in|061-cattwo.2.in OUTF=061-cattwo.out KICKSTART= ARGS=061-cattwo.1.in 061-cattwo.2.in ARGC=2 Progress 2009-05-08 15:33:31.277785000-0500 CREATE_JOBDIR Created job directory: jobs/v/cat-vbmpqiaj Progress 2009-05-08 15:33:31.289956000-0500 CREATE_INPUTDIR Progress 2009-05-08 15:33:31.294467000-0500 LINK_INPUTS Linked input: /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.1.in to jobs/v/cat-vbmpqiaj/061-cattwo.1.in Linked input: /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.2.in to jobs/v/cat-vbmpqiaj/061-cattwo.2.in Progress 2009-05-08 15:33:31.327217000-0500 EXECUTE Moving back to workflow directory /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g Progress 2009-05-08 15:33:31.338850000-0500 EXECUTE_DONE Job ran successfully Progress 2009-05-08 15:33:31.343910000-0500 COPYING_OUTPUTS Progress 2009-05-08 15:33:31.373251000-0500 RM_JOBDIR Progress 2009-05-08 15:33:31.383733000-0500 TOUCH_SUCCESS Progress 2009-05-08 15:33:31.398652000-0500 END the directory "localnode" is also not created. Test peformed using 061-cattwo. my testcase script is in ~aespinosa/workflows/jobdirpath [aespinosa at tp-login1 jobdirpath]$ ./runtest.sh Swift svn swift-r2911 cog-r2394 RunID: remoterun Progress: uninitialized:1 Progress: Active:1 Final status: Finished successfully:1 Remote test FAIL Swift svn swift-r2911 cog-r2394 RunID: localrun Progress: Final status: Finished successfully:1 Local test PASS -Allan -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Fri May 8 16:23:15 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 May 2009 16:23:15 -0500 Subject: [Swift-devel] SWIFT_JOBDIR_PATH env not passed on remote sites In-Reply-To: <50b07b4b0905081346k647306bds6c9d86b2e22f8a52@mail.gmail.com> References: <50b07b4b0905081346k647306bds6c9d86b2e22f8a52@mail.gmail.com> Message-ID: <1241817795.15432.0.camel@localhost> Does it also happen if you use GT2 instead of PBS? On Fri, 2009-05-08 at 15:46 -0500, Allan Espinosa wrote: > Hi, > > SWIFT_JOBDIR_PATH only works in local executions. 
it looks like the > environment variable is not being passed for remote job submissions: > > sites.xml: > > > > fast > > /home/aespinosa/workflows/jobdirpath/sitedir > key="SWIFT_JOBDIR_PATH">/home/aespinosa/workflows/localnode > > > > resulting wrapper log: > Progress 2009-05-08 15:33:31.258501000-0500 LOG_START > > _____________________________________________________________________________ > > Wrapper > _____________________________________________________________________________ > > Job directory mode is: link on shared filesystem > DIR=jobs/v/cat-vbmpqiaj > EXEC=/bin/cat > STDIN= > STDOUT=061-cattwo.out > STDERR=stderr.txt > DIRS= > INF=061-cattwo.1.in|061-cattwo.2.in > OUTF=061-cattwo.out > KICKSTART= > ARGS=061-cattwo.1.in 061-cattwo.2.in > ARGC=2 > Progress 2009-05-08 15:33:31.277785000-0500 CREATE_JOBDIR > Created job directory: jobs/v/cat-vbmpqiaj > Progress 2009-05-08 15:33:31.289956000-0500 CREATE_INPUTDIR > Progress 2009-05-08 15:33:31.294467000-0500 LINK_INPUTS > Linked input: /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.1.in > to jobs/v/cat-vbmpqiaj/061-cattwo.1.in > Linked input: /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.2.in > to jobs/v/cat-vbmpqiaj/061-cattwo.2.in > Progress 2009-05-08 15:33:31.327217000-0500 EXECUTE > Moving back to workflow directory > /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g > Progress 2009-05-08 15:33:31.338850000-0500 EXECUTE_DONE > Job ran successfully > Progress 2009-05-08 15:33:31.343910000-0500 COPYING_OUTPUTS > Progress 2009-05-08 15:33:31.373251000-0500 RM_JOBDIR > Progress 2009-05-08 15:33:31.383733000-0500 TOUCH_SUCCESS > Progress 2009-05-08 15:33:31.398652000-0500 END > > the directory "localnode" is also not created. Test peformed using > 061-cattwo. my testcase script is in ~aespinosa/workflows/jobdirpath > > [aespinosa at tp-login1 jobdirpath]$ ./runtest.sh > Swift svn swift-r2911 cog-r2394 > > RunID: remoterun > Progress: uninitialized:1 > Progress: Active:1 > Final status: Finished successfully:1 > Remote test FAIL > Swift svn swift-r2911 cog-r2394 > > RunID: localrun > Progress: > Final status: Finished successfully:1 > Local test PASS > > -Allan > > From aespinosa at cs.uchicago.edu Fri May 8 16:37:29 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Fri, 8 May 2009 16:37:29 -0500 Subject: [Swift-devel] SWIFT_JOBDIR_PATH env not passed on remote sites In-Reply-To: <1241817795.15432.0.camel@localhost> References: <50b07b4b0905081346k647306bds6c9d86b2e22f8a52@mail.gmail.com> <1241817795.15432.0.camel@localhost> Message-ID: <50b07b4b0905081437v6f1035ddjd13bcfbe8095b146@mail.gmail.com> It works with gt2. Is this a pbs problem? Other provider that I am testing that it does not work is on provider-deef. For my purposes, I would like to have it work under provider-deef and provider-pbs Thanks, -Allan 2009/5/8 Mihael Hategan : > Does it also happen if you use GT2 instead of PBS? > > On Fri, 2009-05-08 at 15:46 -0500, Allan Espinosa wrote: >> Hi, >> >> SWIFT_JOBDIR_PATH only works in local executions. ?it looks like the >> environment variable is not being passed for remote job submissions: >> >> sites.xml: >> >> ? >> ? ? >> ? ? fast >> ? ? >> ? ? /home/aespinosa/workflows/jobdirpath/sitedir >> ? ? > key="SWIFT_JOBDIR_PATH">/home/aespinosa/workflows/localnode >> ? 
>> >> >> resulting wrapper log: >> Progress 2009-05-08 15:33:31.258501000-0500 LOG_START >> >> _____________________________________________________________________________ >> >> Wrapper >> _____________________________________________________________________________ >> >> Job directory mode is: link on shared filesystem >> DIR=jobs/v/cat-vbmpqiaj >> EXEC=/bin/cat >> STDIN= >> STDOUT=061-cattwo.out >> STDERR=stderr.txt >> DIRS= >> INF=061-cattwo.1.in|061-cattwo.2.in >> OUTF=061-cattwo.out >> KICKSTART= >> ARGS=061-cattwo.1.in 061-cattwo.2.in >> ARGC=2 >> Progress 2009-05-08 15:33:31.277785000-0500 CREATE_JOBDIR >> Created job directory: jobs/v/cat-vbmpqiaj >> Progress 2009-05-08 15:33:31.289956000-0500 CREATE_INPUTDIR >> Progress 2009-05-08 15:33:31.294467000-0500 LINK_INPUTS >> Linked input: /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.1.in >> to jobs/v/cat-vbmpqiaj/061-cattwo.1.in >> Linked input: /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.2.in >> to jobs/v/cat-vbmpqiaj/061-cattwo.2.in >> Progress 2009-05-08 15:33:31.327217000-0500 EXECUTE >> Moving back to workflow directory >> /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g >> Progress 2009-05-08 15:33:31.338850000-0500 EXECUTE_DONE >> Job ran successfully >> Progress 2009-05-08 15:33:31.343910000-0500 COPYING_OUTPUTS >> Progress 2009-05-08 15:33:31.373251000-0500 RM_JOBDIR >> Progress 2009-05-08 15:33:31.383733000-0500 TOUCH_SUCCESS >> Progress 2009-05-08 15:33:31.398652000-0500 END >> >> the directory "localnode" is also not created. Test performed using >> 061-cattwo. my testcase script is in ~aespinosa/workflows/jobdirpath >> >> [aespinosa at tp-login1 jobdirpath]$ ./runtest.sh >> Swift svn swift-r2911 cog-r2394 >> >> RunID: remoterun >> Progress: uninitialized:1 >> Progress: Active:1 >> Final status: Finished successfully:1 >> Remote test FAIL >> Swift svn swift-r2911 cog-r2394 >> >> RunID: localrun >> Progress: >> Final status: Finished successfully:1 >> Local test PASS >> >> -Allan >> >> > > > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From tiejing at gmail.com Fri May 8 16:52:09 2009 From: tiejing at gmail.com (Jing Tie) Date: Fri, 8 May 2009 16:52:09 -0500 Subject: [Swift-devel] Re: What is best OSG site test suite ? In-Reply-To: <4A04620E.3000803@uchicago.edu> References: <4A033AD8.40209@mcs.anl.gov> <4A04159D.50901@mcs.anl.gov> <4A04620E.3000803@uchicago.edu> Message-ID: Hi Zhao, There are two problems: 1. some sites haven't finished all the tests when you run "find_failure.pl" (the tool takes 2-10 minutes depending on the site). If some site hasn't finished all the tests, just re-execute the script later. But if some site takes too long, it might indicate a problem (e.g. with gridftp). 2. the test suite includes a check of the monitoring tool "MonALISA", which may not be useful here. So I deleted the "MonALISA" check, and also made the tool print out the status for all the sites each time you run "find_failure.pl". A newer version is at http://ci.uchicago.edu/~jtie/ts/test_osg_ce_0.1.tar.gz Best, Jing On Fri, May 8, 2009 at 11:47 AM, Zhao Zhang wrote: > Hi, Jing > > I tried the test_osg_ce tool out. I made two runs there; > /home/zzhang/test_osg_ce/200958-10:39:37 is the latter one. > For this test, no site showed up in ce.pass. I am trying to figure out the > error for each site; could you help me point out the > reason for each error? 
I mean how to interpret the result file. Also, I am > attaching part of the ce.fail file of the first site. > > best wishes > zhao > > > ----------------------- AGLT2: gate01.aglt2.org ----------------------- > Checking for a running gatekeeper: > 1; port 2119 > 1 > Checking remote host uptime: > 1 > ?11:39:50 up 4 days, 12:51, ?1 user, ?load average: 1.18, 1.21, 1.22 > 1 > Checking for GLOBUS_LOCATION: > /opt/OSG100/globus > 1 > Checking expiration date of remote host certificate: > Mar 24 17:17:48 2010 GMT > 1 > Checking for gatekeeper configuration file: > 1 > ?/opt/OSG100/globus/etc/globus-gatekeeper.conf > 1 > Checking users in grid-mapfile, if none must be using Prima: > usatlas1,usatlas3,usatlas4,usatlas5 > 1 > Checking for remote globus-sh-tools-vars.sh: > 1 > 1 > Checking configured grid services: > PASS > ?jobmanager,jobmanager-condor,jobmanager-fork,jobmanager-managedfork > 1 > Checking for OSG osg-attributes.conf: > 1 > 1 > Checking scheduler types associated with remote jobmanagers: > 1 > ?jobmanager is of type managedfork > ?jobmanager-condor is of type condor > ?jobmanager-fork is of type managedfork > ?jobmanager-managedfork is of type managedfork > 1 > Checking for paths to binaries of remote schedulers: > 1 > ?Path to condor binaries is /opt/condor/bin > ?Path to managedfork binaries is . > 1 > Checking remote scheduler status: > 1 > ?condor : 14 jobs running, 0 jobs idle/pending > 1 > Checking if Globus is deployed from the VDT: > 1; version 1.10.1j > 1 > Checking for OSG version: > 1; version 1.0.0 > 1 > Checking for OSG grid3-user-vo-map.txt: > 1 > ?usatlas users: usatlas1,usatlas2,usatlas3,usatlas4 > ?gridex users: gridex > ?ops users: ops > ?mis users: mis > ?osg users: osg > 1 > Checking for OSG site name: > AGLT2 > 1 > Checking for OSG $GRID3 definition: > /opt/OSG100 > 1 > Checking for OSG $OSG_GRID definition: > /afs/atlas.umich.edu/OSGWN > 1 > Checking for OSG $APP definition: > /atlas/data08/OSG/APP > 1 > Checking for OSG $DATA definition: > /atlas/data08/OSG/DATA > 1 > Checking for OSG $TMP definition: > /atlas/data08/OSG/DATA > 1 > Checking for OSG $WNTMP definition: > /tmp > 1 > Checking for OSG $APP available space: > 2032.963 GB > 1 > Checking for OSG $DATA available space: > 2032.963 GB > 1 > Checking for OSG $TMP available space: > 2032.963 GB > 1 > Checking for OSG additional site-specific variable definitions: > 1 > ? > ? ATLAS_APP prod /atlas/data08/OSG/APP/atlas_app > ? ATLAS_DATA prod /atlas/data08/OSG/DATA/atlas_data > ? ... > ? ... > 1 > Checking for OSG execution jobmanager(s): > gate01.aglt2.org/jobmanager-condor > 1 > Checking for OSG utility jobmanager(s): > gate01.aglt2.org/jobmanager > 1 > Checking for OSG sponsoring VO: > usatlas:80 local:20 > 1 > Checking for OSG policy expression: > NONE > 1 > Checking for OSG setup.sh: > 1 > 1 > Checking for OSG $Monalisa_HOME definition: > /opt/osg-ce-1.0.0-r2/MonaLisa > 1 > Checking for MonALISA configuration: > 0 > ?Can't obtain ml_env > 0 > Checking for a running MonALISA: > 0 > ?MonALISA does not appear to be running > 1 > Checking for a running GANGLIA gmond daemon: > 1 (pid 8095 ...) 
> ?/usr/sbin/gmond > ?name = "swap_free" > ?owner = "University of Michigan" > ?url = "https://hep.pa.msu.edu/twiki/bin/view/AGLT2" > 1 > Checking for a running GANGLIA gmetad daemon: > 0 > ?gmetad does not appear to be running > 1 > Checking for a running gsiftp server: > 1; port 2811 > 1 > > > Jing Tie wrote: >> >> On Fri, May 8, 2009 at 6:21 AM, Michael Wilde wrote: >> >>> >>> Jing, Ben, thanks very much for the tips. >>> >>> Jing, was site_verify.pr the older test suite was done by Jens as part of >>> VDS? >>> >> >> I think site_verify scanner is a part of OSG MIS, and the author is >> Craig Prescott. >> >> VORS (http://scan.grid.iu.edu/) runs the script in MIS VO and shows >> the testing results. But VORS has problems since if a site doesn't >> support MIS VO, the site cannot be seen on the VORS. So GOC is >> replacing VORS with RSV. >> >> >>> >>> Also, how would you compare test_osg_ce with osg-vo-test? >>> >> >> osg-vo-test contains one more thing than test_osg_ce: resource >> selection info such as free slot and cpu type. But it selects sites >> from dynamic VORS which has the problem described above, while >> test_osg_ce selects from a static sites list. >> >> RSV starts recently, so I am not familiar with it. But I think RSV >> contains the best list of sites in a chosen VO. >> >> Best, >> Jing >> >> >>> >>> Mats, do you have anything for such site testing? >>> >>> Zhao, I suggest you start with test_osg_ce. >>> >>> - Mike >>> >>> On 5/8/09 12:43 AM, Jing Tie wrote: >>> >>>> >>>> Hi Mike, >>>> >>>> I think others might have better tools, but I can list three here as >>>> possible options. >>>> >>>> The first one is "osg-vo-test" developed by Chris Green. It lists >>>> probing results for all the sites in a certain VO (e.g. >>>> >>>> >>>> http://www.ci.uchicago.edu/~jtie/osg-vo-test/osg_summary/2008727-19:20:43.html). >>>> I attached his announcement about this tool in the end of the email. >>>> >>>> The second one "test_osg_ce" is used by Zhengxiong, Xi and me a lot. >>>> It's a simple package that I developed based on site_verify.pl. It >>>> runs the script (check authentication, globus-job-run, gridftp, user >>>> directory...) for each site in a user configuration file, and print >>>> out the good ones. I have attached it in the email. >>>> >>>> The last one is RSV developed by GOC. It runs similar scripts as >>>> site_verify.pl by system administrators, and reports to a centralized >>>> server periodically. You can find it on MyOSG >>>> (http://myosg.grid.iu.edu/about). I think it won't be hard to setup an >>>> independent version for user. >>>> >>>> Hope it helps, >>>> Jing >>>> >>>> >>>> ------------------------------------------------------ >>>> Email from Chris Green: >>>> >>>> Hi, >>>> >>>> I am happy to announce the first general release of an extensible >>>> VO-centric site testing kit, "osg-vo-test". >>>> >>>> To explain why you might be interested in this package, I quote from the >>>> overview >>>> >>>> >>>> >>>> on the package's home TWiki page >>>> : >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> This package is an attempt to allow /application owners/ (by which I >>>> mean people responsible for running an application on OSG) to >>>> characterize OSG sites from the point of view of being able to run your >>>> application. Questions can be asked of each site in multiple ways, for >>>> instance: >>>> >>>> ?* Command line, eg: >>>> >>>> ? ?ping my-ce.my-domain >>>> >>>> ?* Fork job, eg: >>>> >>>> ? 
?globus-job-run my-ce.my-domain /usr/bin/printenv >>>> >>>> ?* Batch job via CondorG. >>>> >>>> ?* ReSS, the *Re*source *S*election *S*ystem. >>>> >>>> ?* VORS, the *VO* *R*esource *S*elector. >>>> >>>> The results are presented primarily in the form of an HTML table >>>> >>>> >>>> >>>> with results columns (possibly multiple columns per test), with a link >>>> to more detailed information for the test. >>>> >>>> In addition, the summary results are available in .CSV format >>>> >>>> >>>> >>>> for machine readability; a true XML format may be forthcoming if there >>>> is enough demand. >>>> >>>> The application owner can write new test modules inheriting from the >>>> old; for more details, see Making your own module >>>> >>>> >>>> . >>>> In addition, existing tests are highly configurable and allow for the >>>> addition of new results columns with minimal effort; for a quick >>>> example, see the Getting Started >>>> >>>> >>>> >>>> section; and also the detail >>>> >>>> >>>> >>>> section. >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> It is extremely straightforward, for example, to test basic >>>> authorization at a site from the point of view of your own voms-proxy; >>>> and other tests are showcased in example control scripts; or provided as >>>> standalone test modules. The example summary page >>>> >>>> >>>> >>>> shows a wide range of tests of which this extensible system is capable. >>>> >>>> Please, visit the package's home TWiki page >>>> , >>>> download >>>> >>>> >>>> >>>> the source; use >>>> >>>> >>>> ; >>>> and give feedback . The aim >>>> is to make it easy for application owners to put together their own >>>> suite or suites of tests to analyze sites across the OSG from the >>>> perspective of the needs of their own application(s) without having to >>>> re-invent the wheel to interrogate all the different sources of >>>> information about OSG sites. >>>> >>>> >>>> On Thu, May 7, 2009 at 2:47 PM, Michael Wilde wrote: >>>> >>>>> >>>>> Hi All, >>>>> >>>>> Many of you have developed, or use, various test suites to probe the >>>>> sites >>>>> of an OSG VO to see if basic authentication, job execution, and data >>>>> transfer work for a given user (cert). >>>>> >>>>> Can you let us know, on swift-devel, what test suite you use, and what >>>>> you >>>>> suggest that the Swift team use, as part of a Swift verification test >>>>> suite >>>>> that reports what sites Swift does (and does not) work on? >>>>> >>>>> In other words, once a user probes OSG to create a Swift sites file, it >>>>> would be useful to test if the basic services are working for the test >>>>> user, >>>>> before testing at a higher level through Swift. >>>>> >>>>> Thanks, >>>>> >>>>> Mike >>>>> >>>>> >>>>> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > From hategan at mcs.anl.gov Fri May 8 17:01:46 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 May 2009 17:01:46 -0500 Subject: [Swift-devel] SWIFT_JOBDIR_PATH env not passed on remote sites In-Reply-To: <50b07b4b0905081437v6f1035ddjd13bcfbe8095b146@mail.gmail.com> References: <50b07b4b0905081346k647306bds6c9d86b2e22f8a52@mail.gmail.com> <1241817795.15432.0.camel@localhost> <50b07b4b0905081437v6f1035ddjd13bcfbe8095b146@mail.gmail.com> Message-ID: <1241820106.16140.0.camel@localhost> On Fri, 2009-05-08 at 16:37 -0500, Allan Espinosa wrote: > It works with gt2. 
Is this a pbs problem? It seems so. You should file a bug report. > > Other provider that I am testing that it does not work is on > provider-deef. For my purposes, I would like to have it work under > provider-deef and provider-pbs > > Thanks, > -Allan > > 2009/5/8 Mihael Hategan : > > Does it also happen if you use GT2 instead of PBS? > > > > On Fri, 2009-05-08 at 15:46 -0500, Allan Espinosa wrote: > >> Hi, > >> > >> SWIFT_JOBDIR_PATH only works in local executions. it looks like the > >> environment variable is not being passed for remote job submissions: > >> > >> sites.xml: > >> > >> > >> > >> fast > >> > >> /home/aespinosa/workflows/jobdirpath/sitedir > >> >> key="SWIFT_JOBDIR_PATH">/home/aespinosa/workflows/localnode > >> > >> > >> > >> resulting wrapper log: > >> Progress 2009-05-08 15:33:31.258501000-0500 LOG_START > >> > >> _____________________________________________________________________________ > >> > >> Wrapper > >> _____________________________________________________________________________ > >> > >> Job directory mode is: link on shared filesystem > >> DIR=jobs/v/cat-vbmpqiaj > >> EXEC=/bin/cat > >> STDIN= > >> STDOUT=061-cattwo.out > >> STDERR=stderr.txt > >> DIRS= > >> INF=061-cattwo.1.in|061-cattwo.2.in > >> OUTF=061-cattwo.out > >> KICKSTART= > >> ARGS=061-cattwo.1.in 061-cattwo.2.in > >> ARGC=2 > >> Progress 2009-05-08 15:33:31.277785000-0500 CREATE_JOBDIR > >> Created job directory: jobs/v/cat-vbmpqiaj > >> Progress 2009-05-08 15:33:31.289956000-0500 CREATE_INPUTDIR > >> Progress 2009-05-08 15:33:31.294467000-0500 LINK_INPUTS > >> Linked input: /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.1.in > >> to jobs/v/cat-vbmpqiaj/061-cattwo.1.in > >> Linked input: /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.2.in > >> to jobs/v/cat-vbmpqiaj/061-cattwo.2.in > >> Progress 2009-05-08 15:33:31.327217000-0500 EXECUTE > >> Moving back to workflow directory > >> /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g > >> Progress 2009-05-08 15:33:31.338850000-0500 EXECUTE_DONE > >> Job ran successfully > >> Progress 2009-05-08 15:33:31.343910000-0500 COPYING_OUTPUTS > >> Progress 2009-05-08 15:33:31.373251000-0500 RM_JOBDIR > >> Progress 2009-05-08 15:33:31.383733000-0500 TOUCH_SUCCESS > >> Progress 2009-05-08 15:33:31.398652000-0500 END > >> > >> the directory "localnode" is also not created. Test peformed using > >> 061-cattwo. 
my testcase script is in ~aespinosa/workflows/jobdirpath > >> > >> [aespinosa at tp-login1 jobdirpath]$ ./runtest.sh > >> Swift svn swift-r2911 cog-r2394 > >> > >> RunID: remoterun > >> Progress: uninitialized:1 > >> Progress: Active:1 > >> Final status: Finished successfully:1 > >> Remote test FAIL > >> Swift svn swift-r2911 cog-r2394 > >> > >> RunID: localrun > >> Progress: > >> Final status: Finished successfully:1 > >> Local test PASS > >> > >> -Allan > >> > >> > > > > > > > > > From keahey at mcs.anl.gov Fri May 8 17:11:22 2009 From: keahey at mcs.anl.gov (Kate Keahey) Date: Fri, 08 May 2009 17:11:22 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A0323B7.8080206@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> Message-ID: <4A04AE0A.9070100@mcs.anl.gov> Michael Wilde wrote: >>>> Can you aggregate VM's from a few different physical clusters into >>>> one Nimbus workspace? >>> I don't think so. Tim may make commit on it. >> >> There is some work going on right now making auto-configuration easier >> to do >> over multiple clusters (that is possible now, it's just very 'manual' and >> non-ideal unlike with one physical cluster). You wouldn't really want >> to do NFS >> across a WAN, though. > > Indeed. Now that I think this through more clearly, one workspace == one > cluster == one Swift "site", so we could aggregate the resources of > multiple workspaces through Swift to execute a multi-site workflow. > > - Mike Just to make sure we are all on the same page: we can stand up one virtual cluster over a few different clouds. We did that with a hadoop clusters, MPI clusters and a few looser things. The only reason Tim is advocating against doing this here is that we are standing up an OSG cluster which has NFS on it where the latencies would likely kill us (as opposed to an MPI cluster where the latencies only slow us down slightly). -- Kate Keahey, Mathematics & CS Division, Argonne National Laboratory Computation Institute, University of Chicago From bugzilla-daemon at mcs.anl.gov Fri May 8 17:15:52 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 8 May 2009 17:15:52 -0500 (CDT) Subject: [Swift-devel] [Bug 208] New: SWIFT_JOBDIR_PATH env not passed on remote sites (pbs and deef) Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=208 Summary: SWIFT_JOBDIR_PATH env not passed on remote sites (pbs and deef) Product: Swift Version: unspecified Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: hategan at mcs.anl.gov ReportedBy: aespinosa at cs.uchicago.edu CC: benc at hawaga.org.uk, wilde at mcs.anl.gov Created an attachment (id=286) --> (http://bugzilla.mcs.anl.gov/swift/attachment.cgi?id=286) test case: pbs (fail), gt2(pass), local(pass) i, SWIFT_JOBDIR_PATH only works in local executions. 
it looks like the environment variable is not being passed for remote job submissions: sites.xml: fast /home/aespinosa/workflows/jobdirpath/sitedir /home/aespinosa/workflows/localnode resulting wrapper log: Progress 2009-05-08 15:33:31.258501000-0500 LOG_START _____________________________________________________________________________ Wrapper _____________________________________________________________________________ Job directory mode is: link on shared filesystem DIR=jobs/v/cat-vbmpqiaj EXEC=/bin/cat STDIN= STDOUT=061-cattwo.out STDERR=stderr.txt DIRS= INF=061-cattwo.1.in|061-cattwo.2.in OUTF=061-cattwo.out KICKSTART= ARGS=061-cattwo.1.in 061-cattwo.2.in ARGC=2 Progress 2009-05-08 15:33:31.277785000-0500 CREATE_JOBDIR Created job directory: jobs/v/cat-vbmpqiaj Progress 2009-05-08 15:33:31.289956000-0500 CREATE_INPUTDIR Progress 2009-05-08 15:33:31.294467000-0500 LINK_INPUTS Linked input: /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.1.in to jobs/v/cat-vbmpqiaj/061-cattwo.1.in Linked input: /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.2.in to jobs/v/cat-vbmpqiaj/061-cattwo.2.in Progress 2009-05-08 15:33:31.327217000-0500 EXECUTE Moving back to workflow directory /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g Progress 2009-05-08 15:33:31.338850000-0500 EXECUTE_DONE Job ran successfully Progress 2009-05-08 15:33:31.343910000-0500 COPYING_OUTPUTS Progress 2009-05-08 15:33:31.373251000-0500 RM_JOBDIR Progress 2009-05-08 15:33:31.383733000-0500 TOUCH_SUCCESS Progress 2009-05-08 15:33:31.398652000-0500 END the directory "localnode" is also not created. Test performed using 061-cattwo. my testcase script is in ~aespinosa/workflows/jobdirpath [aespinosa at tp-login1 jobdirpath]$ ./runtest.sh Swift svn swift-r2911 cog-r2394 RunID: remoterun Progress: uninitialized:1 Progress: Active:1 Final status: Finished successfully:1 Remote test FAIL Swift svn swift-r2911 cog-r2394 RunID: localrun Progress: Final status: Finished successfully:1 Local test PASS Tests indicated SWIFT_JOBDIR_PATH only works with gt2 and local providers. The test case is attached to this bugzilla report -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching someone on the CC list of the bug. 
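The list archive has stripped the XML markup from the sites.xml reproduced in the bug report above. From the surviving fragments and the sites-file schema Swift used at the time, it plausibly read roughly as follows; the pool handle, the gridftp line, and the guess that "fast" was a PBS queue name are reconstructions, not the original file:

<config>
  <pool handle="remotepbs">
    <execution provider="pbs" url="none"/>
    <gridftp url="local://localhost"/>
    <profile namespace="globus" key="queue">fast</profile>
    <workdirectory>/home/aespinosa/workflows/jobdirpath/sitedir</workdirectory>
    <profile namespace="env"
             key="SWIFT_JOBDIR_PATH">/home/aespinosa/workflows/localnode</profile>
  </pool>
</config>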
From foster at anl.gov Fri May 8 17:19:01 2009 From: foster at anl.gov (Ian Foster) Date: Fri, 8 May 2009 17:19:01 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A04AE0A.9070100@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A04AE0A.9070100@mcs.anl.gov> Message-ID: <0B074FA7-FCE3-467C-984C-E08961C61D05@anl.gov> Yes, and all we are trying to do is to get some basic testing done before we move to EC2 On May 8, 2009, at 5:11 PM, Kate Keahey wrote: > > > Michael Wilde wrote: > >>>>> Can you aggregate VM's from a few different physical clusters >>>>> into one Nimbus workspace? >>>> I don't think so. Tim may make commit on it. >>> >>> There is some work going on right now making auto-configuration >>> easier to do >>> over multiple clusters (that is possible now, it's just very >>> 'manual' and >>> non-ideal unlike with one physical cluster). You wouldn't really >>> want to do NFS >>> across a WAN, though. >> Indeed. Now that I think this through more clearly, one workspace >> == one cluster == one Swift "site", so we could aggregate the >> resources of multiple workspaces through Swift to execute a multi- >> site workflow. >> - Mike > > Just to make sure we are all on the same page: we can stand up one > virtual cluster over a few different clouds. We did that with a > hadoop clusters, MPI clusters and a few looser things. The only > reason Tim is advocating against doing this here is that we are > standing up an OSG cluster which has NFS on it where the latencies > would likely kill us (as opposed to an MPI cluster where the > latencies only slow us down slightly). > > > > -- > > Kate Keahey, > Mathematics & CS Division, Argonne National Laboratory > Computation Institute, University of Chicago > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Fri May 8 17:35:40 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 08 May 2009 17:35:40 -0500 Subject: [Swift-devel] SWIFT_JOBDIR_PATH env not passed on remote sites In-Reply-To: <50b07b4b0905081437v6f1035ddjd13bcfbe8095b146@mail.gmail.com> References: <50b07b4b0905081346k647306bds6c9d86b2e22f8a52@mail.gmail.com> <1241817795.15432.0.camel@localhost> <50b07b4b0905081437v6f1035ddjd13bcfbe8095b146@mail.gmail.com> Message-ID: <4A04B3BC.2060600@mcs.anl.gov> Yes, under provider-deef and Falkon the entire environment is lost. We think that the C worker agent is not receiving or setting the environment. - Mike On 5/8/09 4:37 PM, Allan Espinosa wrote: > It works with gt2. Is this a pbs problem? > > Other provider that I am testing that it does not work is on > provider-deef. 
For my purposes, I would like to have it work under > provider-deef and provider-pbs > > Thanks, > -Allan > > 2009/5/8 Mihael Hategan : >> Does it also happen if you use GT2 instead of PBS? >> >> On Fri, 2009-05-08 at 15:46 -0500, Allan Espinosa wrote: >>> Hi, >>> >>> SWIFT_JOBDIR_PATH only works in local executions. it looks like the >>> environment variable is not being passed for remote job submissions: >>> >>> sites.xml: >>> >>> >>> >>> fast >>> >>> /home/aespinosa/workflows/jobdirpath/sitedir >>> >> key="SWIFT_JOBDIR_PATH">/home/aespinosa/workflows/localnode >>> >>> >>> >>> resulting wrapper log: >>> Progress 2009-05-08 15:33:31.258501000-0500 LOG_START >>> >>> _____________________________________________________________________________ >>> >>> Wrapper >>> _____________________________________________________________________________ >>> >>> Job directory mode is: link on shared filesystem >>> DIR=jobs/v/cat-vbmpqiaj >>> EXEC=/bin/cat >>> STDIN= >>> STDOUT=061-cattwo.out >>> STDERR=stderr.txt >>> DIRS= >>> INF=061-cattwo.1.in|061-cattwo.2.in >>> OUTF=061-cattwo.out >>> KICKSTART= >>> ARGS=061-cattwo.1.in 061-cattwo.2.in >>> ARGC=2 >>> Progress 2009-05-08 15:33:31.277785000-0500 CREATE_JOBDIR >>> Created job directory: jobs/v/cat-vbmpqiaj >>> Progress 2009-05-08 15:33:31.289956000-0500 CREATE_INPUTDIR >>> Progress 2009-05-08 15:33:31.294467000-0500 LINK_INPUTS >>> Linked input: /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.1.in >>> to jobs/v/cat-vbmpqiaj/061-cattwo.1.in >>> Linked input: /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.2.in >>> to jobs/v/cat-vbmpqiaj/061-cattwo.2.in >>> Progress 2009-05-08 15:33:31.327217000-0500 EXECUTE >>> Moving back to workflow directory >>> /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g >>> Progress 2009-05-08 15:33:31.338850000-0500 EXECUTE_DONE >>> Job ran successfully >>> Progress 2009-05-08 15:33:31.343910000-0500 COPYING_OUTPUTS >>> Progress 2009-05-08 15:33:31.373251000-0500 RM_JOBDIR >>> Progress 2009-05-08 15:33:31.383733000-0500 TOUCH_SUCCESS >>> Progress 2009-05-08 15:33:31.398652000-0500 END >>> >>> the directory "localnode" is also not created. Test peformed using >>> 061-cattwo. my testcase script is in ~aespinosa/workflows/jobdirpath >>> >>> [aespinosa at tp-login1 jobdirpath]$ ./runtest.sh >>> Swift svn swift-r2911 cog-r2394 >>> >>> RunID: remoterun >>> Progress: uninitialized:1 >>> Progress: Active:1 >>> Final status: Finished successfully:1 >>> Remote test FAIL >>> Swift svn swift-r2911 cog-r2394 >>> >>> RunID: localrun >>> Progress: >>> Final status: Finished successfully:1 >>> Local test PASS >>> >>> -Allan >>> >>> >> >> > > > From benc at hawaga.org.uk Sat May 9 04:22:09 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 9 May 2009 09:22:09 +0000 (GMT) Subject: [Swift-devel] SWIFT_JOBDIR_PATH env not passed on remote sites In-Reply-To: <4A04B3BC.2060600@mcs.anl.gov> References: <50b07b4b0905081346k647306bds6c9d86b2e22f8a52@mail.gmail.com> <1241817795.15432.0.camel@localhost> <50b07b4b0905081437v6f1035ddjd13bcfbe8095b146@mail.gmail.com> <4A04B3BC.2060600@mcs.anl.gov> Message-ID: Swift+falkon+envvars has come up on this list before. The test suite almost catches this - it has a test for environment variables, but that is not regularly run against non-local providers. I should probably fix that. 
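For reference, the check involved is small; a self-contained version might look like the sketch below (an illustration only, not the suite's actual test; the variable name, file names, and the printenv mapping are arbitrary):

type file;

app (file o) getenv () {
    printenv stdout=@filename(o);
}

file out <"env.out">;
out = getenv();

together with a tc.data entry mapping printenv to /usr/bin/printenv on the site under test, and a sites.xml profile such as <profile namespace="env" key="FOO">bar</profile>. Run it against each provider; the test passes if env.out contains FOO=bar.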
On Fri, 8 May 2009, Michael Wilde wrote: > Yes, under provider-deef and Falkon the entire environment is lost. We think > that the C worker agent is not receiving or setting the environment. > > - Mike > > On 5/8/09 4:37 PM, Allan Espinosa wrote: > > It works with gt2. Is this a pbs problem? > > > > Other provider that I am testing that it does not work is on > > provider-deef. For my purposes, I would like to have it work under > > provider-deef and provider-pbs > > > > Thanks, > > -Allan > > > > 2009/5/8 Mihael Hategan : > > > Does it also happen if you use GT2 instead of PBS? > > > > > > On Fri, 2009-05-08 at 15:46 -0500, Allan Espinosa wrote: > > > > Hi, > > > > > > > > SWIFT_JOBDIR_PATH only works in local executions. it looks like the > > > > environment variable is not being passed for remote job submissions: > > > > > > > > sites.xml: > > > > > > > > > > > > > > > > fast > > > > > > > > > > > >/home/aespinosa/workflows/jobdirpath/sitedir > > > > > > > key="SWIFT_JOBDIR_PATH">/home/aespinosa/workflows/localnode > > > > > > > > > > > > > > > > resulting wrapper log: > > > > Progress 2009-05-08 15:33:31.258501000-0500 LOG_START > > > > > > > > _____________________________________________________________________________ > > > > > > > > Wrapper > > > > _____________________________________________________________________________ > > > > > > > > Job directory mode is: link on shared filesystem > > > > DIR=jobs/v/cat-vbmpqiaj > > > > EXEC=/bin/cat > > > > STDIN= > > > > STDOUT=061-cattwo.out > > > > STDERR=stderr.txt > > > > DIRS= > > > > INF=061-cattwo.1.in|061-cattwo.2.in > > > > OUTF=061-cattwo.out > > > > KICKSTART= > > > > ARGS=061-cattwo.1.in 061-cattwo.2.in > > > > ARGC=2 > > > > Progress 2009-05-08 15:33:31.277785000-0500 CREATE_JOBDIR > > > > Created job directory: jobs/v/cat-vbmpqiaj > > > > Progress 2009-05-08 15:33:31.289956000-0500 CREATE_INPUTDIR > > > > Progress 2009-05-08 15:33:31.294467000-0500 LINK_INPUTS > > > > Linked input: > > > > /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.1.in > > > > to jobs/v/cat-vbmpqiaj/061-cattwo.1.in > > > > Linked input: > > > > /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.2.in > > > > to jobs/v/cat-vbmpqiaj/061-cattwo.2.in > > > > Progress 2009-05-08 15:33:31.327217000-0500 EXECUTE > > > > Moving back to workflow directory > > > > /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g > > > > Progress 2009-05-08 15:33:31.338850000-0500 EXECUTE_DONE > > > > Job ran successfully > > > > Progress 2009-05-08 15:33:31.343910000-0500 COPYING_OUTPUTS > > > > Progress 2009-05-08 15:33:31.373251000-0500 RM_JOBDIR > > > > Progress 2009-05-08 15:33:31.383733000-0500 TOUCH_SUCCESS > > > > Progress 2009-05-08 15:33:31.398652000-0500 END > > > > > > > > the directory "localnode" is also not created. Test peformed using > > > > 061-cattwo. 
my testcase script is in ~aespinosa/workflows/jobdirpath > > > > > > > > [aespinosa at tp-login1 jobdirpath]$ ./runtest.sh > > > > Swift svn swift-r2911 cog-r2394 > > > > > > > > RunID: remoterun > > > > Progress: uninitialized:1 > > > > Progress: Active:1 > > > > Final status: Finished successfully:1 > > > > Remote test FAIL > > > > Swift svn swift-r2911 cog-r2394 > > > > > > > > RunID: localrun > > > > Progress: > > > > Final status: Finished successfully:1 > > > > Local test PASS > > > > > > > > -Allan > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Sat May 9 04:29:12 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 9 May 2009 09:29:12 +0000 (GMT) Subject: [Swift-devel] Condor-G test on UJ-OSG site In-Reply-To: <4A047FD6.1050909@uchicago.edu> References: <4A0311DF.9050003@uchicago.edu> <1241715819.5713.3.camel@localhost> <4A0319C9.3030803@uchicago.edu> <4A032878.1080404@uchicago.edu> <4A047FD6.1050909@uchicago.edu> Message-ID: On Fri, 8 May 2009, Zhao Zhang wrote: > So, for the remote job manger, I could use pbs, condor, fork or SGE which ever > is available on the remote site, right? Yes. You must use whichever is installed on the remote site. For OSG thats what swift-osg-ress-site-catalog tells you. -- From wilde at mcs.anl.gov Sat May 9 11:13:18 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 09 May 2009 11:13:18 -0500 Subject: [Swift-devel] SWIFT_JOBDIR_PATH env not passed on remote sites In-Reply-To: References: <50b07b4b0905081346k647306bds6c9d86b2e22f8a52@mail.gmail.com> <1241817795.15432.0.camel@localhost> <50b07b4b0905081437v6f1035ddjd13bcfbe8095b146@mail.gmail.com> <4A04B3BC.2060600@mcs.anl.gov> Message-ID: <4A05AB9E.10202@mcs.anl.gov> On 5/9/09 4:22 AM, Ben Clifford wrote: > Swift+falkon+envvars has come up on this list before. > > The test suite almost catches this - it has a test for environment > variables, but that is not regularly run against non-local providers. > > I should probably fix that. > > On Fri, 8 May 2009, Michael Wilde wrote: > >> Yes, under provider-deef and Falkon the entire environment is lost. We think >> that the C worker agent is not receiving or setting the environment. And by the way, this should be easy to fix; Allan, Zhao or I should do that. - Mike >> - Mike >> >> On 5/8/09 4:37 PM, Allan Espinosa wrote: >>> It works with gt2. Is this a pbs problem? >>> >>> Other provider that I am testing that it does not work is on >>> provider-deef. For my purposes, I would like to have it work under >>> provider-deef and provider-pbs >>> >>> Thanks, >>> -Allan >>> >>> 2009/5/8 Mihael Hategan : >>>> Does it also happen if you use GT2 instead of PBS? >>>> >>>> On Fri, 2009-05-08 at 15:46 -0500, Allan Espinosa wrote: >>>>> Hi, >>>>> >>>>> SWIFT_JOBDIR_PATH only works in local executions. 
it looks like the >>>>> environment variable is not being passed for remote job submissions: >>>>> >>>>> sites.xml: >>>>> >>>>> >>>>> >>>>> fast >>>>> >>>>> >>>>> /home/aespinosa/workflows/jobdirpath/sitedir >>>>> >>>> key="SWIFT_JOBDIR_PATH">/home/aespinosa/workflows/localnode >>>>> >>>>> >>>>> >>>>> resulting wrapper log: >>>>> Progress 2009-05-08 15:33:31.258501000-0500 LOG_START >>>>> >>>>> _____________________________________________________________________________ >>>>> >>>>> Wrapper >>>>> _____________________________________________________________________________ >>>>> >>>>> Job directory mode is: link on shared filesystem >>>>> DIR=jobs/v/cat-vbmpqiaj >>>>> EXEC=/bin/cat >>>>> STDIN= >>>>> STDOUT=061-cattwo.out >>>>> STDERR=stderr.txt >>>>> DIRS= >>>>> INF=061-cattwo.1.in|061-cattwo.2.in >>>>> OUTF=061-cattwo.out >>>>> KICKSTART= >>>>> ARGS=061-cattwo.1.in 061-cattwo.2.in >>>>> ARGC=2 >>>>> Progress 2009-05-08 15:33:31.277785000-0500 CREATE_JOBDIR >>>>> Created job directory: jobs/v/cat-vbmpqiaj >>>>> Progress 2009-05-08 15:33:31.289956000-0500 CREATE_INPUTDIR >>>>> Progress 2009-05-08 15:33:31.294467000-0500 LINK_INPUTS >>>>> Linked input: >>>>> /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.1.in >>>>> to jobs/v/cat-vbmpqiaj/061-cattwo.1.in >>>>> Linked input: >>>>> /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g/shared/061-cattwo.2.in >>>>> to jobs/v/cat-vbmpqiaj/061-cattwo.2.in >>>>> Progress 2009-05-08 15:33:31.327217000-0500 EXECUTE >>>>> Moving back to workflow directory >>>>> /home/aespinosa/workflows/jobdirpath/sitedir/061-cattwo-20090508-1533-x71gz29g >>>>> Progress 2009-05-08 15:33:31.338850000-0500 EXECUTE_DONE >>>>> Job ran successfully >>>>> Progress 2009-05-08 15:33:31.343910000-0500 COPYING_OUTPUTS >>>>> Progress 2009-05-08 15:33:31.373251000-0500 RM_JOBDIR >>>>> Progress 2009-05-08 15:33:31.383733000-0500 TOUCH_SUCCESS >>>>> Progress 2009-05-08 15:33:31.398652000-0500 END >>>>> >>>>> the directory "localnode" is also not created. Test peformed using >>>>> 061-cattwo. my testcase script is in ~aespinosa/workflows/jobdirpath >>>>> >>>>> [aespinosa at tp-login1 jobdirpath]$ ./runtest.sh >>>>> Swift svn swift-r2911 cog-r2394 >>>>> >>>>> RunID: remoterun >>>>> Progress: uninitialized:1 >>>>> Progress: Active:1 >>>>> Final status: Finished successfully:1 >>>>> Remote test FAIL >>>>> Swift svn swift-r2911 cog-r2394 >>>>> >>>>> RunID: localrun >>>>> Progress: >>>>> Final status: Finished successfully:1 >>>>> Local test PASS >>>>> >>>>> -Allan >>>>> >>>>> >>>> >>> >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> From zhaozhang at uchicago.edu Mon May 11 12:59:56 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Mon, 11 May 2009 12:59:56 -0500 Subject: [Swift-devel] Condor-G test on UJ-OSG site In-Reply-To: References: <4A0311DF.9050003@uchicago.edu> <1241715819.5713.3.camel@localhost> <4A0319C9.3030803@uchicago.edu> <4A032878.1080404@uchicago.edu> <4A047FD6.1050909@uchicago.edu> Message-ID: <4A08679C.5060107@uchicago.edu> Hi, Ben Ben Clifford wrote: > On Fri, 8 May 2009, Zhao Zhang wrote: > > >> So, for the remote job manger, I could use pbs, condor, fork or SGE which ever >> is available on the remote site, right? >> > > Yes. You must use whichever is installed on the remote site. For OSG thats > what swift-osg-ress-site-catalog tells you. 
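In sites.xml terms, the arrangement Ben described earlier in the thread (local Condor-G submitting through GRAM to whatever manager the remote site runs) is typically written as a condor-provider pool whose gridResource profile names the remote jobmanager. A sketch, with the host name and paths purely illustrative; the tag-stripped FLTECH.xml quoted below appears to be the remains of exactly such a pool entry:

<pool handle="SOMESITE">
  <gridftp url="gsiftp://gatekeeper.example.edu"/>
  <execution provider="condor" url="none"/>
  <profile namespace="globus" key="jobType">grid</profile>
  <profile namespace="globus"
           key="gridResource">gt2 gatekeeper.example.edu/jobmanager-pbs</profile>
  <workdirectory>/osg/data/tmp/SOMESITE</workdirectory>
</pool>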
> Yes, for most of the time, it is correct. Here is a case: FLTECH site. In the site information web site, it says both condor and fork are available, http://vors.grid.iu.edu/cgi-bin/index.cgi?region=0&VO=24&grid=1&dtype=0&res=322#SITELISTING, and swift-osg-ress-site-catalog chose condor as the default job manager. But I couldn't run any job there with condor job manager, while if I use fork, jobs were running. Again, I am attaching the sites.xml file. zhao [zzhang at tp-grid1 condor-g]$ cat FLTECH.xml /mnt/nas0/OSG/DATA/osg/tmp/FLTECH grid gt2 uscms1.fltech-grid3.fit.edu/jobmanager-fork From hategan at mcs.anl.gov Mon May 11 13:07:02 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 11 May 2009 13:07:02 -0500 Subject: [Swift-devel] Condor-G test on UJ-OSG site In-Reply-To: <4A08679C.5060107@uchicago.edu> References: <4A0311DF.9050003@uchicago.edu> <1241715819.5713.3.camel@localhost> <4A0319C9.3030803@uchicago.edu> <4A032878.1080404@uchicago.edu> <4A047FD6.1050909@uchicago.edu> <4A08679C.5060107@uchicago.edu> Message-ID: <1242065222.24534.1.camel@localhost> On Mon, 2009-05-11 at 12:59 -0500, Zhao Zhang wrote: > Hi, Ben > > > Ben Clifford wrote: > > On Fri, 8 May 2009, Zhao Zhang wrote: > > > > > >> So, for the remote job manger, I could use pbs, condor, fork or SGE which ever > >> is available on the remote site, right? > >> > > > > Yes. You must use whichever is installed on the remote site. For OSG thats > > what swift-osg-ress-site-catalog tells you. > > > Yes, for most of the time, it is correct. Here is a case: FLTECH site. > In the site information web site, it says both condor and fork are > available, Fork should always be available, but it's not something you run cluster jobs with. > http://vors.grid.iu.edu/cgi-bin/index.cgi?region=0&VO=24&grid=1&dtype=0&res=322#SITELISTING, > and swift-osg-ress-site-catalog chose condor as the default job manager. > But I couldn't run any job there with condor job manager, Why is that? > while if I use > fork, jobs were running. Again, I am > attaching the sites.xml file. > > zhao > > [zzhang at tp-grid1 condor-g]$ cat FLTECH.xml > > > > > > /mnt/nas0/OSG/DATA/osg/tmp/FLTECH > grid > gt2 > uscms1.fltech-grid3.fit.edu/jobmanager-fork > > From zhaozhang at uchicago.edu Mon May 11 13:07:39 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Mon, 11 May 2009 13:07:39 -0500 Subject: [Swift-devel] Condor-G test on UJ-OSG site In-Reply-To: <1242065222.24534.1.camel@localhost> References: <4A0311DF.9050003@uchicago.edu> <1241715819.5713.3.camel@localhost> <4A0319C9.3030803@uchicago.edu> <4A032878.1080404@uchicago.edu> <4A047FD6.1050909@uchicago.edu> <4A08679C.5060107@uchicago.edu> <1242065222.24534.1.camel@localhost> Message-ID: <4A08696B.2040408@uchicago.edu> Hi, Mihael Mihael Hategan wrote: > On Mon, 2009-05-11 at 12:59 -0500, Zhao Zhang wrote: > >> Hi, Ben >> >> >> Ben Clifford wrote: >> >>> On Fri, 8 May 2009, Zhao Zhang wrote: >>> >>> >>> >>>> So, for the remote job manger, I could use pbs, condor, fork or SGE which ever >>>> is available on the remote site, right? >>>> >>>> >>> Yes. You must use whichever is installed on the remote site. For OSG thats >>> what swift-osg-ress-site-catalog tells you. >>> >>> >> Yes, for most of the time, it is correct. Here is a case: FLTECH site. >> In the site information web site, it says both condor and fork are >> available, >> > > Fork should always be available, but it's not something you run cluster > jobs with. > What does "cluster jobs" mean? 
> >> http://vors.grid.iu.edu/cgi-bin/index.cgi?region=0&VO=24&grid=1&dtype=0&res=322#SITELISTING, >> and swift-osg-ress-site-catalog chose condor as the default job manager. >> But I couldn't run any job there with condor job manager, >> > > Why is that? > Hmm, it is working now. But when I tried it on Friday, my swift client was just hanging there. Is there any schedule policy difference between those two job manager? zhao > >> while if I use >> fork, jobs were running. Again, I am >> attaching the sites.xml file. >> >> zhao >> >> [zzhang at tp-grid1 condor-g]$ cat FLTECH.xml >> >> >> >> >> >> /mnt/nas0/OSG/DATA/osg/tmp/FLTECH >> grid >> gt2 >> uscms1.fltech-grid3.fit.edu/jobmanager-fork >> >> >> > > > From hategan at mcs.anl.gov Mon May 11 13:40:15 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 11 May 2009 13:40:15 -0500 Subject: [Swift-devel] Condor-G test on UJ-OSG site In-Reply-To: <4A08696B.2040408@uchicago.edu> References: <4A0311DF.9050003@uchicago.edu> <1241715819.5713.3.camel@localhost> <4A0319C9.3030803@uchicago.edu> <4A032878.1080404@uchicago.edu> <4A047FD6.1050909@uchicago.edu> <4A08679C.5060107@uchicago.edu> <1242065222.24534.1.camel@localhost> <4A08696B.2040408@uchicago.edu> Message-ID: <1242067215.24922.1.camel@localhost> On Mon, 2009-05-11 at 13:07 -0500, Zhao Zhang wrote: > Hi, Mihael > > Mihael Hategan wrote: > > On Mon, 2009-05-11 at 12:59 -0500, Zhao Zhang wrote: > > > >> Hi, Ben > >> > >> > >> Ben Clifford wrote: > >> > >>> On Fri, 8 May 2009, Zhao Zhang wrote: > >>> > >>> > >>> > >>>> So, for the remote job manger, I could use pbs, condor, fork or SGE which ever > >>>> is available on the remote site, right? > >>>> > >>>> > >>> Yes. You must use whichever is installed on the remote site. For OSG thats > >>> what swift-osg-ress-site-catalog tells you. > >>> > >>> > >> Yes, for most of the time, it is correct. Here is a case: FLTECH site. > >> In the site information web site, it says both condor and fork are > >> available, > >> > > > > Fork should always be available, but it's not something you run cluster > > jobs with. > > > What does "cluster jobs" mean? I shouldn't come up with terms. Fork runs jobs on the head node. PBS/condor/etc run jobs on the worker nodes. Fork is not to be used for swift jobs. > > > >> http://vors.grid.iu.edu/cgi-bin/index.cgi?region=0&VO=24&grid=1&dtype=0&res=322#SITELISTING, > >> and swift-osg-ress-site-catalog chose condor as the default job manager. > >> But I couldn't run any job there with condor job manager, > >> > > > > Why is that? > > > Hmm, it is working now. But when I tried it on Friday, my swift client > was just hanging there. Is there any schedule policy difference between > those two job manager? Yes there is. > > zhao > > > >> while if I use > >> fork, jobs were running. Again, I am > >> attaching the sites.xml file. 
> >> > >> zhao > >> > >> [zzhang at tp-grid1 condor-g]$ cat FLTECH.xml > >> > >> > >> > >> > >> > >> /mnt/nas0/OSG/DATA/osg/tmp/FLTECH > >> grid > >> gt2 > >> uscms1.fltech-grid3.fit.edu/jobmanager-fork > >> > >> > >> > > > > > > From yizhu at cs.uchicago.edu Mon May 11 18:08:36 2009 From: yizhu at cs.uchicago.edu (yizhu) Date: Mon, 11 May 2009 18:08:36 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A0323B7.8080206@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> Message-ID: <4A08AFF4.608@cs.uchicago.edu> Hi, I've now got the swift system running on Amazon EC2 with swift installed on head-node of the cluster. I will soon let it the user submit the job from its own machine after i solve the authentication issues of Globus. I think my next step is to write a sample swift code to check if we can let swift grab input files from S3, execute it, and then write the output files back to S3. Since each Amazon virtual node only has limited storage space( Small instance for 1.7GB, Large Instance for 7.5GB), we may need to use EBS(Elastic Block Store) to storage temp files created by swift. The EBS behaved like a hard disk volume and can be mounted by any virtual nodes. But here arise a problem, since a volume can only be attached to one instance at a time[1], so the files store in EBS mounted by one node won't be shared by any other nodes; we lost the file sharing ability which I think is a fundamental requirement for swift. -Yi Michael Wilde wrote: > > > On 5/7/09 12:54 PM, Tim Freeman wrote: >> On Thu, 07 May 2009 11:39:40 -0500 >> Yi Zhu wrote: >> >>> Michael Wilde wrote: >>>> Very good! >>>> >>>> Now, what kind of tests can you do next? >>> Next, I will try to let swift running on Amazon EC2. >>> >>>> Can you exercise the cluster with an interesting workflow? >>> Yes, Is there any complex sample/tools i can use (rahter than >>> first.swift) to test swift performance? Is there any benchmark >>> available i can compare with? >>> >>>> How large of a cluster can you assemble in a Nimbus workspace ? >>> Since the vm-image i use to test 'swift' is based on NFS shared file >>> system, the performance may not be satisfiable if the we have a large >>> scale of cluster. After I got the swift running on Amazon EC2, I will >>> try to make a dedicate vm-image by using GPFS or any other shared >>> file system you recommended. >>> >>> >>>> Can you aggregate VM's from a few different physical clusters into >>>> one Nimbus workspace? >>> I don't think so. Tim may make commit on it. >> >> There is some work going on right now making auto-configuration easier >> to do >> over multiple clusters (that is possible now, it's just very 'manual' and >> non-ideal unlike with one physical cluster). You wouldn't really want >> to do NFS >> across a WAN, though. > > Indeed. 
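[The stage-in/stage-out pattern Yi outlines above -- grab inputs from S3, execute, write outputs back -- is easy to prototype with a command-line S3 client before wiring it into Swift. A rough sketch using s3cmd; the bucket name, key names, and application binary are placeholders, not anything from the thread:

    #!/bin/bash
    # Hypothetical per-job wrapper: stage in from S3, compute locally, stage out.
    set -e
    BUCKET=s3://my-swift-bucket          # placeholder bucket
    WORK=/tmp/job.$$                     # per-job scratch on the instance's local disk
    mkdir -p $WORK && cd $WORK

    s3cmd get $BUCKET/input/data.in data.in        # stage in
    $HOME/myapp data.in data.out                   # placeholder application binary
    s3cmd put data.out $BUCKET/output/data.out     # stage out
    cd / && rm -rf $WORK
]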
Now that I think this through more clearly, one workspace == one > cluster == one Swift "site", so we could aggregate the resources of > multiple workspaces through Swift to execute a multi-site workflow. > > - Mike > >>> >>>> What's the largest cluster you can assemble with Nimbus? >>> I am not quite sure,I will do some test onto it soon. since it is a >>> EC2-like cloud, it should easily be configured as a cluster with >>> hundreds of nodes. Tim may make commit on it. >> >> I've heard of EC2 deployments in the 1000s at once, it's up to your >> EC2 account >> limitations (they seem pretty efficient with making your quota >> ever-higher). >> Nimbus installation at Teraport maxes out at 16, there are other 'science >> clouds' but I don't know their node numbers. EC2 is the place where >> you will >> be able to really test scaling something. >> >> Tim > From wilde at mcs.anl.gov Tue May 12 07:30:23 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 12 May 2009 07:30:23 -0500 Subject: [Swift-devel] [Fwd: Re: adem-osg] Message-ID: <4A096BDF.5030908@mcs.anl.gov> Here is the latest "ADEM" software installer from Zhengxiong. - Mike -------- Original Message -------- Subject: Re: adem-osg Date: Mon, 11 May 2009 20:39:02 -0500 (CDT) From: Zhengxiong Hou To: Michael Wilde CC: zhaozhang at uchicago.edu Greetings, I remember that I did some modifications. Please download the tarball, including the document. http://www.ci.uchicago.edu/~houzx/adem-osg.tar.gz Thanks! Zhengxiong ---- Original message ---- >Date: Thu, 07 May 2009 14:40:32 -0500 >From: Michael Wilde >Subject: [Fwd: Re: adem user manual 0.2] >To: Zhao Zhang , Zhengxiong Hou >Cc: swift-devel > >Zhao, I think this is the latest ADEM document. > >Zhengxiong, if there is something more recent, can you sent it to us? > >We'd like to start testing it a bit more. > >Thanks, > >Mike > > >-------- Original Message -------- >Subject: Re: adem user manual 0.2 >Date: Mon, 16 Feb 2009 23:28:14 -0600 >From: Zhengxiong Hou >To: Michael Wilde >References: <49999702.3000701 at mcs.anl.gov> <4999A5CD.20109 at mcs.anl.gov> > >Hi Mike, > The attached is the updated adem user manual 0.2. > Please download the adem-osg.tar.gz and run it again. > > >Thanks much, >Regards! >Zhengxiong > >________________ >User Manual-ADEM for OSG-0.2.doc (837k bytes) From iraicu at cs.uchicago.edu Tue May 12 11:19:49 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 12 May 2009 11:19:49 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A08AFF4.608@cs.uchicago.edu> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> Message-ID: <4A09A1A5.3080900@cs.uchicago.edu> Hi, yizhu wrote: > Hi, > > I've now got the swift system running on Amazon EC2 with swift > installed on head-node of the cluster. 
I will soon let it the user > submit the job from its own machine after i solve the authentication > issues of Globus. > > > I think my next step is to write a sample swift code to check if we > can let swift grab input files from S3, execute it, and then write the > output files back to S3. > > > Since each Amazon virtual node only has limited storage space( Small > instance for 1.7GB, Large Instance for 7.5GB), Are you sure? That sounds like like the amount of RAM each instance gets. The last time I used EC2 (more than a year ago), each instance had a disk of 100GB+ each, which could be treated as a scratch disk that would be lost when the instance was powered down. > we may need to use EBS(Elastic Block Store) to storage temp files > created by swift. The EBS behaved like a hard disk volume and can be > mounted by any virtual nodes. But here arise a problem, since a volume > can only be attached to one instance at a time[1], so the files store > in EBS mounted by one node won't be shared by any other nodes; The EBS sounds like the same thing as the local disk, except that it is persisted in S3, and can be recovered when the instance is started again later. You can use these local disks, or EBS disks, to create a shared/parallel file system, or manage them yourself. > we lost the file sharing ability which I think is a fundamental > requirement for swift. The last time I worked with the Workspace Service (now part of Nimbus), we were able to use NFS to create a shared file system across our virtual cluster. This allowed us to run Swift without modifications on our virtual cluster. Some of the more recent work on Swift might have eased up some of the requirements for a shared file system, so you might be able to run Swift without a shared file system, if you configure Swift just right. Mike, is this true, or we aren't quite there yet? Ioan > > > -Yi > > > > > > Michael Wilde wrote: >> >> >> On 5/7/09 12:54 PM, Tim Freeman wrote: >>> On Thu, 07 May 2009 11:39:40 -0500 >>> Yi Zhu wrote: >>> >>>> Michael Wilde wrote: >>>>> Very good! >>>>> >>>>> Now, what kind of tests can you do next? >>>> Next, I will try to let swift running on Amazon EC2. >>>> >>>>> Can you exercise the cluster with an interesting workflow? >>>> Yes, Is there any complex sample/tools i can use (rahter than >>>> first.swift) to test swift performance? Is there any benchmark >>>> available i can compare with? >>>> >>>>> How large of a cluster can you assemble in a Nimbus workspace ? >>>> Since the vm-image i use to test 'swift' is based on NFS shared >>>> file system, the performance may not be satisfiable if the we have >>>> a large scale of cluster. After I got the swift running on Amazon >>>> EC2, I will try to make a dedicate vm-image by using GPFS or any >>>> other shared file system you recommended. >>>> >>>> >>>>> Can you aggregate VM's from a few different physical clusters into >>>>> one Nimbus workspace? >>>> I don't think so. Tim may make commit on it. >>> >>> There is some work going on right now making auto-configuration >>> easier to do >>> over multiple clusters (that is possible now, it's just very >>> 'manual' and >>> non-ideal unlike with one physical cluster). You wouldn't really >>> want to do NFS >>> across a WAN, though. >> >> Indeed. Now that I think this through more clearly, one workspace == >> one cluster == one Swift "site", so we could aggregate the resources >> of multiple workspaces through Swift to execute a multi-site workflow. 
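[One concrete way to get the shared file system Ioan mentions out of non-shareable EBS volumes: attach the volume to the head node only and export it over NFS, exactly as a physical cluster would. A sketch, assuming the NFS server is running; the device, mount point, and 10.0.0.0/24 cluster subnet are assumptions:

    # on the head node, which has the EBS volume attached as /dev/sdf
    mount /dev/sdf /mnt/shared
    echo "/mnt/shared 10.0.0.0/255.255.255.0(rw,no_root_squash,sync)" >> /etc/exports
    exportfs -a                     # publish the export

    # on each worker node
    mount -t nfs head-node:/mnt/shared /mnt/shared
]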
>> >> - Mike >> >>>> >>>>> What's the largest cluster you can assemble with Nimbus? >>>> I am not quite sure,I will do some test onto it soon. since it is a >>>> EC2-like cloud, it should easily be configured as a cluster with >>>> hundreds of nodes. Tim may make commit on it. >>> >>> I've heard of EC2 deployments in the 1000s at once, it's up to your >>> EC2 account >>> limitations (they seem pretty efficient with making your quota >>> ever-higher). >>> Nimbus installation at Teraport maxes out at 16, there are other >>> 'science >>> clouds' but I don't know their node numbers. EC2 is the place where >>> you will >>> be able to really test scaling something. >>> >>> Tim >> > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From wilde at mcs.anl.gov Tue May 12 11:32:37 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 12 May 2009 11:32:37 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A09A1A5.3080900@cs.uchicago.edu> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> <4A09A1A5.3080900@cs.uchicago.edu> Message-ID: <4A09A4A5.3060506@mcs.anl.gov> Maybe relevant - there's a FUSE filesystem to mount an S3 bucket as a filesystem: http://code.google.com/p/s3fs/wiki/FuseOverAmazon - Mike On 5/12/09 11:19 AM, Ioan Raicu wrote: > Hi, > > yizhu wrote: >> Hi, >> >> I've now got the swift system running on Amazon EC2 with swift >> installed on head-node of the cluster. I will soon let it the user >> submit the job from its own machine after i solve the authentication >> issues of Globus. >> >> >> I think my next step is to write a sample swift code to check if we >> can let swift grab input files from S3, execute it, and then write the >> output files back to S3. >> >> >> Since each Amazon virtual node only has limited storage space( Small >> instance for 1.7GB, Large Instance for 7.5GB), > Are you sure? That sounds like like the amount of RAM each instance > gets. The last time I used EC2 (more than a year ago), each instance had > a disk of 100GB+ each, which could be treated as a scratch disk that > would be lost when the instance was powered down. >> we may need to use EBS(Elastic Block Store) to storage temp files >> created by swift. The EBS behaved like a hard disk volume and can be >> mounted by any virtual nodes. 
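[For anyone trying the s3fs route Mike links above, usage at the time was roughly the following; the exact option names should be double-checked against the FuseOverAmazon page, so treat them as assumptions:

    # mount an S3 bucket as a local filesystem via FUSE
    s3fs my-swift-bucket /mnt/s3 -o accessKeyId=AKIA... -o secretAccessKey=...
    ls /mnt/s3          # bucket keys appear as files
    fusermount -u /mnt/s3   # unmount when done
]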
But here arise a problem, since a volume >> can only be attached to one instance at a time[1], so the files store >> in EBS mounted by one node won't be shared by any other nodes; > The EBS sounds like the same thing as the local disk, except that it is > persisted in S3, and can be recovered when the instance is started again > later. > > You can use these local disks, or EBS disks, to create a shared/parallel > file system, or manage them yourself. >> we lost the file sharing ability which I think is a fundamental >> requirement for swift. > The last time I worked with the Workspace Service (now part of Nimbus), > we were able to use NFS to create a shared file system across our > virtual cluster. This allowed us to run Swift without modifications on > our virtual cluster. Some of the more recent work on Swift might have > eased up some of the requirements for a shared file system, so you might > be able to run Swift without a shared file system, if you configure > Swift just right. Mike, is this true, or we aren't quite there yet? > > Ioan >> >> >> -Yi >> >> >> >> >> >> Michael Wilde wrote: >>> >>> >>> On 5/7/09 12:54 PM, Tim Freeman wrote: >>>> On Thu, 07 May 2009 11:39:40 -0500 >>>> Yi Zhu wrote: >>>> >>>>> Michael Wilde wrote: >>>>>> Very good! >>>>>> >>>>>> Now, what kind of tests can you do next? >>>>> Next, I will try to let swift running on Amazon EC2. >>>>> >>>>>> Can you exercise the cluster with an interesting workflow? >>>>> Yes, Is there any complex sample/tools i can use (rahter than >>>>> first.swift) to test swift performance? Is there any benchmark >>>>> available i can compare with? >>>>> >>>>>> How large of a cluster can you assemble in a Nimbus workspace ? >>>>> Since the vm-image i use to test 'swift' is based on NFS shared >>>>> file system, the performance may not be satisfiable if the we have >>>>> a large scale of cluster. After I got the swift running on Amazon >>>>> EC2, I will try to make a dedicate vm-image by using GPFS or any >>>>> other shared file system you recommended. >>>>> >>>>> >>>>>> Can you aggregate VM's from a few different physical clusters into >>>>>> one Nimbus workspace? >>>>> I don't think so. Tim may make commit on it. >>>> >>>> There is some work going on right now making auto-configuration >>>> easier to do >>>> over multiple clusters (that is possible now, it's just very >>>> 'manual' and >>>> non-ideal unlike with one physical cluster). You wouldn't really >>>> want to do NFS >>>> across a WAN, though. >>> >>> Indeed. Now that I think this through more clearly, one workspace == >>> one cluster == one Swift "site", so we could aggregate the resources >>> of multiple workspaces through Swift to execute a multi-site workflow. >>> >>> - Mike >>> >>>>> >>>>>> What's the largest cluster you can assemble with Nimbus? >>>>> I am not quite sure,I will do some test onto it soon. since it is a >>>>> EC2-like cloud, it should easily be configured as a cluster with >>>>> hundreds of nodes. Tim may make commit on it. >>>> >>>> I've heard of EC2 deployments in the 1000s at once, it's up to your >>>> EC2 account >>>> limitations (they seem pretty efficient with making your quota >>>> ever-higher). >>>> Nimbus installation at Teraport maxes out at 16, there are other >>>> 'science >>>> clouds' but I don't know their node numbers. EC2 is the place where >>>> you will >>>> be able to really test scaling something. 
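[Yi's plan of acquiring EBS storage only when local scratch runs short maps to a few API calls. With the ec2-api-tools of that era it would look roughly like this; the IDs, size, and zone are placeholders:

    # create and attach a 100 GB volume in the instance's availability zone
    ec2-create-volume --size 100 -z us-east-1a        # prints a vol-xxxxxxxx id
    ec2-attach-volume vol-xxxxxxxx -i i-yyyyyyyy -d /dev/sdf

    # first use only: make a filesystem, then mount it as extra scratch
    mkfs.ext3 /dev/sdf
    mkdir -p /mnt/ebs && mount /dev/sdf /mnt/ebs
]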
>>>> >>>> Tim >>> >> >> > From yizhu at cs.uchicago.edu Tue May 12 11:35:29 2009 From: yizhu at cs.uchicago.edu (yizhu) Date: Tue, 12 May 2009 11:35:29 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A09A1A5.3080900@cs.uchicago.edu> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> <4A09A1A5.3080900@cs.uchicago.edu> Message-ID: <4A09A551.6000605@cs.uchicago.edu> Ioan Raicu wrote: > Hi, > > yizhu wrote: >> Hi, >> >> I've now got the swift system running on Amazon EC2 with swift >> installed on head-node of the cluster. I will soon let it the user >> submit the job from its own machine after i solve the authentication >> issues of Globus. >> >> >> I think my next step is to write a sample swift code to check if we >> can let swift grab input files from S3, execute it, and then write the >> output files back to S3. >> >> >> Since each Amazon virtual node only has limited storage space( Small >> instance for 1.7GB, Large Instance for 7.5GB), > Are you sure? That sounds like like the amount of RAM each instance > gets. The last time I used EC2 (more than a year ago), each instance had > a disk of 100GB+ each, which could be treated as a scratch disk that > would be lost when the instance was powered down. you are right, the 1.7GB and 7.5GB stuff are RAM; we can get 160GB storage and 850GB storage for small and large instance, respectively. combine of those storage from each instance, The Shared File System should be large enough in most case. If it is not, we can use EBS to get extra storage, since the volume of EBS can be set and mount after instance has been launched, we can only acquire for EBS storage when the local storage isn't enough (EBS costs money ) >> we may need to use EBS(Elastic Block Store) to storage temp files >> created by swift. The EBS behaved like a hard disk volume and can be >> mounted by any virtual nodes. But here arise a problem, since a volume >> can only be attached to one instance at a time[1], so the files store >> in EBS mounted by one node won't be shared by any other nodes; > The EBS sounds like the same thing as the local disk, except that it is > persisted in S3, and can be recovered when the instance is started again > later. > > You can use these local disks, or EBS disks, to create a shared/parallel > file system, or manage them yourself. >> we lost the file sharing ability which I think is a fundamental >> requirement for swift. > The last time I worked with the Workspace Service (now part of Nimbus), > we were able to use NFS to create a shared file system across our > virtual cluster. This allowed us to run Swift without modifications on > our virtual cluster. 
Some of the more recent work on Swift might have > eased up some of the requirements for a shared file system, so you might > be able to run Swift without a shared file system, if you configure > Swift just right. Mike, is this true, or we aren't quite there yet? > > Ioan >> >> >> -Yi >> >> >> >> >> >> Michael Wilde wrote: >>> >>> >>> On 5/7/09 12:54 PM, Tim Freeman wrote: >>>> On Thu, 07 May 2009 11:39:40 -0500 >>>> Yi Zhu wrote: >>>> >>>>> Michael Wilde wrote: >>>>>> Very good! >>>>>> >>>>>> Now, what kind of tests can you do next? >>>>> Next, I will try to let swift running on Amazon EC2. >>>>> >>>>>> Can you exercise the cluster with an interesting workflow? >>>>> Yes, Is there any complex sample/tools i can use (rahter than >>>>> first.swift) to test swift performance? Is there any benchmark >>>>> available i can compare with? >>>>> >>>>>> How large of a cluster can you assemble in a Nimbus workspace ? >>>>> Since the vm-image i use to test 'swift' is based on NFS shared >>>>> file system, the performance may not be satisfiable if the we have >>>>> a large scale of cluster. After I got the swift running on Amazon >>>>> EC2, I will try to make a dedicate vm-image by using GPFS or any >>>>> other shared file system you recommended. >>>>> >>>>> >>>>>> Can you aggregate VM's from a few different physical clusters into >>>>>> one Nimbus workspace? >>>>> I don't think so. Tim may make commit on it. >>>> >>>> There is some work going on right now making auto-configuration >>>> easier to do >>>> over multiple clusters (that is possible now, it's just very >>>> 'manual' and >>>> non-ideal unlike with one physical cluster). You wouldn't really >>>> want to do NFS >>>> across a WAN, though. >>> >>> Indeed. Now that I think this through more clearly, one workspace == >>> one cluster == one Swift "site", so we could aggregate the resources >>> of multiple workspaces through Swift to execute a multi-site workflow. >>> >>> - Mike >>> >>>>> >>>>>> What's the largest cluster you can assemble with Nimbus? >>>>> I am not quite sure,I will do some test onto it soon. since it is a >>>>> EC2-like cloud, it should easily be configured as a cluster with >>>>> hundreds of nodes. Tim may make commit on it. >>>> >>>> I've heard of EC2 deployments in the 1000s at once, it's up to your >>>> EC2 account >>>> limitations (they seem pretty efficient with making your quota >>>> ever-higher). >>>> Nimbus installation at Teraport maxes out at 16, there are other >>>> 'science >>>> clouds' but I don't know their node numbers. EC2 is the place where >>>> you will >>>> be able to really test scaling something. 
>>>> >>>> Tim >>> >> >> > From yizhu at cs.uchicago.edu Tue May 12 11:42:00 2009 From: yizhu at cs.uchicago.edu (yizhu) Date: Tue, 12 May 2009 11:42:00 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A09A4A5.3060506@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> <4A09A1A5.3080900@cs.uchicago.edu> <4A09A4A5.3060506@mcs.anl.gov> Message-ID: <4A09A6D8.1090707@cs.uchicago.edu> That's great. It makes file management between EC2 and S3 much easier, I will definitely check it out. Michael Wilde wrote: > Maybe relevant - there's a FUSE filesystem to mount an S3 bucket as a > filesystem: > > http://code.google.com/p/s3fs/wiki/FuseOverAmazon > > - Mike > > > On 5/12/09 11:19 AM, Ioan Raicu wrote: >> Hi, >> >> yizhu wrote: >>> Hi, >>> >>> I've now got the swift system running on Amazon EC2 with swift >>> installed on head-node of the cluster. I will soon let it the user >>> submit the job from its own machine after i solve the authentication >>> issues of Globus. >>> >>> >>> I think my next step is to write a sample swift code to check if we >>> can let swift grab input files from S3, execute it, and then write >>> the output files back to S3. >>> >>> >>> Since each Amazon virtual node only has limited storage space( Small >>> instance for 1.7GB, Large Instance for 7.5GB), >> Are you sure? That sounds like like the amount of RAM each instance >> gets. The last time I used EC2 (more than a year ago), each instance >> had a disk of 100GB+ each, which could be treated as a scratch disk >> that would be lost when the instance was powered down. >>> we may need to use EBS(Elastic Block Store) to storage temp files >>> created by swift. The EBS behaved like a hard disk volume and can be >>> mounted by any virtual nodes. But here arise a problem, since a >>> volume can only be attached to one instance at a time[1], so the >>> files store in EBS mounted by one node won't be shared by any other >>> nodes; >> The EBS sounds like the same thing as the local disk, except that it >> is persisted in S3, and can be recovered when the instance is started >> again later. >> >> You can use these local disks, or EBS disks, to create a >> shared/parallel file system, or manage them yourself. >>> we lost the file sharing ability which I think is a fundamental >>> requirement for swift. >> The last time I worked with the Workspace Service (now part of >> Nimbus), we were able to use NFS to create a shared file system across >> our virtual cluster. This allowed us to run Swift without >> modifications on our virtual cluster. Some of the more recent work on >> Swift might have eased up some of the requirements for a shared file >> system, so you might be able to run Swift without a shared file >> system, if you configure Swift just right. Mike, is this true, or we >> aren't quite there yet? 
>> >> Ioan >>> >>> >>> -Yi >>> >>> >>> >>> >>> >>> Michael Wilde wrote: >>>> >>>> >>>> On 5/7/09 12:54 PM, Tim Freeman wrote: >>>>> On Thu, 07 May 2009 11:39:40 -0500 >>>>> Yi Zhu wrote: >>>>> >>>>>> Michael Wilde wrote: >>>>>>> Very good! >>>>>>> >>>>>>> Now, what kind of tests can you do next? >>>>>> Next, I will try to let swift running on Amazon EC2. >>>>>> >>>>>>> Can you exercise the cluster with an interesting workflow? >>>>>> Yes, Is there any complex sample/tools i can use (rahter than >>>>>> first.swift) to test swift performance? Is there any benchmark >>>>>> available i can compare with? >>>>>> >>>>>>> How large of a cluster can you assemble in a Nimbus workspace ? >>>>>> Since the vm-image i use to test 'swift' is based on NFS shared >>>>>> file system, the performance may not be satisfiable if the we have >>>>>> a large scale of cluster. After I got the swift running on Amazon >>>>>> EC2, I will try to make a dedicate vm-image by using GPFS or any >>>>>> other shared file system you recommended. >>>>>> >>>>>> >>>>>>> Can you aggregate VM's from a few different physical clusters >>>>>>> into one Nimbus workspace? >>>>>> I don't think so. Tim may make commit on it. >>>>> >>>>> There is some work going on right now making auto-configuration >>>>> easier to do >>>>> over multiple clusters (that is possible now, it's just very >>>>> 'manual' and >>>>> non-ideal unlike with one physical cluster). You wouldn't really >>>>> want to do NFS >>>>> across a WAN, though. >>>> >>>> Indeed. Now that I think this through more clearly, one workspace == >>>> one cluster == one Swift "site", so we could aggregate the resources >>>> of multiple workspaces through Swift to execute a multi-site workflow. >>>> >>>> - Mike >>>> >>>>>> >>>>>>> What's the largest cluster you can assemble with Nimbus? >>>>>> I am not quite sure,I will do some test onto it soon. since it is >>>>>> a EC2-like cloud, it should easily be configured as a cluster with >>>>>> hundreds of nodes. Tim may make commit on it. >>>>> >>>>> I've heard of EC2 deployments in the 1000s at once, it's up to your >>>>> EC2 account >>>>> limitations (they seem pretty efficient with making your quota >>>>> ever-higher). >>>>> Nimbus installation at Teraport maxes out at 16, there are other >>>>> 'science >>>>> clouds' but I don't know their node numbers. EC2 is the place >>>>> where you will >>>>> be able to really test scaling something. 
>>>>> >>>>> Tim >>>> >>> >>> >> > From tfreeman at mcs.anl.gov Tue May 12 11:53:29 2009 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Tue, 12 May 2009 11:53:29 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A09A6D8.1090707@cs.uchicago.edu> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> <4A09A1A5.3080900@cs.uchicago.edu> <4A09A4A5.3060506@mcs.anl.gov> <4A09A6D8.1090707@cs.uchicago.edu> Message-ID: <20090512115329.1be32f10@sietch> On Tue, 12 May 2009 11:42:00 -0500 yizhu wrote: > That's great. It makes file management between EC2 and S3 much easier, I > will definitely check it out. > Note that every I/O op to S3 will be significantly slower than EBS (and probably also slower than the local scratch disk). Add in FUSE (userspace) and that only compounds the problem. So if you use this option I would suggest taking measurements to make sure it is acceptable for the task at hand. Tim > Michael Wilde wrote: > > Maybe relevant - there's a FUSE filesystem to mount an S3 bucket as a > > filesystem: > > > > http://code.google.com/p/s3fs/wiki/FuseOverAmazon > > > > - Mike > > > > > > On 5/12/09 11:19 AM, Ioan Raicu wrote: > >> Hi, > >> > >> yizhu wrote: > >>> Hi, > >>> > >>> I've now got the swift system running on Amazon EC2 with swift > >>> installed on head-node of the cluster. I will soon let it the user > >>> submit the job from its own machine after i solve the authentication > >>> issues of Globus. > >>> > >>> > >>> I think my next step is to write a sample swift code to check if we > >>> can let swift grab input files from S3, execute it, and then write > >>> the output files back to S3. > >>> > >>> > >>> Since each Amazon virtual node only has limited storage space( Small > >>> instance for 1.7GB, Large Instance for 7.5GB), > >> Are you sure? That sounds like like the amount of RAM each instance > >> gets. The last time I used EC2 (more than a year ago), each instance > >> had a disk of 100GB+ each, which could be treated as a scratch disk > >> that would be lost when the instance was powered down. > >>> we may need to use EBS(Elastic Block Store) to storage temp files > >>> created by swift. The EBS behaved like a hard disk volume and can be > >>> mounted by any virtual nodes. But here arise a problem, since a > >>> volume can only be attached to one instance at a time[1], so the > >>> files store in EBS mounted by one node won't be shared by any other > >>> nodes; > >> The EBS sounds like the same thing as the local disk, except that it > >> is persisted in S3, and can be recovered when the instance is started > >> again later. > >> > >> You can use these local disks, or EBS disks, to create a > >> shared/parallel file system, or manage them yourself. > >>> we lost the file sharing ability which I think is a fundamental > >>> requirement for swift. 
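[Tim's advice above to take measurements before committing to s3fs can start with something as crude as timing a large sequential write and read against each candidate store -- local scratch, EBS, and an s3fs mount. A sketch; the three mount points are assumptions:

    for target in /mnt/scratch /mnt/ebs /mnt/s3; do
        echo "== $target =="
        # sequential 1 GB write, forced out to the backing store
        dd if=/dev/zero of=$target/ddtest bs=1M count=1024 conv=fsync 2>&1 | tail -1
        echo 3 > /proc/sys/vm/drop_caches   # needs root; keeps the read honest
        # sequential read back
        dd if=$target/ddtest of=/dev/null bs=1M 2>&1 | tail -1
        rm -f $target/ddtest
    done
]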
> >> The last time I worked with the Workspace Service (now part of > >> Nimbus), we were able to use NFS to create a shared file system across > >> our virtual cluster. This allowed us to run Swift without > >> modifications on our virtual cluster. Some of the more recent work on > >> Swift might have eased up some of the requirements for a shared file > >> system, so you might be able to run Swift without a shared file > >> system, if you configure Swift just right. Mike, is this true, or we > >> aren't quite there yet? > >> > >> Ioan > >>> > >>> > >>> -Yi > >>> > >>> > >>> > >>> > >>> > >>> Michael Wilde wrote: > >>>> > >>>> > >>>> On 5/7/09 12:54 PM, Tim Freeman wrote: > >>>>> On Thu, 07 May 2009 11:39:40 -0500 > >>>>> Yi Zhu wrote: > >>>>> > >>>>>> Michael Wilde wrote: > >>>>>>> Very good! > >>>>>>> > >>>>>>> Now, what kind of tests can you do next? > >>>>>> Next, I will try to let swift running on Amazon EC2. > >>>>>> > >>>>>>> Can you exercise the cluster with an interesting workflow? > >>>>>> Yes, Is there any complex sample/tools i can use (rahter than > >>>>>> first.swift) to test swift performance? Is there any benchmark > >>>>>> available i can compare with? > >>>>>> > >>>>>>> How large of a cluster can you assemble in a Nimbus workspace ? > >>>>>> Since the vm-image i use to test 'swift' is based on NFS shared > >>>>>> file system, the performance may not be satisfiable if the we have > >>>>>> a large scale of cluster. After I got the swift running on Amazon > >>>>>> EC2, I will try to make a dedicate vm-image by using GPFS or any > >>>>>> other shared file system you recommended. > >>>>>> > >>>>>> > >>>>>>> Can you aggregate VM's from a few different physical clusters > >>>>>>> into one Nimbus workspace? > >>>>>> I don't think so. Tim may make commit on it. > >>>>> > >>>>> There is some work going on right now making auto-configuration > >>>>> easier to do > >>>>> over multiple clusters (that is possible now, it's just very > >>>>> 'manual' and > >>>>> non-ideal unlike with one physical cluster). You wouldn't really > >>>>> want to do NFS > >>>>> across a WAN, though. > >>>> > >>>> Indeed. Now that I think this through more clearly, one workspace == > >>>> one cluster == one Swift "site", so we could aggregate the resources > >>>> of multiple workspaces through Swift to execute a multi-site workflow. > >>>> > >>>> - Mike > >>>> > >>>>>> > >>>>>>> What's the largest cluster you can assemble with Nimbus? > >>>>>> I am not quite sure,I will do some test onto it soon. since it is > >>>>>> a EC2-like cloud, it should easily be configured as a cluster with > >>>>>> hundreds of nodes. Tim may make commit on it. > >>>>> > >>>>> I've heard of EC2 deployments in the 1000s at once, it's up to your > >>>>> EC2 account > >>>>> limitations (they seem pretty efficient with making your quota > >>>>> ever-higher). > >>>>> Nimbus installation at Teraport maxes out at 16, there are other > >>>>> 'science > >>>>> clouds' but I don't know their node numbers. EC2 is the place > >>>>> where you will > >>>>> be able to really test scaling something. 
> >>>>> > >>>>> Tim > >>>> > >>> > >>> > >> > > > From wilde at mcs.anl.gov Tue May 12 12:04:23 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 12 May 2009 12:04:23 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <20090512115329.1be32f10@sietch> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> <4A09A1A5.3080900@cs.uchicago.edu> <4A09A4A5.3060506@mcs.anl.gov> <4A09A6D8.1090707@cs.uchicago.edu> <20090512115329.1be32f10@sietch> Message-ID: <4A09AC17.6000008@mcs.anl.gov> Indeed. I should note that the intent in Swift is to reduce or eliminate accesses to shared filesystems, so EBS and S3 would become similar. The basic model of running a job would be: pull in the data you need, operate on it on local disk, push it back. Cache locally where possible. Only read direct from shared storage when the data is too big. This is pretty much what happens now, except that the "pull" and "push" are staged via a site shared filesystem. - Mike On 5/12/09 11:53 AM, Tim Freeman wrote: > On Tue, 12 May 2009 11:42:00 -0500 > yizhu wrote: > >> That's great. It makes file management between EC2 and S3 much easier, I >> will definitely check it out. >> > > Note that every I/O op to S3 will be significantly slower than EBS (and > probably also slower than the local scratch disk). Add in FUSE (userspace) and > that only compounds the problem. So if you use this option I would suggest > taking measurements to make sure it is acceptable for the task at hand. > > Tim > >> Michael Wilde wrote: >>> Maybe relevant - there's a FUSE filesystem to mount an S3 bucket as a >>> filesystem: >>> >>> http://code.google.com/p/s3fs/wiki/FuseOverAmazon >>> >>> - Mike >>> >>> >>> On 5/12/09 11:19 AM, Ioan Raicu wrote: >>>> Hi, >>>> >>>> yizhu wrote: >>>>> Hi, >>>>> >>>>> I've now got the swift system running on Amazon EC2 with swift >>>>> installed on head-node of the cluster. I will soon let it the user >>>>> submit the job from its own machine after i solve the authentication >>>>> issues of Globus. >>>>> >>>>> >>>>> I think my next step is to write a sample swift code to check if we >>>>> can let swift grab input files from S3, execute it, and then write >>>>> the output files back to S3. >>>>> >>>>> >>>>> Since each Amazon virtual node only has limited storage space( Small >>>>> instance for 1.7GB, Large Instance for 7.5GB), >>>> Are you sure? That sounds like like the amount of RAM each instance >>>> gets. The last time I used EC2 (more than a year ago), each instance >>>> had a disk of 100GB+ each, which could be treated as a scratch disk >>>> that would be lost when the instance was powered down. >>>>> we may need to use EBS(Elastic Block Store) to storage temp files >>>>> created by swift. The EBS behaved like a hard disk volume and can be >>>>> mounted by any virtual nodes. 
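[Mike's pull/operate/push model, including the "only read direct when the data is too big" case, is easy to picture as a wrapper around each job. This is an illustration of the idea only, not Swift's actual wrapper script; the paths, the 10 GB threshold, and the application name are invented for the example:

    #!/bin/bash
    # Sketch: cache small inputs on local disk, read oversized ones in place.
    CACHE=/mnt/scratch/cache ; mkdir -p $CACHE
    SHARED=/mnt/shared                    # site shared filesystem (assumed mount)
    LIMIT=$((10*1024*1024*1024))          # 10 GB: above this, do not copy

    stage_in() {                          # echo a local path for input $1
        local src=$SHARED/$1 dst=$CACHE/$1
        local size=$(stat -c %s "$src")
        if [ "$size" -gt "$LIMIT" ]; then
            echo "$src"                   # too big: operate on it in place
        else
            [ -f "$dst" ] || cp "$src" "$dst"   # pull once, reuse across jobs
            echo "$dst"
        fi
    }

    in=$(stage_in data.in)
    $HOME/myapp "$in" data.out            # operate on local disk
    cp data.out $SHARED/data.out          # push the result back
]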
But here arise a problem, since a >>>>> volume can only be attached to one instance at a time[1], so the >>>>> files store in EBS mounted by one node won't be shared by any other >>>>> nodes; >>>> The EBS sounds like the same thing as the local disk, except that it >>>> is persisted in S3, and can be recovered when the instance is started >>>> again later. >>>> >>>> You can use these local disks, or EBS disks, to create a >>>> shared/parallel file system, or manage them yourself. >>>>> we lost the file sharing ability which I think is a fundamental >>>>> requirement for swift. >>>> The last time I worked with the Workspace Service (now part of >>>> Nimbus), we were able to use NFS to create a shared file system across >>>> our virtual cluster. This allowed us to run Swift without >>>> modifications on our virtual cluster. Some of the more recent work on >>>> Swift might have eased up some of the requirements for a shared file >>>> system, so you might be able to run Swift without a shared file >>>> system, if you configure Swift just right. Mike, is this true, or we >>>> aren't quite there yet? >>>> >>>> Ioan >>>>> >>>>> -Yi >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Michael Wilde wrote: >>>>>> >>>>>> On 5/7/09 12:54 PM, Tim Freeman wrote: >>>>>>> On Thu, 07 May 2009 11:39:40 -0500 >>>>>>> Yi Zhu wrote: >>>>>>> >>>>>>>> Michael Wilde wrote: >>>>>>>>> Very good! >>>>>>>>> >>>>>>>>> Now, what kind of tests can you do next? >>>>>>>> Next, I will try to let swift running on Amazon EC2. >>>>>>>> >>>>>>>>> Can you exercise the cluster with an interesting workflow? >>>>>>>> Yes, Is there any complex sample/tools i can use (rahter than >>>>>>>> first.swift) to test swift performance? Is there any benchmark >>>>>>>> available i can compare with? >>>>>>>> >>>>>>>>> How large of a cluster can you assemble in a Nimbus workspace ? >>>>>>>> Since the vm-image i use to test 'swift' is based on NFS shared >>>>>>>> file system, the performance may not be satisfiable if the we have >>>>>>>> a large scale of cluster. After I got the swift running on Amazon >>>>>>>> EC2, I will try to make a dedicate vm-image by using GPFS or any >>>>>>>> other shared file system you recommended. >>>>>>>> >>>>>>>> >>>>>>>>> Can you aggregate VM's from a few different physical clusters >>>>>>>>> into one Nimbus workspace? >>>>>>>> I don't think so. Tim may make commit on it. >>>>>>> There is some work going on right now making auto-configuration >>>>>>> easier to do >>>>>>> over multiple clusters (that is possible now, it's just very >>>>>>> 'manual' and >>>>>>> non-ideal unlike with one physical cluster). You wouldn't really >>>>>>> want to do NFS >>>>>>> across a WAN, though. >>>>>> Indeed. Now that I think this through more clearly, one workspace == >>>>>> one cluster == one Swift "site", so we could aggregate the resources >>>>>> of multiple workspaces through Swift to execute a multi-site workflow. >>>>>> >>>>>> - Mike >>>>>> >>>>>>>>> What's the largest cluster you can assemble with Nimbus? >>>>>>>> I am not quite sure,I will do some test onto it soon. since it is >>>>>>>> a EC2-like cloud, it should easily be configured as a cluster with >>>>>>>> hundreds of nodes. Tim may make commit on it. >>>>>>> I've heard of EC2 deployments in the 1000s at once, it's up to your >>>>>>> EC2 account >>>>>>> limitations (they seem pretty efficient with making your quota >>>>>>> ever-higher). >>>>>>> Nimbus installation at Teraport maxes out at 16, there are other >>>>>>> 'science >>>>>>> clouds' but I don't know their node numbers. 
EC2 is the place >>>>>>> where you will >>>>>>> be able to really test scaling something. >>>>>>> >>>>>>> Tim >>>>> From wilde at mcs.anl.gov Tue May 12 12:10:36 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 12 May 2009 12:10:36 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A09AC17.6000008@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> <4A09A1A5.3080900@cs.uchicago.edu> <4A09A4A5.3060506@mcs.anl.gov> <4A09A6D8.1090707@cs.uchicago.edu> <20090512115329.1be32f10@sietch> <4A09AC17.6000008@mcs.anl.gov> Message-ID: <4A09AD8C.7080508@mcs.anl.gov> On 5/12/09 12:04 PM, Michael Wilde wrote: > Indeed. > > I should note that the intent in Swift is to reduce or eliminate > accesses to shared filesystems, so EBS and S3 would become similar. > > The basic model of running a job would be: pull in the data you need, > operate on it on local disk, push it back. Cache locally where possible. > Only read direct from shared storage when the data is too big. > > This is pretty much what happens now, except that the "pull" and "push" > are staged via a site shared filesystem. (and that it doesnt do any size discrimination, I should add) > > - Mike > > > On 5/12/09 11:53 AM, Tim Freeman wrote: >> On Tue, 12 May 2009 11:42:00 -0500 >> yizhu wrote: >> >>> That's great. It makes file management between EC2 and S3 much >>> easier, I will definitely check it out. >>> >> >> Note that every I/O op to S3 will be significantly slower than EBS (and >> probably also slower than the local scratch disk). Add in FUSE >> (userspace) and >> that only compounds the problem. So if you use this option I would >> suggest >> taking measurements to make sure it is acceptable for the task at hand. >> >> Tim >> >>> Michael Wilde wrote: >>>> Maybe relevant - there's a FUSE filesystem to mount an S3 bucket as >>>> a filesystem: >>>> >>>> http://code.google.com/p/s3fs/wiki/FuseOverAmazon >>>> >>>> - Mike >>>> >>>> >>>> On 5/12/09 11:19 AM, Ioan Raicu wrote: >>>>> Hi, >>>>> >>>>> yizhu wrote: >>>>>> Hi, >>>>>> >>>>>> I've now got the swift system running on Amazon EC2 with swift >>>>>> installed on head-node of the cluster. I will soon let it the user >>>>>> submit the job from its own machine after i solve the >>>>>> authentication issues of Globus. >>>>>> >>>>>> >>>>>> I think my next step is to write a sample swift code to check if >>>>>> we can let swift grab input files from S3, execute it, and then >>>>>> write the output files back to S3. >>>>>> >>>>>> >>>>>> Since each Amazon virtual node only has limited storage space( >>>>>> Small instance for 1.7GB, Large Instance for 7.5GB), >>>>> Are you sure? That sounds like like the amount of RAM each instance >>>>> gets. 
The last time I used EC2 (more than a year ago), each >>>>> instance had a disk of 100GB+ each, which could be treated as a >>>>> scratch disk that would be lost when the instance was powered down. >>>>>> we may need to use EBS(Elastic Block Store) to storage temp files >>>>>> created by swift. The EBS behaved like a hard disk volume and can >>>>>> be mounted by any virtual nodes. But here arise a problem, since a >>>>>> volume can only be attached to one instance at a time[1], so the >>>>>> files store in EBS mounted by one node won't be shared by any >>>>>> other nodes; >>>>> The EBS sounds like the same thing as the local disk, except that >>>>> it is persisted in S3, and can be recovered when the instance is >>>>> started again later. >>>>> >>>>> You can use these local disks, or EBS disks, to create a >>>>> shared/parallel file system, or manage them yourself. >>>>>> we lost the file sharing ability which I think is a fundamental >>>>>> requirement for swift. >>>>> The last time I worked with the Workspace Service (now part of >>>>> Nimbus), we were able to use NFS to create a shared file system >>>>> across our virtual cluster. This allowed us to run Swift without >>>>> modifications on our virtual cluster. Some of the more recent work >>>>> on Swift might have eased up some of the requirements for a shared >>>>> file system, so you might be able to run Swift without a shared >>>>> file system, if you configure Swift just right. Mike, is this true, >>>>> or we aren't quite there yet? >>>>> >>>>> Ioan >>>>>> >>>>>> -Yi >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> Michael Wilde wrote: >>>>>>> >>>>>>> On 5/7/09 12:54 PM, Tim Freeman wrote: >>>>>>>> On Thu, 07 May 2009 11:39:40 -0500 >>>>>>>> Yi Zhu wrote: >>>>>>>> >>>>>>>>> Michael Wilde wrote: >>>>>>>>>> Very good! >>>>>>>>>> >>>>>>>>>> Now, what kind of tests can you do next? >>>>>>>>> Next, I will try to let swift running on Amazon EC2. >>>>>>>>> >>>>>>>>>> Can you exercise the cluster with an interesting workflow? >>>>>>>>> Yes, Is there any complex sample/tools i can use (rahter than >>>>>>>>> first.swift) to test swift performance? Is there any benchmark >>>>>>>>> available i can compare with? >>>>>>>>> >>>>>>>>>> How large of a cluster can you assemble in a Nimbus workspace ? >>>>>>>>> Since the vm-image i use to test 'swift' is based on NFS shared >>>>>>>>> file system, the performance may not be satisfiable if the we >>>>>>>>> have a large scale of cluster. After I got the swift running on >>>>>>>>> Amazon EC2, I will try to make a dedicate vm-image by using >>>>>>>>> GPFS or any other shared file system you recommended. >>>>>>>>> >>>>>>>>> >>>>>>>>>> Can you aggregate VM's from a few different physical clusters >>>>>>>>>> into one Nimbus workspace? >>>>>>>>> I don't think so. Tim may make commit on it. >>>>>>>> There is some work going on right now making auto-configuration >>>>>>>> easier to do >>>>>>>> over multiple clusters (that is possible now, it's just very >>>>>>>> 'manual' and >>>>>>>> non-ideal unlike with one physical cluster). You wouldn't >>>>>>>> really want to do NFS >>>>>>>> across a WAN, though. >>>>>>> Indeed. Now that I think this through more clearly, one workspace >>>>>>> == one cluster == one Swift "site", so we could aggregate the >>>>>>> resources of multiple workspaces through Swift to execute a >>>>>>> multi-site workflow. >>>>>>> >>>>>>> - Mike >>>>>>> >>>>>>>>>> What's the largest cluster you can assemble with Nimbus? >>>>>>>>> I am not quite sure,I will do some test onto it soon. 
since it >>>>>>>>> is a EC2-like cloud, it should easily be configured as a >>>>>>>>> cluster with hundreds of nodes. Tim may make commit on it. >>>>>>>> I've heard of EC2 deployments in the 1000s at once, it's up to >>>>>>>> your EC2 account >>>>>>>> limitations (they seem pretty efficient with making your quota >>>>>>>> ever-higher). >>>>>>>> Nimbus installation at Teraport maxes out at 16, there are other >>>>>>>> 'science >>>>>>>> clouds' but I don't know their node numbers. EC2 is the place >>>>>>>> where you will >>>>>>>> be able to really test scaling something. >>>>>>>> >>>>>>>> Tim >>>>>> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From tfreeman at mcs.anl.gov Tue May 12 12:27:08 2009 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Tue, 12 May 2009 12:27:08 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A09AC17.6000008@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> <4A09A1A5.3080900@cs.uchicago.edu> <4A09A4A5.3060506@mcs.anl.gov> <4A09A6D8.1090707@cs.uchicago.edu> <20090512115329.1be32f10@sietch> <4A09AC17.6000008@mcs.anl.gov> Message-ID: <20090512122708.4142d55c@sietch> On Tue, 12 May 2009 12:04:23 -0500 Michael Wilde wrote: > Indeed. > > I should note that the intent in Swift is to reduce or eliminate > accesses to shared filesystems, so EBS and S3 would become similar. Is that for speed or less moving parts? I think EBS is the fastest option they have for disk space (faster than local disk), fyi. And just to make sure everyone is on same page they are not shared in and of themselves. They are much like any block device that you mount to an OS. Multiple EC2 instances can not mount them simultaneously. They are durable (exist without ec2 instances attached to them), portable (can be mounted by other instances if the current instance dies etc), snapshottable (with explicit save to S3, it is not the same thing as S3), RAID-able (one instance can mount multiple EBS volumes and get even more performance), but not shared in the traditional sense of the word. 
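[Tim's "RAID-able" point is worth spelling out: a single instance can stripe several attached EBS volumes together for extra bandwidth. A sketch with mdadm, assuming two volumes are already attached as /dev/sdf and /dev/sdg:

    # stripe two attached EBS volumes into one faster block device
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdf /dev/sdg
    mkfs.ext3 /dev/md0
    mkdir -p /mnt/fast && mount /dev/md0 /mnt/fast
]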
Tim From foster at anl.gov Tue May 12 12:30:40 2009 From: foster at anl.gov (Ian Foster) Date: Tue, 12 May 2009 12:30:40 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <20090512122708.4142d55c@sietch> References: <49EE1156.3090107@mcs.anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> <4A09A1A5.3080900@cs.uchicago.edu> <4A09A4A5.3060506@mcs.anl.gov> <4A09A6D8.1090707@cs.uchicago.edu> <20090512115329.1be32f10@sietch> <4A09AC17.6000008@mcs.anl.gov> <20090512122708.4142d55c@sietch> Message-ID: Tim: Can you "share" data by mounting an EBS volume on node A, sticking data on it, then unmounting the volume and mounting it on node B? Ian. On May 12, 2009, at 12:27 PM, Tim Freeman wrote: > On Tue, 12 May 2009 12:04:23 -0500 > Michael Wilde wrote: > >> Indeed. >> >> I should note that the intent in Swift is to reduce or eliminate >> accesses to shared filesystems, so EBS and S3 would become similar. > > Is that for speed or less moving parts? I think EBS is the fastest > option they > have for disk space (faster than local disk), fyi. > > And just to make sure everyone is on same page they are not shared > in and of > themselves. They are much like any block device that you mount to > an OS. > Multiple EC2 instances can not mount them simultaneously. > > They are durable (exist without ec2 instances attached to them), > portable (can > be mounted by other instances if the current instance dies etc), > snapshottable > (with explicit save to S3, it is not the same thing as S3), RAID- > able (one > instance can mount multiple EBS volumes and get even more > performance), but not > shared in the traditional sense of the word. > > Tim From foster at anl.gov Tue May 12 12:31:51 2009 From: foster at anl.gov (Ian Foster) Date: Tue, 12 May 2009 12:31:51 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <20090512115329.1be32f10@sietch> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> <4A09A1A5.3080900@cs.uchicago.edu> <4A09A4A5.3060506@mcs.anl.gov> <4A09A6D8.1090707@cs.uchicago.edu> <20090512115329.1be32f10@sietch> Message-ID: <1CB0FBBE-091B-4E99-A03A-631A518BF1E5@anl.gov> Tim: Yes, running experiments is going to be an important early task. Ian. 
On May 12, 2009, at 11:53 AM, Tim Freeman wrote: > On Tue, 12 May 2009 11:42:00 -0500 > yizhu wrote: > >> That's great. It makes file management between EC2 and S3 much >> easier, I >> will definitely check it out. >> > > Note that every I/O op to S3 will be significantly slower than EBS > (and > probably also slower than the local scratch disk). Add in FUSE > (userspace) and > that only compounds the problem. So if you use this option I would > suggest > taking measurements to make sure it is acceptable for the task at > hand. > > Tim > >> Michael Wilde wrote: >>> Maybe relevant - there's a FUSE filesystem to mount an S3 bucket >>> as a >>> filesystem: >>> >>> http://code.google.com/p/s3fs/wiki/FuseOverAmazon >>> >>> - Mike >>> >>> >>> On 5/12/09 11:19 AM, Ioan Raicu wrote: >>>> Hi, >>>> >>>> yizhu wrote: >>>>> Hi, >>>>> >>>>> I've now got the swift system running on Amazon EC2 with swift >>>>> installed on head-node of the cluster. I will soon let it the user >>>>> submit the job from its own machine after i solve the >>>>> authentication >>>>> issues of Globus. >>>>> >>>>> >>>>> I think my next step is to write a sample swift code to check if >>>>> we >>>>> can let swift grab input files from S3, execute it, and then write >>>>> the output files back to S3. >>>>> >>>>> >>>>> Since each Amazon virtual node only has limited storage >>>>> space( Small >>>>> instance for 1.7GB, Large Instance for 7.5GB), >>>> Are you sure? That sounds like like the amount of RAM each instance >>>> gets. The last time I used EC2 (more than a year ago), each >>>> instance >>>> had a disk of 100GB+ each, which could be treated as a scratch disk >>>> that would be lost when the instance was powered down. >>>>> we may need to use EBS(Elastic Block Store) to storage temp files >>>>> created by swift. The EBS behaved like a hard disk volume and >>>>> can be >>>>> mounted by any virtual nodes. But here arise a problem, since a >>>>> volume can only be attached to one instance at a time[1], so the >>>>> files store in EBS mounted by one node won't be shared by any >>>>> other >>>>> nodes; >>>> The EBS sounds like the same thing as the local disk, except that >>>> it >>>> is persisted in S3, and can be recovered when the instance is >>>> started >>>> again later. >>>> >>>> You can use these local disks, or EBS disks, to create a >>>> shared/parallel file system, or manage them yourself. >>>>> we lost the file sharing ability which I think is a fundamental >>>>> requirement for swift. >>>> The last time I worked with the Workspace Service (now part of >>>> Nimbus), we were able to use NFS to create a shared file system >>>> across >>>> our virtual cluster. This allowed us to run Swift without >>>> modifications on our virtual cluster. Some of the more recent >>>> work on >>>> Swift might have eased up some of the requirements for a shared >>>> file >>>> system, so you might be able to run Swift without a shared file >>>> system, if you configure Swift just right. Mike, is this true, or >>>> we >>>> aren't quite there yet? >>>> >>>> Ioan >>>>> >>>>> >>>>> -Yi >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Michael Wilde wrote: >>>>>> >>>>>> >>>>>> On 5/7/09 12:54 PM, Tim Freeman wrote: >>>>>>> On Thu, 07 May 2009 11:39:40 -0500 >>>>>>> Yi Zhu wrote: >>>>>>> >>>>>>>> Michael Wilde wrote: >>>>>>>>> Very good! >>>>>>>>> >>>>>>>>> Now, what kind of tests can you do next? >>>>>>>> Next, I will try to let swift running on Amazon EC2. >>>>>>>> >>>>>>>>> Can you exercise the cluster with an interesting workflow? 
>>>>>>>> Yes, Is there any complex sample/tools i can use (rahter than >>>>>>>> first.swift) to test swift performance? Is there any benchmark >>>>>>>> available i can compare with? >>>>>>>> >>>>>>>>> How large of a cluster can you assemble in a Nimbus >>>>>>>>> workspace ? >>>>>>>> Since the vm-image i use to test 'swift' is based on NFS shared >>>>>>>> file system, the performance may not be satisfiable if the we >>>>>>>> have >>>>>>>> a large scale of cluster. After I got the swift running on >>>>>>>> Amazon >>>>>>>> EC2, I will try to make a dedicate vm-image by using GPFS or >>>>>>>> any >>>>>>>> other shared file system you recommended. >>>>>>>> >>>>>>>> >>>>>>>>> Can you aggregate VM's from a few different physical clusters >>>>>>>>> into one Nimbus workspace? >>>>>>>> I don't think so. Tim may make commit on it. >>>>>>> >>>>>>> There is some work going on right now making auto-configuration >>>>>>> easier to do >>>>>>> over multiple clusters (that is possible now, it's just very >>>>>>> 'manual' and >>>>>>> non-ideal unlike with one physical cluster). You wouldn't >>>>>>> really >>>>>>> want to do NFS >>>>>>> across a WAN, though. >>>>>> >>>>>> Indeed. Now that I think this through more clearly, one >>>>>> workspace == >>>>>> one cluster == one Swift "site", so we could aggregate the >>>>>> resources >>>>>> of multiple workspaces through Swift to execute a multi-site >>>>>> workflow. >>>>>> >>>>>> - Mike >>>>>> >>>>>>>> >>>>>>>>> What's the largest cluster you can assemble with Nimbus? >>>>>>>> I am not quite sure,I will do some test onto it soon. since >>>>>>>> it is >>>>>>>> a EC2-like cloud, it should easily be configured as a cluster >>>>>>>> with >>>>>>>> hundreds of nodes. Tim may make commit on it. >>>>>>> >>>>>>> I've heard of EC2 deployments in the 1000s at once, it's up to >>>>>>> your >>>>>>> EC2 account >>>>>>> limitations (they seem pretty efficient with making your quota >>>>>>> ever-higher). >>>>>>> Nimbus installation at Teraport maxes out at 16, there are other >>>>>>> 'science >>>>>>> clouds' but I don't know their node numbers. EC2 is the place >>>>>>> where you will >>>>>>> be able to really test scaling something. >>>>>>> >>>>>>> Tim >>>>>> >>>>> >>>>> >>>> >>> >> -------------- next part -------------- An HTML attachment was scrubbed... 
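For concreteness, the FUSE route mentioned above might look roughly like the following. This is a sketch only: the bucket name and mount point are placeholders, and it assumes S3 credentials are already configured wherever your s3fs build expects them.

  mkdir -p /mnt/s3bucket
  s3fs mybucket /mnt/s3bucket           # mount the S3 bucket via FUSE
  time cp big-input.dat /mnt/s3bucket/  # crude timing check, per the measurement advice above
  fusermount -u /mnt/s3bucket           # unmount when done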
From tfreeman at mcs.anl.gov Tue May 12 12:38:40 2009
From: tfreeman at mcs.anl.gov (Tim Freeman)
Date: Tue, 12 May 2009 12:38:40 -0500
Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster)
In-Reply-To:
Message-ID: <20090512123840.68c4c4b5@sietch>

On Tue, 12 May 2009 12:30:40 -0500
Ian Foster wrote:

> Tim:
>
> Can you "share" data by mounting an EBS volume on node A, sticking
> data on it, then unmounting the volume and mounting it on node B?

Yes, as long as node B is in the same EC2 "availability zone." People set up
something like that often: "production instance fails, hot backup instance
takes over with the production data intact". The only difference from your
scenario is that when an instance fails there, it detaches 'violently' from
the EBS volume. There can be sync issues in the violent case (OS buffer
cache --> disk).

Tim

> Ian.
>
> On May 12, 2009, at 12:27 PM, Tim Freeman wrote:
>
> > On Tue, 12 May 2009 12:04:23 -0500
> > Michael Wilde wrote:
> >
> >> Indeed.
> >>
> >> I should note that the intent in Swift is to reduce or eliminate
> >> accesses to shared filesystems, so EBS and S3 would become similar.
> >
> > Is that for speed or less moving parts? I think EBS is the fastest
> > option they have for disk space (faster than local disk), fyi.
> >
> > And just to make sure everyone is on same page they are not shared
> > in and of themselves. They are much like any block device that you
> > mount to an OS. Multiple EC2 instances can not mount them
> > simultaneously.
> >
> > They are durable (exist without ec2 instances attached to them),
> > portable (can be mounted by other instances if the current instance
> > dies etc), snapshottable (with explicit save to S3, it is not the same
> > thing as S3), RAID-able (one instance can mount multiple EBS volumes
> > and get even more performance), but not shared in the traditional
> > sense of the word.
> > Tim

From iraicu at cs.uchicago.edu Tue May 12 15:19:14 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 12 May 2009 15:19:14 -0500
Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster)
In-Reply-To: <20090512122708.4142d55c@sietch>
Message-ID: <4A09D9C2.30609@cs.uchicago.edu>

Hi Tim,

Tim Freeman wrote:
> Is that for speed or less moving parts? I think EBS is the fastest
> option they have for disk space (faster than local disk), fyi.

Can you elaborate more on how EBS can be faster than local disk? I know
nothing about EBS, but there are only 2 ways it can work. Since it is
persistent, it likely lives on S3 (or something similar). When it is
mounted, perhaps it still lives remotely, or perhaps it gets copied to a
local disk and runs locally. A locally running EBS should be comparable in
speed to a raw local disk, perhaps a bit slower for yet another layer of
abstraction. If EBS lives remotely, perhaps in a completely idle EC2 cloud,
a remote EBS might perform better than a local disk (assuming network
communication is lighter-weight than SATA/PATA bus operations), but I can't
imagine how this can hold true in a large-scale and loaded cloud scenario,
where shared infrastructure (network, S3, etc.) can become congested under
load.

I don't see how EBS can be faster than local disk. Can you elaborate more
on this claim?

Thanks,
Ioan

-- 
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E.
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From tfreeman at mcs.anl.gov Tue May 12 15:38:52 2009 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Tue, 12 May 2009 15:38:52 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A09D9C2.30609@cs.uchicago.edu> References: <49EE1156.3090107@mcs.anl.gov> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> <4A09A1A5.3080900@cs.uchicago.edu> <4A09A4A5.3060506@mcs.anl.gov> <4A09A6D8.1090707@cs.uchicago.edu> <20090512115329.1be32f10@sietch> <4A09AC17.6000008@mcs.anl.gov> <20090512122708.4142d55c@sietch> <4A09D9C2.30609@cs.uchicago.edu> Message-ID: <20090512153852.29c68e5c@sietch> On Tue, 12 May 2009 15:19:14 -0500 Ioan Raicu wrote: > Hi Tim, > > Tim Freeman wrote: > > Is that for speed or less moving parts? I think EBS is the fastest option > > they have for disk space (faster than local disk), fyi. > > > > > > > Can you elaborate more on how EBS, can be faster than local disk? I know > nothing about EBS, but there are only 2 ways it can work. Since it is > persistent, it likely lives on S3 (or something similar). When it is > mounted, perhaps it still lives remotely, or perhaps it gets copied to a > local disk, and runs locally. A locally running EBS, should be > comparable in speed to a raw local disk, perhaps a bit slower for yet > another layer of abstraction. If EBS lives remotely, perhaps in a > completely idle EC2 cloud, a remote EBS might perform better than a > local disk (assuming network communication is lighter-weight than > SATA/PATA bus operations), but I can't imagine how this can hold true in > a large scale and loaded cloud scenario, where shared infrastrucutre > (network, S3, etc) can become congested under load. > > I don't see how EBS can be faster than local disk. Can you elaborate > more on this claim? They state that it is purely network based (they do not list the technologies they use but this could be infiniband, iscsi, etc.). It's not uncommon to see a SAN etc. faster than a local disk... I made my statements based on Amazon documentation: "The latency and throughput of Amazon EBS volumes is designed to be significantly better than the Amazon EC2 instance stores in nearly all cases." 
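A crude way to check this on a given instance type is a streaming-write comparison like the sketch below; the paths are placeholders, and, as noted just below, dd numbers are not representative of real workloads:

  # sequential write: local (ephemeral) disk vs. an attached EBS volume
  dd if=/dev/zero of=/mnt/local-scratch/ddtest bs=1M count=1024 conv=fsync
  dd if=/dev/zero of=/mnt/ebs-volume/ddtest bs=1M count=1024 conv=fsync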
I just googled and found this person showing that EBS wins on medium and high powered instances (although someone comments at the end that 'dd' tests are not the best thing to measure why EBS is better): http://developer.amazonwebservices.com/connect/message.jspa?messageID=125197 Tim From tfreeman at mcs.anl.gov Tue May 12 15:43:40 2009 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Tue, 12 May 2009 15:43:40 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <20090512153852.29c68e5c@sietch> References: <49EE1156.3090107@mcs.anl.gov> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> <4A09A1A5.3080900@cs.uchicago.edu> <4A09A4A5.3060506@mcs.anl.gov> <4A09A6D8.1090707@cs.uchicago.edu> <20090512115329.1be32f10@sietch> <4A09AC17.6000008@mcs.anl.gov> <20090512122708.4142d55c@sietch> <4A09D9C2.30609@cs.uchicago.edu> <20090512153852.29c68e5c@sietch> Message-ID: <20090512154340.60b88b3d@sietch> On Tue, 12 May 2009 15:38:52 -0500 Tim Freeman wrote: > On Tue, 12 May 2009 15:19:14 -0500 > Ioan Raicu wrote: > > > Hi Tim, > > > > Tim Freeman wrote: > > > Is that for speed or less moving parts? I think EBS is the fastest option > > > they have for disk space (faster than local disk), fyi. > > > > > > > > > > > Can you elaborate more on how EBS, can be faster than local disk? I know > > nothing about EBS, but there are only 2 ways it can work. Since it is > > persistent, it likely lives on S3 (or something similar). When it is > > mounted, perhaps it still lives remotely, or perhaps it gets copied to a > > local disk, and runs locally. A locally running EBS, should be > > comparable in speed to a raw local disk, perhaps a bit slower for yet > > another layer of abstraction. If EBS lives remotely, perhaps in a > > completely idle EC2 cloud, a remote EBS might perform better than a > > local disk (assuming network communication is lighter-weight than > > SATA/PATA bus operations), but I can't imagine how this can hold true in > > a large scale and loaded cloud scenario, where shared infrastrucutre > > (network, S3, etc) can become congested under load. > > > > I don't see how EBS can be faster than local disk. Can you elaborate > > more on this claim? > > They state that it is purely network based (they do not list the technologies > they use but this could be infiniband, iscsi, etc.). It's not uncommon to see > a SAN etc. faster than a local disk... > > I made my statements based on Amazon documentation: > > "The latency and throughput of Amazon EBS volumes is designed to be > significantly better than the Amazon EC2 instance stores in nearly all > cases." 
> > I just googled and found this person showing that EBS wins on medium and high > powered instances (although someone comments at the end that 'dd' tests are > not the best thing to measure why EBS is better): > > http://developer.amazonwebservices.com/connect/message.jspa?messageID=125197 Also, as I was saying, using RAID can make it even better, found some number for that: http://af-design.com/blog/2009/02/27/amazon-ec2-disk-performance/ Tim From iraicu at cs.uchicago.edu Tue May 12 15:59:50 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 12 May 2009 15:59:50 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <20090512154340.60b88b3d@sietch> References: <49EE1156.3090107@mcs.anl.gov> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> <4A09A1A5.3080900@cs.uchicago.edu> <4A09A4A5.3060506@mcs.anl.gov> <4A09A6D8.1090707@cs.uchicago.edu> <20090512115329.1be32f10@sietch> <4A09AC17.6000008@mcs.anl.gov> <20090512122708.4142d55c@sietch> <4A09D9C2.30609@cs.uchicago.edu> <20090512153852.29c68e5c@sietch> <20090512154340.60b88b3d@sietch> Message-ID: <4A09E346.80706@cs.uchicago.edu> I see the numbers, but my instinct says that they can't scale EBS performance linearly with every EC2 instance they can start. Perhaps EBS is currently underutilized, so much that the EBS servers are relatively idle. Once EBS becomes really popular, or if a single large user runs EBS to many EC2 instances and runs data intensive workloads, I bet EBS performance will start to suffer (in comparison to local disk performance). It is certainly an interesting dimension to explore, at what scale will local disks outperform EBS. Ioan Tim Freeman wrote: > On Tue, 12 May 2009 15:38:52 -0500 > Tim Freeman wrote: > > >> On Tue, 12 May 2009 15:19:14 -0500 >> Ioan Raicu wrote: >> >> >>> Hi Tim, >>> >>> Tim Freeman wrote: >>> >>>> Is that for speed or less moving parts? I think EBS is the fastest option >>>> they have for disk space (faster than local disk), fyi. >>>> >>>> >>>> >>>> >>> Can you elaborate more on how EBS, can be faster than local disk? I know >>> nothing about EBS, but there are only 2 ways it can work. Since it is >>> persistent, it likely lives on S3 (or something similar). When it is >>> mounted, perhaps it still lives remotely, or perhaps it gets copied to a >>> local disk, and runs locally. A locally running EBS, should be >>> comparable in speed to a raw local disk, perhaps a bit slower for yet >>> another layer of abstraction. If EBS lives remotely, perhaps in a >>> completely idle EC2 cloud, a remote EBS might perform better than a >>> local disk (assuming network communication is lighter-weight than >>> SATA/PATA bus operations), but I can't imagine how this can hold true in >>> a large scale and loaded cloud scenario, where shared infrastrucutre >>> (network, S3, etc) can become congested under load. >>> >>> I don't see how EBS can be faster than local disk. Can you elaborate >>> more on this claim? 
>>
>> They state that it is purely network based (they do not list the
>> technologies they use but this could be infiniband, iscsi, etc.). It's
>> not uncommon to see a SAN etc. faster than a local disk...
>>
>> I made my statements based on Amazon documentation:
>>
>> "The latency and throughput of Amazon EBS volumes is designed to be
>> significantly better than the Amazon EC2 instance stores in nearly all
>> cases."
>>
>> I just googled and found this person showing that EBS wins on medium and
>> high powered instances (although someone comments at the end that 'dd'
>> tests are not the best thing to measure why EBS is better):
>>
>> http://developer.amazonwebservices.com/connect/message.jspa?messageID=125197
>
> Also, as I was saying, using RAID can make it even better, found some
> number for that:
>
> http://af-design.com/blog/2009/02/27/amazon-ec2-disk-performance/
>
> Tim

-- 
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From iraicu at cs.uchicago.edu Tue May 12 16:11:56 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 12 May 2009 16:11:56 -0500
Subject: [Swift-devel] CFP: Special issue on Scientific Workflows in IJBPIM
Message-ID: <4A09E61C.6060701@cs.uchicago.edu>

Call for Papers

Special Issue on Scientific Workflows
International Journal of Business Process Integration and Management (IJBPIM)

Description

Scientific workflows have recently emerged as a new paradigm for scientists to formalize and structure complex scientific processes to enable and accelerate many significant scientific discoveries. A scientific workflow is a formal specification of a scientific process, which represents, streamlines, and automates the analytical and computational steps that a scientist needs to go through from dataset selection and integration, computation and analysis, to final data product presentation and visualization. A scientific workflow management system (SWFMS) is a system that supports the specification, modification, execution, failure recovery, and monitoring of a scientific workflow, using the workflow logic to control the order of executing workflow tasks.

The goal of this special issue is to present critical challenges, requirements, and issues related to scientific workflows. This collection of manuscripts will discuss key aspects in the development of a broad range of novel and innovative scientific workflow technologies. The emphasis of the special issue is on critical challenges in the development of various scientific workflows, specifically as they relate to business workflow and service technologies. Particular emphasis will be placed on examples where innovative solutions to these challenges have resulted in scientific workflows which impact the scientific discovery process.

Topics include but are not limited to:

- Scientific workflow provenance management
- Scientific workflow provenance analytics
- Scientific workflow data, metadata, service, and task management
- Scientific workflow architectures, models, and languages
- Scientific workflow monitoring and failure handling
- Streaming data processing in scientific workflows
- Pipelined, data, workflow, and task parallelism in scientific workflows
- Service, Grid, or Cloud-based scientific workflows
- Data, metadata, compute, user-interaction, or visualization-intensive scientific workflows
- Scientific workflow composition
- Security issues in scientific workflows
- Data integration and service integration in scientific workflows
- Scientific workflow mapping, optimization, and scheduling
- Scientific workflow modeling, verification, and validation
- Scalability, reliability, extensibility, agility, and interoperability
- Scientific workflow real-life applications

Important dates

- July 1, 2009: paper submission
- October 1, 2009: notification
- January 1, 2010: camera-ready version
- Planned publication: middle of 2010

Guest editors

- Shiyong Lu, Wayne State University, U.S.A., Email: shiyong at wayne.edu
- Ewa Deelman, USC Information Sciences Institute, U.S.A., Email: deelman at isi.edu
- Zhiming Zhao, University of Amsterdam, the Netherlands, Email: z.zhao at uva.nl

Submission details

Submitted papers should not have been previously published nor be currently under consideration for publication elsewhere. All papers are refereed through a peer review process. Papers should be submitted to http://199.212.32.161/myreview/SubmitAbstract.php. Please also send an abstract and a copy of your paper to Shiyong at wayne.edu to ensure a reliable submission.

Contact information

All enquiries about the special issue should be sent to Shiyong Lu at shiyong at wayne.edu. For more information, see http://www.cs.wayne.edu/~shiyong/swf/ijbpim09.html.

-- 
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From tfreeman at mcs.anl.gov Tue May 12 17:24:04 2009 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Tue, 12 May 2009 17:24:04 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <4A09E346.80706@cs.uchicago.edu> References: <49EE1156.3090107@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> <4A09A1A5.3080900@cs.uchicago.edu> <4A09A4A5.3060506@mcs.anl.gov> <4A09A6D8.1090707@cs.uchicago.edu> <20090512115329.1be32f10@sietch> <4A09AC17.6000008@mcs.anl.gov> <20090512122708.4142d55c@sietch> <4A09D9C2.30609@cs.uchicago.edu> <20090512153852.29c68e5c@sietch> <20090512154340.60b88b3d@sietch> <4A09E346.80706@cs.uchicago.edu> Message-ID: <20090512172404.0eb27aa3@sietch> On Tue, 12 May 2009 15:59:50 -0500 Ioan Raicu wrote: > I see the numbers, but my instinct says that they can't scale EBS > performance linearly with every EC2 instance they can start. Perhaps EBS > is currently underutilized, so much that the EBS servers are relatively > idle. Well, they charge extra money when you use it, presumably using some kind of cost-plus model. If demand (relative to raw instance demand) increases noticeably, I'd bet they will then be able to afford to add more EBS capacity to keep up. They explicitly advertise it as better than local disk (and AWS has an excellent track record of doing what they say). You may be right about not scaling with every EC2 instance they can start at the moment, but that doesn't really matter when you have 100ks of customers and nodes, you just need to look at trends and plan accordingly. Tim From tfreeman at mcs.anl.gov Tue May 12 17:26:49 2009 From: tfreeman at mcs.anl.gov (Tim Freeman) Date: Tue, 12 May 2009 17:26:49 -0500 Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: <20090512172404.0eb27aa3@sietch> References: <49EE1156.3090107@mcs.anl.gov> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> <4A09A1A5.3080900@cs.uchicago.edu> <4A09A4A5.3060506@mcs.anl.gov> <4A09A6D8.1090707@cs.uchicago.edu> <20090512115329.1be32f10@sietch> <4A09AC17.6000008@mcs.anl.gov> <20090512122708.4142d55c@sietch> <4A09D9C2.30609@cs.uchicago.edu> <20090512153852.29c68e5c@sietch> <20090512154340.60b88b3d@sietch> <4A09E346.80706@cs.uchicago.edu> <20090512172404.0eb27aa3@sietch> Message-ID: <20090512172649.04bd5c07@sietch> On Tue, 12 May 2009 17:24:04 -0500 Tim Freeman wrote: > On Tue, 12 May 2009 15:59:50 -0500 > Ioan Raicu wrote: > > > I see the numbers, but my instinct says that they can't scale EBS > > performance linearly with every EC2 instance they can start. Perhaps EBS > > is currently underutilized, so much that the EBS servers are relatively > > idle. 
> > Well, they charge extra money when you use it, presumably using some
> > kind of cost-plus model. If demand (relative to raw instance demand)
> > increases noticeably, I'd bet they will then be able to afford to add
> > more EBS capacity to keep up. They explicitly advertise it as better
> > than local disk (and AWS has an excellent track record of doing what
> > they say).
> >
> > You may be right about not scaling with every EC2 instance they can
> > start at the moment, but that doesn't really matter when you have 100ks
> > of customers and nodes, you just need to look at trends and plan
> > accordingly.

Also, who knows, maybe they start rejecting EBS requests at some point to
maintain the target performance. I'm not going to spend the money to find
out :-)

Tim

From benc at hawaga.org.uk Wed May 13 08:53:36 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 13 May 2009 13:53:36 +0000 (GMT)
Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster)
In-Reply-To: <4A08AFF4.608@cs.uchicago.edu>
Message-ID:

On Mon, 11 May 2009, yizhu wrote:

> I think my next step is to write a sample swift code to check if we can let
> swift grab input files from S3, execute it, and then write the output files
> back to S3.

Swift won't be able to do this out of the box. In the present model, you
must have a shared filesystem which is accessible from the Swift client and
from each worker.

I've looked at making the Swift worker-side script, libexec_swiftwrap, able
to invoke arbitrary commandlines to pull in data instead of using posix cp
or ln. You would need commandline tools to transfer a file from S3 to the
local system, and from the local system to S3. Do you have those commands?
What do they look like?

> Since each Amazon virtual node only has limited storage space( Small instance
> for 1.7GB, Large Instance for 7.5GB), we may need to use EBS(Elastic Block
> Store) to storage temp files created by swift. The EBS behaved like a hard
> disk volume and can be mounted by any virtual nodes. But here arise a problem,
> since a volume can only be attached to one instance at a time[1], so the files
> store in EBS mounted by one node won't be shared by any other nodes; we lost
> the file sharing ability which I think is a fundamental requirement for swift.

For getting a probably poorly performing prototype, the fuse idea mentioned
elsewhere seems interesting. From my understanding, you should be able to
mount an EBS volume on the cluster head node and re-export it using the
existing NFS shared file system.
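A minimal sketch of that head-node arrangement, assuming the EBS volume shows up as /dev/sdf and the workers can reach the head node as 'headnode' (all names here are placeholders):

  # on the head node: format (first time only), mount, and export over NFS
  mkfs.ext3 /dev/sdf
  mkdir -p /mnt/shared
  mount /dev/sdf /mnt/shared
  echo '/mnt/shared *(rw,sync,no_root_squash)' >> /etc/exports
  exportfs -ra
  # on each worker node:
  mkdir -p /mnt/shared
  mount -t nfs headnode:/mnt/shared /mnt/shared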
-- 

From zhaozhang at uchicago.edu Wed May 13 10:59:10 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Wed, 13 May 2009 10:59:10 -0500
Subject: [Swift-devel] Error in Coaster test on Clemson-IT site
Message-ID: <4A0AEE4E.1070502@uchicago.edu>

Hi, Mihael

I got an error returned for the language-behavior test on the Clemson-IT
site with coasters. All gram logs are at
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/logs_clemson.
Could you help check what that error is?

sites.xml
[zzhang at tp-grid1 sites]$ cat coaster_test/CLEMSON-IT.xml
/scratch/data/osg/tmp/CLEMSON-IT

Here is the stdout:

testing site configuration: coaster_test/CLEMSON-IT.xml
Removing files from previous runs
Running test 061-cattwo at Wed May 13 10:27:04 CDT 2009
Swift svn swift-r2896 cog-r2392

RunID: 20090513-1027-d406ke07
Progress:
Progress: Stage in:1
Progress: Submitting:1
Progress: Submitting:1
Progress: Submitted:1
Progress: Active:1
Failed to transfer wrapper log from 061-cattwo-20090513-1027-d406ke07/info/u on CLEMSON-IT
Progress: Submitted:1
Progress: Active:1
Failed to transfer wrapper log from 061-cattwo-20090513-1027-d406ke07/info/x on CLEMSON-IT
Progress: Stage in:1
Progress: Submitted:1
Progress: Active:1
Failed to transfer wrapper log from 061-cattwo-20090513-1027-d406ke07/info/z on CLEMSON-IT
Progress: Failed:1
Execution failed:
Exception in cat:
Arguments: [061-cattwo.1.in, 061-cattwo.2.in]
Host: CLEMSON-IT
Directory: 061-cattwo-20090513-1027-d406ke07/jobs/z/cat-zsqcmqaj
stderr.txt:
stdout.txt:
----
Caused by:
Failed to start worker: Worker ended prematurely
Cleaning up...
Shutting down service at https://130.127.11.89:43784
Got channel MetaChannel: 1707082587 -> GSSSChannel-null(1)
- Done
SWIFT RETURN CODE NON-ZERO - test 061-cattwo

zhao

From benc at hawaga.org.uk Wed May 13 11:01:25 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 13 May 2009 16:01:25 +0000 (GMT)
Subject: [Swift-devel] Error in Coaster test on Clemson-IT site
In-Reply-To: <4A0AEE4E.1070502@uchicago.edu>
References: <4A0AEE4E.1070502@uchicago.edu>
Message-ID:

When you test these sites with coasters, and they fail, it would be useful
perhaps to also test them without coasters and report if that works. This
also gives more testing for the non-coaster case generally.

On Wed, 13 May 2009, Zhao Zhang wrote:

> Hi, Mihael
>
> I got an error returned for the language-behavior test on the Clemson-IT
> site with coasters. All gram logs are at
> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/logs_clemson.
> Could you help check what that error is?
> > sites.xml > [zzhang at tp-grid1 sites]$ cat coaster_test/CLEMSON-IT.xml > > > > > jobManager="gt2:gt2:condor"/> > /scratch/data/osg/tmp/CLEMSON-IT > > > > > Here is the stdout: > > testing site configuration: coaster_test/CLEMSON-IT.xml > Removing files from previous runs > Running test 061-cattwo at Wed May 13 10:27:04 CDT 2009 > Swift svn swift-r2896 cog-r2392 > > RunID: 20090513-1027-d406ke07 > Progress: > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Submitting:1 > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from 061-cattwo-20090513-1027-d406ke07/info/u > on CLEMSON-IT > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from 061-cattwo-20090513-1027-d406ke07/info/x > on CLEMSON-IT > Progress: Stage in:1 > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from 061-cattwo-20090513-1027-d406ke07/info/z > on CLEMSON-IT > Progress: Failed:1 > Execution failed: > Exception in cat: > Arguments: [061-cattwo.1.in, 061-cattwo.2.in] > Host: CLEMSON-IT > Directory: 061-cattwo-20090513-1027-d406ke07/jobs/z/cat-zsqcmqaj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Failed to start worker: Worker ended prematurely > Cleaning up... > Shutting down service at https://130.127.11.89:43784 > Got channel MetaChannel: 1707082587 -> GSSSChannel-null(1) > - Done > SWIFT RETURN CODE NON-ZERO - test 061-cattwo > > > zhao > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From zhaozhang at uchicago.edu Wed May 13 11:04:57 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 13 May 2009 11:04:57 -0500 Subject: [Swift-devel] Error in Coaster test on Clemson-IT site In-Reply-To: References: <4A0AEE4E.1070502@uchicago.edu> Message-ID: <4A0AEFA9.3050508@uchicago.edu> Ah, yes, I ran them, here are the results. [zzhang at tp-grid1 coaster_test]$ globus-job-run osggate.clemson.edu /usr/bin/id uid=145767(osg) gid=10000(cuuser) groups=10000(cuuser) [zzhang at tp-grid1 coaster_test]$ globus-job-run osggate.clemson.edu/jobmanager-condor /usr/bin/id uid=145767(osg) gid=94175(osg) groups=94175(osg) [zzhang at tp-grid1 sites]$ globus-url-copy gsiftp://osggate.clemson.edu/home/osg/logs_clemson.tar file://`pwd`/logs_clemson.tar So both globus-job-run and globus-url-copy are working. zhao Ben Clifford wrote: > When you tests these sites with coasters, and they fail, it would be > useful perhaps to also test them without coasters and report if that > works. This also gives more testing for the non-coaster case generally. > > On Wed, 13 May 2009, Zhao Zhang wrote: > > >> Hi, Mihael >> >> I got a error returned for the language-behavior test on Clemson-IT site with >> coaster. >> All gram logs are at >> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/logs_clemson. >> Could you help on checking that error there is. 
>> >> sites.xml >> [zzhang at tp-grid1 sites]$ cat coaster_test/CLEMSON-IT.xml >> >> >> >> >> > jobManager="gt2:gt2:condor"/> >> /scratch/data/osg/tmp/CLEMSON-IT >> >> >> >> >> Here is the stdout: >> >> testing site configuration: coaster_test/CLEMSON-IT.xml >> Removing files from previous runs >> Running test 061-cattwo at Wed May 13 10:27:04 CDT 2009 >> Swift svn swift-r2896 cog-r2392 >> >> RunID: 20090513-1027-d406ke07 >> Progress: >> Progress: Stage in:1 >> Progress: Submitting:1 >> Progress: Submitting:1 >> Progress: Submitted:1 >> Progress: Active:1 >> Failed to transfer wrapper log from 061-cattwo-20090513-1027-d406ke07/info/u >> on CLEMSON-IT >> Progress: Submitted:1 >> Progress: Active:1 >> Failed to transfer wrapper log from 061-cattwo-20090513-1027-d406ke07/info/x >> on CLEMSON-IT >> Progress: Stage in:1 >> Progress: Submitted:1 >> Progress: Active:1 >> Failed to transfer wrapper log from 061-cattwo-20090513-1027-d406ke07/info/z >> on CLEMSON-IT >> Progress: Failed:1 >> Execution failed: >> Exception in cat: >> Arguments: [061-cattwo.1.in, 061-cattwo.2.in] >> Host: CLEMSON-IT >> Directory: 061-cattwo-20090513-1027-d406ke07/jobs/z/cat-zsqcmqaj >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> Failed to start worker: Worker ended prematurely >> Cleaning up... >> Shutting down service at https://130.127.11.89:43784 >> Got channel MetaChannel: 1707082587 -> GSSSChannel-null(1) >> - Done >> SWIFT RETURN CODE NON-ZERO - test 061-cattwo >> >> >> zhao >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> >> > > From benc at hawaga.org.uk Wed May 13 11:11:37 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 13 May 2009 16:11:37 +0000 (GMT) Subject: [Swift-devel] Error in Coaster test on Clemson-IT site In-Reply-To: <4A0AEFA9.3050508@uchicago.edu> References: <4A0AEE4E.1070502@uchicago.edu> <4A0AEFA9.3050508@uchicago.edu> Message-ID: I meant to also make a sites file for Swift that does not use coasters. That will find some problems that I think the below tests will not find. On Wed, 13 May 2009, Zhao Zhang wrote: > Ah, yes, I ran them, here are the results. > [zzhang at tp-grid1 coaster_test]$ globus-job-run osggate.clemson.edu /usr/bin/id > uid=145767(osg) gid=10000(cuuser) groups=10000(cuuser) > [zzhang at tp-grid1 coaster_test]$ globus-job-run > osggate.clemson.edu/jobmanager-condor /usr/bin/id > uid=145767(osg) gid=94175(osg) groups=94175(osg) > > [zzhang at tp-grid1 sites]$ globus-url-copy > gsiftp://osggate.clemson.edu/home/osg/logs_clemson.tar > file://`pwd`/logs_clemson.tar > > So both globus-job-run and globus-url-copy are working. > > zhao > > Ben Clifford wrote: > > When you tests these sites with coasters, and they fail, it would be useful > > perhaps to also test them without coasters and report if that works. This > > also gives more testing for the non-coaster case generally. > > > > On Wed, 13 May 2009, Zhao Zhang wrote: > > > > > > > Hi, Mihael > > > > > > I got a error returned for the language-behavior test on Clemson-IT site > > > with > > > coaster. > > > All gram logs are at > > > /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/logs_clemson. > > > Could you help on checking that error there is. 
> > > > > > sites.xml > > > [zzhang at tp-grid1 sites]$ cat coaster_test/CLEMSON-IT.xml > > > > > > > > > > > > > > > > > jobManager="gt2:gt2:condor"/> > > > /scratch/data/osg/tmp/CLEMSON-IT > > > > > > > > > > > > > > > Here is the stdout: > > > > > > testing site configuration: coaster_test/CLEMSON-IT.xml > > > Removing files from previous runs > > > Running test 061-cattwo at Wed May 13 10:27:04 CDT 2009 > > > Swift svn swift-r2896 cog-r2392 > > > > > > RunID: 20090513-1027-d406ke07 > > > Progress: > > > Progress: Stage in:1 > > > Progress: Submitting:1 > > > Progress: Submitting:1 > > > Progress: Submitted:1 > > > Progress: Active:1 > > > Failed to transfer wrapper log from > > > 061-cattwo-20090513-1027-d406ke07/info/u > > > on CLEMSON-IT > > > Progress: Submitted:1 > > > Progress: Active:1 > > > Failed to transfer wrapper log from > > > 061-cattwo-20090513-1027-d406ke07/info/x > > > on CLEMSON-IT > > > Progress: Stage in:1 > > > Progress: Submitted:1 > > > Progress: Active:1 > > > Failed to transfer wrapper log from > > > 061-cattwo-20090513-1027-d406ke07/info/z > > > on CLEMSON-IT > > > Progress: Failed:1 > > > Execution failed: > > > Exception in cat: > > > Arguments: [061-cattwo.1.in, 061-cattwo.2.in] > > > Host: CLEMSON-IT > > > Directory: 061-cattwo-20090513-1027-d406ke07/jobs/z/cat-zsqcmqaj > > > stderr.txt: > > > > > > stdout.txt: > > > > > > ---- > > > > > > Caused by: > > > Failed to start worker: Worker ended prematurely > > > Cleaning up... > > > Shutting down service at https://130.127.11.89:43784 > > > Got channel MetaChannel: 1707082587 -> GSSSChannel-null(1) > > > - Done > > > SWIFT RETURN CODE NON-ZERO - test 061-cattwo > > > > > > > > > zhao > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > From benc at hawaga.org.uk Wed May 13 11:25:34 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 13 May 2009 16:25:34 +0000 (GMT) Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster) In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> <49EFE566.8050005@cs.uchicago.edu> <49F3D235.70507@cs.uchicago.edu> <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu> <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu> <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch> <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu> <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu> <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov> <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch> <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu> Message-ID: On Wed, 13 May 2009, Ben Clifford wrote: > > Since each Amazon virtual node only has limited storage space( Small instance > > for 1.7GB, Large Instance for 7.5GB), we may need to use EBS(Elastic Block > > Store) to storage temp files created by swift. The EBS behaved like a hard For many applications, 1.7G is a very large amount of data. 7.5G of storage for your shared filesystem and 1.7G for worker node temporary files seems like a lot to me. So I think its important to clarify: do you want to use s3 because you think it solves a storage problem, and if so, does that problem really exist now? or do you want to use s3 because it is an interesting thing to play with? 
(either way is fine, but I think it is important to clarify)

-- 

From foster at anl.gov Wed May 13 11:46:21 2009
From: foster at anl.gov (foster at anl.gov)
Date: Wed, 13 May 2009 11:46:21 -0500 (CDT)
Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster)
In-Reply-To:
Message-ID: <1281ADF3-E3D8-483C-AF98-01C3A009083F@anl.gov>

The reason for wanting to use S3 is that one pays to move data to/from
Amazon. So, we want to allow analysis of data that sits on S3, and puts its
output on S3.

Ian -- from mobile

On May 13, 2009, at 12:25 PM, Ben Clifford wrote:

> On Wed, 13 May 2009, Ben Clifford wrote:
>
>>> Since each Amazon virtual node only has limited storage space( Small instance
>>> for 1.7GB, Large Instance for 7.5GB), we may need to use EBS(Elastic Block
>>> Store) to storage temp files created by swift. The EBS behaved like a hard
>
> For many applications, 1.7G is a very large amount of data.
>
> 7.5G of storage for your shared filesystem and 1.7G for worker node
> temporary files seems like a lot to me.
>
> So I think its important to clarify: do you want to use s3 because you
> think it solves a storage problem, and if so, does that problem really
> exist now? or do you want to use s3 because it is an interesting thing to
> play with? (either way is fine, but I think it is important to clarify)
>
> --

From benc at hawaga.org.uk Wed May 13 11:58:50 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 13 May 2009 16:58:50 +0000 (GMT)
Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster)
In-Reply-To: <1281ADF3-E3D8-483C-AF98-01C3A009083F@anl.gov>
Message-ID:

OK. Specifically for that problem, then, something that makes the S3 storage
available at the swift submit side would avoid transferring data across the
$-boundary and would avoid having to mess too much with the data management
architecture.

That could be either a cog provider to pull files from s3 (the existing
dcache provider might serve as an example - it seems conceptually quite
similar); or a posix mount of s3 on the submit node (in which case the local
filesystem cog provider would work).
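As one example of what such commands could look like, a client-side tool like s3cmd (an assumption here; any S3 command-line client would do) stages data across the $-boundary once, at the submit side. Bucket and file names below are placeholders:

  s3cmd get s3://mybucket/inputs/data.in ./data.in      # pull input from S3 to the submit host
  # ... run the Swift workflow against the local copy ...
  s3cmd put ./data.out s3://mybucket/outputs/data.out   # push results back to S3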
This would not transfer data from s3 directly to the worker nodes, but that is not the problem being addressed here. On Wed, 13 May 2009, foster at anl.gov wrote: > The reason for wanting to use S3 is that one pays to move data to/from amazon. > So, we want to allow analysis of data that sits on S3, and puts it's output on > s3 > > > Ian -- from mobile > > On May 13, 2009, at 12:25 PM, Ben Clifford wrote: > > > > > On Wed, 13 May 2009, Ben Clifford wrote: > > > > > > Since each Amazon virtual node only has limited storage space( Small > > > > instance > > > > for 1.7GB, Large Instance for 7.5GB), we may need to use EBS(Elastic > > > > Block > > > > Store) to storage temp files created by swift. The EBS behaved like a > > > > hard > > > > For many applications, 1.7G is a very large amount of data. > > > > 7.5G of storage for your shared filesystem and 1.7G for worker node > > temporary files seems like a lot to me. > > > > So I think its important to clarify: do you want to use s3 because you > > think it solves a storage problem, and if so, does that problem really > > exist now? or do you want to use s3 because it is an interesting thing to > > play with? (either way is fine, but I think it is important to clarify) > > > > -- > > > From zhaozhang at uchicago.edu Wed May 13 12:38:20 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 13 May 2009 12:38:20 -0500 Subject: [Swift-devel] Error in Coaster test on Clemson-IT site In-Reply-To: References: <4A0AEE4E.1070502@uchicago.edu> <4A0AEFA9.3050508@uchicago.edu> Message-ID: <4A0B058C.9050200@uchicago.edu> ok, I ran the same test case with condor-g, it failed. [zzhang at tp-grid1 sites]$ ./run-site condor-g_back/CLEMSON-IT.xml testing site configuration: condor-g_back/CLEMSON-IT.xml Removing files from previous runs Running test 061-cattwo at Wed May 13 11:14:31 CDT 2009 Swift svn swift-r2896 cog-r2392 RunID: 20090513-1114-0tvjyz67 Progress: uninitialized:1 Progress: Stage in:1 Progress: Submitting:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Active:1 Progress: Active:1 Progress: Active:1 Progress: Failed but can retry:1 Failed to transfer wrapper log from 061-cattwo-20090513-1114-0tvjyz67/info/6 on CLEMSON-IT Progress: Stage in:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Active:1 Progress: Active:1 Progress: Checking status:1 Failed to transfer wrapper log from 061-cattwo-20090513-1114-0tvjyz67/info/8 on CLEMSON-IT Progress: Stage in:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Active:1 Progress: Active:1 Progress: Active:1 Progress: Checking status:1 Failed to transfer wrapper log from 061-cattwo-20090513-1114-0tvjyz67/info/a on CLEMSON-IT Progress: Failed:1 Execution failed: Exception in cat: Arguments: [061-cattwo.1.in, 061-cattwo.2.in] Host: CLEMSON-IT Directory: 061-cattwo-20090513-1114-0tvjyz67/jobs/a/cat-as69oqaj stderr.txt: stdout.txt: ---- Caused by: No status file was found. Check the shared filesystem on CLEMSON-IT Ben Clifford wrote: > I meant to also make a sites file for Swift that does not use coasters. > That will find some problems that I think the below tests will not find. > > On Wed, 13 May 2009, Zhao Zhang wrote: > > >> Ah, yes, I ran them, here are the results. 
>> [zzhang at tp-grid1 coaster_test]$ globus-job-run osggate.clemson.edu /usr/bin/id >> uid=145767(osg) gid=10000(cuuser) groups=10000(cuuser) >> [zzhang at tp-grid1 coaster_test]$ globus-job-run >> osggate.clemson.edu/jobmanager-condor /usr/bin/id >> uid=145767(osg) gid=94175(osg) groups=94175(osg) >> >> [zzhang at tp-grid1 sites]$ globus-url-copy >> gsiftp://osggate.clemson.edu/home/osg/logs_clemson.tar >> file://`pwd`/logs_clemson.tar >> >> So both globus-job-run and globus-url-copy are working. >> >> zhao >> >> Ben Clifford wrote: >> >>> When you tests these sites with coasters, and they fail, it would be useful >>> perhaps to also test them without coasters and report if that works. This >>> also gives more testing for the non-coaster case generally. >>> >>> On Wed, 13 May 2009, Zhao Zhang wrote: >>> >>> >>> >>>> Hi, Mihael >>>> >>>> I got a error returned for the language-behavior test on Clemson-IT site >>>> with >>>> coaster. >>>> All gram logs are at >>>> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/logs_clemson. >>>> Could you help on checking that error there is. >>>> >>>> sites.xml >>>> [zzhang at tp-grid1 sites]$ cat coaster_test/CLEMSON-IT.xml >>>> >>>> >>>> >>>> >>>> >>> jobManager="gt2:gt2:condor"/> >>>> /scratch/data/osg/tmp/CLEMSON-IT >>>> >>>> >>>> >>>> >>>> Here is the stdout: >>>> >>>> testing site configuration: coaster_test/CLEMSON-IT.xml >>>> Removing files from previous runs >>>> Running test 061-cattwo at Wed May 13 10:27:04 CDT 2009 >>>> Swift svn swift-r2896 cog-r2392 >>>> >>>> RunID: 20090513-1027-d406ke07 >>>> Progress: >>>> Progress: Stage in:1 >>>> Progress: Submitting:1 >>>> Progress: Submitting:1 >>>> Progress: Submitted:1 >>>> Progress: Active:1 >>>> Failed to transfer wrapper log from >>>> 061-cattwo-20090513-1027-d406ke07/info/u >>>> on CLEMSON-IT >>>> Progress: Submitted:1 >>>> Progress: Active:1 >>>> Failed to transfer wrapper log from >>>> 061-cattwo-20090513-1027-d406ke07/info/x >>>> on CLEMSON-IT >>>> Progress: Stage in:1 >>>> Progress: Submitted:1 >>>> Progress: Active:1 >>>> Failed to transfer wrapper log from >>>> 061-cattwo-20090513-1027-d406ke07/info/z >>>> on CLEMSON-IT >>>> Progress: Failed:1 >>>> Execution failed: >>>> Exception in cat: >>>> Arguments: [061-cattwo.1.in, 061-cattwo.2.in] >>>> Host: CLEMSON-IT >>>> Directory: 061-cattwo-20090513-1027-d406ke07/jobs/z/cat-zsqcmqaj >>>> stderr.txt: >>>> >>>> stdout.txt: >>>> >>>> ---- >>>> >>>> Caused by: >>>> Failed to start worker: Worker ended prematurely >>>> Cleaning up... >>>> Shutting down service at https://130.127.11.89:43784 >>>> Got channel MetaChannel: 1707082587 -> GSSSChannel-null(1) >>>> - Done >>>> SWIFT RETURN CODE NON-ZERO - test 061-cattwo >>>> >>>> >>>> zhao >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>>> >>>> >>>> >>> >>> >> > > From benc at hawaga.org.uk Wed May 13 12:40:15 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 13 May 2009 17:40:15 +0000 (GMT) Subject: [Swift-devel] Error in Coaster test on Clemson-IT site In-Reply-To: <4A0B058C.9050200@uchicago.edu> References: <4A0AEE4E.1070502@uchicago.edu> <4A0AEFA9.3050508@uchicago.edu> <4A0B058C.9050200@uchicago.edu> Message-ID: OK. But try with plain swift + gt2, using a element in the sites file. No condor-g, no coasters. On Wed, 13 May 2009, Zhao Zhang wrote: > ok, I ran the same test case with condor-g, it failed. 
> > [zzhang at tp-grid1 sites]$ ./run-site condor-g_back/CLEMSON-IT.xml > testing site configuration: condor-g_back/CLEMSON-IT.xml > Removing files from previous runs > Running test 061-cattwo at Wed May 13 11:14:31 CDT 2009 > Swift svn swift-r2896 cog-r2392 > > RunID: 20090513-1114-0tvjyz67 > Progress: uninitialized:1 > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Submitted:1 > Progress: Submitted:1 > Progress: Submitted:1 > Progress: Submitted:1 > Progress: Submitted:1 > Progress: Active:1 > Progress: Active:1 > Progress: Active:1 > Progress: Failed but can retry:1 > Failed to transfer wrapper log from 061-cattwo-20090513-1114-0tvjyz67/info/6 > on CLEMSON-IT > Progress: Stage in:1 > Progress: Submitted:1 > Progress: Submitted:1 > Progress: Submitted:1 > Progress: Active:1 > Progress: Active:1 > Progress: Checking status:1 > Failed to transfer wrapper log from 061-cattwo-20090513-1114-0tvjyz67/info/8 > on CLEMSON-IT > Progress: Stage in:1 > Progress: Submitted:1 > Progress: Submitted:1 > Progress: Submitted:1 > Progress: Submitted:1 > Progress: Active:1 > Progress: Active:1 > Progress: Active:1 > Progress: Checking status:1 > Failed to transfer wrapper log from 061-cattwo-20090513-1114-0tvjyz67/info/a > on CLEMSON-IT > Progress: Failed:1 > Execution failed: > Exception in cat: > Arguments: [061-cattwo.1.in, 061-cattwo.2.in] > Host: CLEMSON-IT > Directory: 061-cattwo-20090513-1114-0tvjyz67/jobs/a/cat-as69oqaj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > No status file was found. Check the shared filesystem on CLEMSON-IT > > > Ben Clifford wrote: > > I meant to also make a sites file for Swift that does not use coasters. That > > will find some problems that I think the below tests will not find. > > > > On Wed, 13 May 2009, Zhao Zhang wrote: > > > > > > > Ah, yes, I ran them, here are the results. > > > [zzhang at tp-grid1 coaster_test]$ globus-job-run osggate.clemson.edu > > > /usr/bin/id > > > uid=145767(osg) gid=10000(cuuser) groups=10000(cuuser) > > > [zzhang at tp-grid1 coaster_test]$ globus-job-run > > > osggate.clemson.edu/jobmanager-condor /usr/bin/id > > > uid=145767(osg) gid=94175(osg) groups=94175(osg) > > > > > > [zzhang at tp-grid1 sites]$ globus-url-copy > > > gsiftp://osggate.clemson.edu/home/osg/logs_clemson.tar > > > file://`pwd`/logs_clemson.tar > > > > > > So both globus-job-run and globus-url-copy are working. > > > > > > zhao > > > > > > Ben Clifford wrote: > > > > > > > When you tests these sites with coasters, and they fail, it would be > > > > useful > > > > perhaps to also test them without coasters and report if that works. > > > > This > > > > also gives more testing for the non-coaster case generally. > > > > > > > > On Wed, 13 May 2009, Zhao Zhang wrote: > > > > > > > > > > > > > Hi, Mihael > > > > > > > > > > I got a error returned for the language-behavior test on Clemson-IT > > > > > site > > > > > with > > > > > coaster. > > > > > All gram logs are at > > > > > /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/logs_clemson. > > > > > Could you help on checking that error there is. 
> > > > > > > > > > sites.xml > > > > > [zzhang at tp-grid1 sites]$ cat coaster_test/CLEMSON-IT.xml > > > > > > > > > > > > > > > > > > > > > > > > > > > > > jobManager="gt2:gt2:condor"/> > > > > > /scratch/data/osg/tmp/CLEMSON-IT > > > > > > > > > > > > > > > > > > > > > > > > > Here is the stdout: > > > > > > > > > > testing site configuration: coaster_test/CLEMSON-IT.xml > > > > > Removing files from previous runs > > > > > Running test 061-cattwo at Wed May 13 10:27:04 CDT 2009 > > > > > Swift svn swift-r2896 cog-r2392 > > > > > > > > > > RunID: 20090513-1027-d406ke07 > > > > > Progress: > > > > > Progress: Stage in:1 > > > > > Progress: Submitting:1 > > > > > Progress: Submitting:1 > > > > > Progress: Submitted:1 > > > > > Progress: Active:1 > > > > > Failed to transfer wrapper log from > > > > > 061-cattwo-20090513-1027-d406ke07/info/u > > > > > on CLEMSON-IT > > > > > Progress: Submitted:1 > > > > > Progress: Active:1 > > > > > Failed to transfer wrapper log from > > > > > 061-cattwo-20090513-1027-d406ke07/info/x > > > > > on CLEMSON-IT > > > > > Progress: Stage in:1 > > > > > Progress: Submitted:1 > > > > > Progress: Active:1 > > > > > Failed to transfer wrapper log from > > > > > 061-cattwo-20090513-1027-d406ke07/info/z > > > > > on CLEMSON-IT > > > > > Progress: Failed:1 > > > > > Execution failed: > > > > > Exception in cat: > > > > > Arguments: [061-cattwo.1.in, 061-cattwo.2.in] > > > > > Host: CLEMSON-IT > > > > > Directory: 061-cattwo-20090513-1027-d406ke07/jobs/z/cat-zsqcmqaj > > > > > stderr.txt: > > > > > > > > > > stdout.txt: > > > > > > > > > > ---- > > > > > > > > > > Caused by: > > > > > Failed to start worker: Worker ended prematurely > > > > > Cleaning up... > > > > > Shutting down service at https://130.127.11.89:43784 > > > > > Got channel MetaChannel: 1707082587 -> GSSSChannel-null(1) > > > > > - Done > > > > > SWIFT RETURN CODE NON-ZERO - test 061-cattwo > > > > > > > > > > > > > > > zhao > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > > > > From skenny at uchicago.edu Wed May 13 12:44:04 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Wed, 13 May 2009 12:44:04 -0500 (CDT) Subject: [Swift-devel] Tool for TeraGrid site info In-Reply-To: <49F87BE6.8020603@mcs.anl.gov> References: <49F87BE6.8020603@mcs.anl.gov> Message-ID: <20090513124404.BXN59020@m4500-02.uchicago.edu> hey ben, we are looking at using glen's code as a starting point for a tool that could be included in swift/committed to swift's svn. there will need to be quite a bit of testing first as i'm unsure about committing someone else's code 'as is'...however, my other question regarding this is that there is apparently some *paperwork* involved in this process which would require me (and glen?) to sign some sort of globus licensing agreement (?) i'm unsure of the specifics but could you let me know the best way to get the ball rolling? thanks ~sarah ---- Original message ---- >Date: Wed, 29 Apr 2009 11:10:14 -0500 >From: Michael Wilde >Subject: [Swift-devel] Tool for TeraGrid site info >To: Glen Hocky , swift-devel > >Glen, where is the latest version of your TG sites info tool? > >Ben, what should we do to start testing that, making it uniform with the >osg tool, and make it available for comments and heading towards >inclusion in Swift? 
> >(I think this was commented on earlier - I will track down when I get
>back to the planning list in a few days from now)
>_______________________________________________
>Swift-devel mailing list
>Swift-devel at ci.uchicago.edu
>http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From benc at hawaga.org.uk Wed May 13 12:56:15 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 13 May 2009 17:56:15 +0000 (GMT)
Subject: [Swift-devel] Tool for TeraGrid site info
In-Reply-To: <20090513124404.BXN59020@m4500-02.uchicago.edu>
References: <49F87BE6.8020603@mcs.anl.gov> <20090513124404.BXN59020@m4500-02.uchicago.edu>
Message-ID:

On Wed, 13 May 2009, skenny at uchicago.edu wrote:

> we are looking at using glen's code as a starting point for a
> tool that could be included in swift/committed to swift's svn.
> there will need to be quite a bit of testing first as i'm
> unsure about committing someone else's code 'as is'...

I think a bunch of testing will show whether code is useful or not. But it
needs additional stuff like documentation. Mats' work on the OSG sites
file generator went in quite easily - it has some basic documentation (the
command line help) and it worked in a bunch of informal tests. Ideally the
test suite and the userguide would be appropriately amended too...

> my other question regarding this is that there is apparently
> some *paperwork* involved in this process which would require
> me (and glen?) to sign some sort of globus licensing agreement
> (?)

mmm paperwork. If you're working on glen's code, then at least you, glen,
glen's employer and your employer need to sign globus contributor licenses
or have existing licenses appropriately amended to cover you/him.

I think maybe you don't already have a license signed - I don't remember
you doing one. In which case look at
http://dev.globus.org/wiki/Guidelines#License_and_Contributor_Agreements
for the individual license agreement and give a copy to Gigi.

Glen needs to do the same, and have his employer(s) give a license - I
don't know in his case who employs him.

(the above reflects my understanding of the dev.globus guidelines and is
not to be construed as legal advice under US, UK or South African law)

--

From hockyg at uchicago.edu Wed May 13 13:03:08 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Wed, 13 May 2009 13:03:08 -0500
Subject: [Swift-devel] Tool for TeraGrid site info
In-Reply-To:
References: <49F87BE6.8020603@mcs.anl.gov> <20090513124404.BXN59020@m4500-02.uchicago.edu>
Message-ID: <9DD89F57-D536-4BFF-8A62-1A6AEC327963@uchicago.edu>

Though I would be happy to be a contributor, I would say my code at best
should be a guideline.

Oh, and I have command-line help.

On May 13, 2009, at 12:56 PM, Ben Clifford wrote:

> [...]

From wilde at mcs.anl.gov Wed May 13 13:06:43 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 13 May 2009 13:06:43 -0500
Subject: [Swift-devel] Tool for TeraGrid site info
In-Reply-To: <20090513124404.BXN59020@m4500-02.uchicago.edu>
References: <49F87BE6.8020603@mcs.anl.gov> <20090513124404.BXN59020@m4500-02.uchicago.edu>
Message-ID: <4A0B0C33.3060201@mcs.anl.gov>

Sarah, Glen,

The licensing guidelines that govern Swift code (as a Globus incubator
project) are at:
http://dev.globus.org/wiki/Guidelines#License_and_Contributor_Agreements

Hopefully you are both OK with the individual licensing agreement for
contributing code, and can sign that as individuals and send a copy to
Gigi Rohder to keep on file. (Argonne, MCS, Bldg. 221 Rm. D-146)

This ensures that the code that becomes part of Swift can be freely
distributed for any legitimate use.

If you give me the signed original, I will get it back to Gigi.

We have not been diligent about tracking this, but if anyone else that is
contributing code (who should be on this list) has not signed the
agreement, please either do so or let me know.

Thanks,

Mike

On 5/13/09 12:44 PM, skenny at uchicago.edu wrote:
> [...]

From zhaozhang at uchicago.edu Wed May 13 13:09:26 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Wed, 13 May 2009 13:09:26 -0500
Subject: [Swift-devel] Error in Coaster test on Clemson-IT site
In-Reply-To:
References: <4A0AEE4E.1070502@uchicago.edu> <4A0AEFA9.3050508@uchicago.edu> <4A0B058C.9050200@uchicago.edu>
Message-ID: <4A0B0CD6.5070709@uchicago.edu>

Hi, Ben

I reran it with the following sites.xml, and it still failed.

<config>
  <pool handle="CLEMSON-IT">
    <gridftp url="gsiftp://osggate.clemson.edu"/>
    <execution provider="gt2" url="osggate.clemson.edu/jobmanager-condor"/>
    <workdirectory>/scratch/data/osg/tmp/CLEMSON-IT</workdirectory>
  </pool>
</config>

[zzhang at tp-grid1 sites]$ ./run-site regular/CLEMSON-IT.xml
testing site configuration: regular/CLEMSON-IT.xml
Removing files from previous runs
Running test 061-cattwo at Wed May 13 13:06:51 CDT 2009
Swift svn swift-r2896 cog-r2392

RunID: 20090513-1306-yiec15sd
Progress:
Progress: Stage in:1
Progress: Submitting:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Active:1
Failed to transfer wrapper log from 061-cattwo-20090513-1306-yiec15sd/info/g
on CLEMSON-IT
Progress: Submitted:1
Progress: Submitted:1
Progress: Active:1
Failed to transfer wrapper log from 061-cattwo-20090513-1306-yiec15sd/info/i
on CLEMSON-IT
Progress: Stage in:1
Progress: Active:1
Failed to transfer wrapper log from 061-cattwo-20090513-1306-yiec15sd/info/k
on CLEMSON-IT
Progress: Failed:1
Execution failed:
Exception in cat:
Arguments: [061-cattwo.1.in, 061-cattwo.2.in]
Host: CLEMSON-IT
Directory: 061-cattwo-20090513-1306-yiec15sd/jobs/k/cat-k8dqsqaj
stderr.txt:

stdout.txt:

----

Caused by:
No status file was found. Check the shared filesystem on CLEMSON-IT

SWIFT RETURN CODE NON-ZERO - test 061-cattwo

Ben Clifford wrote:
> OK. But try with plain swift + gt2, using a <execution> element in the
> sites file. No condor-g, no coasters.
>
> On Wed, 13 May 2009, Zhao Zhang wrote:
>
>> ok, I ran the same test case with condor-g, it failed.
>> [...]

From wilde at mcs.anl.gov Wed May 13 13:10:05 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 13 May 2009 13:10:05 -0500
Subject: [Swift-devel] Tool for TeraGrid site info
In-Reply-To: <9DD89F57-D536-4BFF-8A62-1A6AEC327963@uchicago.edu>
References: <49F87BE6.8020603@mcs.anl.gov> <20090513124404.BXN59020@m4500-02.uchicago.edu> <9DD89F57-D536-4BFF-8A62-1A6AEC327963@uchicago.edu>
Message-ID: <4A0B0CFD.2060909@mcs.anl.gov>

Glen, your code would in this case be used as a base, for as long as it's
useful. Like any initial cut of a tool, many hands may contribute over
time, and the code may get partially or totally re-written.

Signing the contributor's agreement ensures that the code (and code
derived from it) remains available for open usage. So assuming that
you're OK with that, you should do so.

- Mike

On 5/13/09 1:03 PM, Glen Hocky wrote:
> [...]
From wilde at mcs.anl.gov Wed May 13 13:18:32 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 13 May 2009 13:18:32 -0500
Subject: [Swift-devel] Swift over Falkon on Sicortex
Message-ID: <4A0B0EF8.3030905@mcs.anl.gov>

Allan will be trying to get the mechanisms for running Swift over Falkon
on the Sicortex working, and documented, for the Argonne user who asked
for this recently on swift-user.

He'll be using Falkon as the job provider and ssh as the data provider,
testing initially from communicado to the Sicortex.

Ioan has an initial config running for a Falkon service on communicado to
Falkon agents on the Sicortex using ssh tunnels.

We're using communicado because we don't have any machine that both
mounts the Sicortex filesystem and runs Java, which we need for Swift and
Falkon.

I've suggested that they use swift-devel to discuss progress and problems
on this effort.

- Mike

From wilde at mcs.anl.gov Wed May 13 13:30:00 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 13 May 2009 13:30:00 -0500
Subject: [Swift-devel] Tool for TeraGrid site info
In-Reply-To: <20090513124404.BXN59020@m4500-02.uchicago.edu>
References: <49F87BE6.8020603@mcs.anl.gov> <20090513124404.BXN59020@m4500-02.uchicago.edu>
Message-ID: <4A0B11A8.1000303@mcs.anl.gov>

On 5/13/09 12:44 PM, skenny at uchicago.edu wrote:
> hey ben,
>
> we are looking at using glen's code as a starting point for a
> tool that could be included in swift/committed to swift's svn.
> there will need to be quite a bit of testing first as i'm
> unsure about committing someone else's code 'as is'

If Glen agrees to contribute the code (by signing the license), then I
think you should commit his exact code as the initial revision, and edit
from there.

- Mike

> [...]

From hockyg at uchicago.edu Wed May 13 13:33:42 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Wed, 13 May 2009 13:33:42 -0500
Subject: [Swift-devel] Tool for TeraGrid site info
In-Reply-To: <4A0B11A8.1000303@mcs.anl.gov>
References: <49F87BE6.8020603@mcs.anl.gov> <20090513124404.BXN59020@m4500-02.uchicago.edu> <4A0B11A8.1000303@mcs.anl.gov>
Message-ID: <4A0B1286.20807@uchicago.edu>

I'm ok with contributing my code. One thing to note, though: rather than
using the info as posted on info.teragrid, my script downloads their tool
for querying info.teragrid. I did this because I was hoping that their
tool would keep a uniform formatting whereas the back-end might change.

There could, however, be licensing issues with this usage of their tool....

Michael Wilde wrote:
> [...]

From wilde at mcs.anl.gov Wed May 13 13:48:06 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 13 May 2009 13:48:06 -0500
Subject: [Swift-devel] Tool for TeraGrid site info
In-Reply-To: <4A0B1286.20807@uchicago.edu>
References: <49F87BE6.8020603@mcs.anl.gov> <20090513124404.BXN59020@m4500-02.uchicago.edu> <4A0B11A8.1000303@mcs.anl.gov> <4A0B1286.20807@uchicago.edu>
Message-ID: <4A0B15E6.40205@mcs.anl.gov>

I think it's best if we do not copy the tg tool. I don't think there are
any licensing issues with using the tg sites info tool, but there
certainly are issues with putting it into the swift source tree and hence
re-distributing it. We definitely do not want to do the latter.

Dependencies on the format of the tool we should live with, and adapt our
code as needed. I think we want to use, e.g., the REST interface, which I
would hope they either seldom change, or they attempt to keep changes to
it backwards compatible.

- Mike

On 5/13/09 1:33 PM, Glen Hocky wrote:
> [...]

From wilde at mcs.anl.gov Wed May 13 13:59:46 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 13 May 2009 13:59:46 -0500
Subject: [Swift-devel] Tool for TeraGrid site info
In-Reply-To: <4A0B15E6.40205@mcs.anl.gov>
References: <49F87BE6.8020603@mcs.anl.gov> <20090513124404.BXN59020@m4500-02.uchicago.edu> <4A0B11A8.1000303@mcs.anl.gov> <4A0B1286.20807@uchicago.edu> <4A0B15E6.40205@mcs.anl.gov>
Message-ID: <4A0B18A2.9080204@mcs.anl.gov>

Glen, I re-read your mail and realize maybe I misunderstood. When you
said "my script downloads their tool for querying info.teragrid" did you
mean that tgsites first dynamically downloads the tg info tool, and then
runs it, unmodified?

If so, that seems perhaps not the best technical solution, but as long as
the tg tool is not placed into svn and is not redistributed by us, using
it by downloading seems to me to not have any licensing issues, and I
would go ahead and start with whatever is working as the base version to
build on and improve.

- Mike

On 5/13/09 1:48 PM, Michael Wilde wrote:
> [...]

From hockyg at uchicago.edu Wed May 13 14:04:18 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Wed, 13 May 2009 14:04:18 -0500
Subject: [Swift-devel] Tool for TeraGrid site info
In-Reply-To: <4A0B18A2.9080204@mcs.anl.gov>
References: <49F87BE6.8020603@mcs.anl.gov> <20090513124404.BXN59020@m4500-02.uchicago.edu> <4A0B11A8.1000303@mcs.anl.gov> <4A0B1286.20807@uchicago.edu> <4A0B15E6.40205@mcs.anl.gov> <4A0B18A2.9080204@mcs.anl.gov>
Message-ID: <4A0B19B2.3030402@uchicago.edu>

yes, that is what it does... i was just lazy, but there are certainly
other, better ways

Michael Wilde wrote:
> [...]

From hategan at mcs.anl.gov Thu May 14 11:34:50 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 14 May 2009 11:34:50 -0500
Subject: [Swift-devel] Re: coaster features
In-Reply-To: <4A0C3134.6060907@gmail.com>
References: <4A0C3134.6060907@gmail.com>
Message-ID: <1242318890.2365.14.camel@localhost>

On Thu, 2009-05-14 at 09:56 -0500, Zhao Zhang wrote:
> Hi, Mihael
>
> I am making a coaster test plan now, so it would be good if you could
> help me clarify what features of coaster are available in the current
> implementation.
>
> 1) Multiple coaster workers on 1 compute node.
> This feature is documented, but I am not sure if it is working or not.

Yes, this is supposed to work.

> 2) coasterWorkerMaxwalltime
> This is working.
> 3) coasterInternalIP
> I know this one is working, as I have already completed tests with
> this parameter.
> I am not clear under what context we need this feature. One I could
> think of is "using notification instead of status files". Is there any
> other condition, or any specific site that requires this feature?

I'm not sure how you know it's working if you don't know what it's
supposed to do.

Without this, workers use the external IP (the one the service is
started with) to contact the service. With it, one can force the workers
to contact a specific IP after starting. This is needed if the worker
nodes don't generally have access to the outside of the cluster.

This has nothing to do with notification or status files. If this
setting is required (due to the network configuration of the cluster)
and it is not specified, the coasters won't work at all.

> 4) coasterDataProvider
> This is working now.

I don't know what that is.

> 5) Block Allocation
> Is this being implemented soon, or do we decide to postpone this
> feature?

The priority of this item has, again, become the highest.
From zhaozhang at uchicago.edu Thu May 14 11:36:14 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 14 May 2009 11:36:14 -0500
Subject: [Swift-devel] Re: coaster features
In-Reply-To: <1242318890.2365.14.camel@localhost>
References: <4A0C3134.6060907@gmail.com> <1242318890.2365.14.camel@localhost>
Message-ID: <4A0C487E.1010203@uchicago.edu>

Cool, thanks, Mihael.

Mihael Hategan wrote:
> On Thu, 2009-05-14 at 09:56 -0500, Zhao Zhang wrote:
> [...]
>
> I'm not sure how you know it's working if you don't know what it's
> supposed to do.
>
Yes, there is a test case in Ben's test suite using the
coasterInternalIP variable. What I know is that the test returned
successfully.

> Without this, workers use the external IP (the one the service is
> started with) to contact the service. With it, one can force the workers
> to contact a specific IP after starting. This is needed if the worker
> nodes don't generally have access to the outside of the cluster.
>
> This has nothing to do with notification or status files. If this
> setting is required (due to the network configuration of the cluster)
> and it is not specified, the coasters won't work at all.
>
>> 4) coasterDataProvider
>> This is working now.
>
> I don't know what that is.
>
This is the "<filesystem provider="coaster" .../>" element from
http://www.ci.uchicago.edu/swift/guides/userguide/coasters.php.

zhao

>> 5) Block Allocation
>> Is this being implemented soon, or do we decide to postpone this
>> feature?
>
> The priority of this item has, again, become the highest.
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From zhangzhao0718 at gmail.com Thu May 14 09:56:52 2009
From: zhangzhao0718 at gmail.com (Zhao Zhang)
Date: Thu, 14 May 2009 09:56:52 -0500
Subject: [Swift-devel] coaster features
Message-ID: <4A0C3134.6060907@gmail.com>

Hi, Mihael

I am making a coaster test plan now, so it would be good if you could
help me clarify what features of coaster are available in the current
implementation.

1) Multiple coaster workers on 1 compute node.
This feature is documented, but I am not sure if it is working or not.
2) coasterWorkerMaxwalltime
This is working.
3) coasterInternalIP
I know this one is working, as I have already completed tests with
this parameter.
I am not clear under what context we need this feature. One I could
think of is "using notification instead of status files". Is there any
other condition, or any specific site that requires this feature?
4) coasterDataProvider
This is working now.
5) Block Allocation
Is this being implemented soon, or do we decide to postpone this
feature?
6) Coaster over Condor-G
I am not sure about the priority of 5) and 6).

If there are any other coaster features I need to test, please remind
me. Thanks.
best wishes
zhao

From hategan at mcs.anl.gov Thu May 14 11:52:30 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 14 May 2009 11:52:30 -0500
Subject: [Swift-devel] Re: coaster features
In-Reply-To: <4A0C487E.1010203@uchicago.edu>
References: <4A0C3134.6060907@gmail.com> <1242318890.2365.14.camel@localhost> <4A0C487E.1010203@uchicago.edu>
Message-ID: <1242319950.3216.2.camel@localhost>

On Thu, 2009-05-14 at 11:36 -0500, Zhao Zhang wrote:
> >
> > I'm not sure how you know it's working if you don't know what it's
> > supposed to do.
> >
> Yes, there is a test case in Ben's test suite using the
> coasterInternalIP variable. What I know is that the test returned
> successfully.

Did you try to remove that parameter and see if there is a difference?

> > [...]
> >
> This is the "<filesystem provider="coaster" .../>" element from
> http://www.ci.uchicago.edu/swift/guides/userguide/coasters.php.

Right. The coaster fs provider. Glen found some data corruption with it
when transferring files larger than a few KB. So it's only somewhat
working.

From benc at hawaga.org.uk Thu May 14 12:26:04 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 14 May 2009 17:26:04 +0000 (GMT)
Subject: [Swift-devel] Re: coaster features
In-Reply-To: <1242319950.3216.2.camel@localhost>
References: <4A0C3134.6060907@gmail.com> <1242318890.2365.14.camel@localhost> <4A0C487E.1010203@uchicago.edu> <1242319950.3216.2.camel@localhost>
Message-ID:

On Thu, 14 May 2009, Mihael Hategan wrote:

> Right. The coaster fs provider. Glen found some data corruption with it
> when transferring files larger than a few KB. So it's only somewhat
> working.

Probably should get some automated or semi-automated tests to test
various interesting sizes of files for different data transfer providers.

--

From zhaozhang at uchicago.edu Thu May 14 13:52:21 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 14 May 2009 13:52:21 -0500
Subject: [Swift-devel] Re: coaster features
In-Reply-To: <1242319950.3216.2.camel@localhost>
References: <4A0C3134.6060907@gmail.com> <1242318890.2365.14.camel@localhost> <4A0C487E.1010203@uchicago.edu> <1242319950.3216.2.camel@localhost>
Message-ID: <4A0C6865.2050403@uchicago.edu>

Hi, Mihael

Mihael Hategan wrote:
> [...]
>
> Did you try to remove that parameter and see if there is a difference?
>
I just tried this out with and without that internalIP option. It did
make a difference on the renci-engage site.

The log for the test with InternalIP is at
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/coaster-log/renci-log-internal-IP
The log for the test without InternalIP is at
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/coaster-log/renci-log-no-internalIP
The gram logs are at
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/coaster-log/renci-gram-log.

Could you help make sure that it is really this internalIP option that
makes the difference? Thanks.

zhao

> [...]

From benc at hawaga.org.uk Fri May 15 07:37:47 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 May 2009 12:37:47 +0000 (GMT)
Subject: [Swift-devel] nmi build test died
Message-ID:

NMI build and test appears to have been broken for the past week. So
there are no Swift nightly builds happening at the moment. caveat
commitor.

--

From zhaozhang at uchicago.edu Fri May 15 11:41:23 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 15 May 2009 11:41:23 -0500
Subject: [Swift-devel] Re: coaster features
In-Reply-To: <1242318890.2365.14.camel@localhost>
References: <4A0C3134.6060907@gmail.com> <1242318890.2365.14.camel@localhost>
Message-ID: <4A0D9B33.2020601@uchicago.edu>

Hi, Mihael

Mihael Hategan wrote:
> Without this, workers use the external IP (the one the service is
> started with) to contact the service. With it, one can force the workers
> to contact a specific IP after starting. This is needed if the worker
> nodes don't generally have access to the outside of the cluster.
>
> [...]

Again, about this feature. How could we know that we need the internal
IP for a site before we run anything on it? Is it by experience, or
could some tool tell us about it? Thanks.

zhao

From hategan at mcs.anl.gov Fri May 15 11:52:49 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 15 May 2009 11:52:49 -0500
Subject: [Swift-devel] Re: coaster features
In-Reply-To: <4A0D9B33.2020601@uchicago.edu>
References: <4A0C3134.6060907@gmail.com> <1242318890.2365.14.camel@localhost> <4A0D9B33.2020601@uchicago.edu>
Message-ID: <1242406369.738.2.camel@localhost>

On Fri, 2009-05-15 at 11:41 -0500, Zhao Zhang wrote:
> Again, about this feature. How could we know that we need the internal
> IP for a site before we run anything on it? Is it by experience, or
> could some tool tell us about it? Thanks.

You need it when you can't access the head node's public IP from a
worker node. There is no tool I know of to test whether that's the case.

From benc at hawaga.org.uk Fri May 15 12:14:16 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 May 2009 17:14:16 +0000 (GMT)
Subject: [Swift-devel] Re: coaster features
In-Reply-To: <1242406369.738.2.camel@localhost>
References: <4A0C3134.6060907@gmail.com> <1242318890.2365.14.camel@localhost> <4A0D9B33.2020601@uchicago.edu> <1242406369.738.2.camel@localhost>
Message-ID:

On Fri, 15 May 2009, Mihael Hategan wrote:

> You need it when you can't access the head node's public IP from a
> worker node. There is no tool I know of to test whether that's the case.

I'm not aware of any particularly reliable way to do this. Lots of messy
hacks that could be done but bleugh.

--

From benc at hawaga.org.uk Fri May 15 12:25:10 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 May 2009 17:25:10 +0000 (GMT)
Subject: [Swift-devel] Globus Job Manager Scaling (fwd) with Swift
Message-ID:

Hi,

I've heard that there is some work within OSG for deploying GRAM5 test
installations. I'm interested in pointing a Swift client at this and
seeing how it compares to other mechanisms we have for submitting to OSG.

> From: Stuart Martin
> Date: May 12, 2009 4:40:27 PM CDT
> To: Chander Sehgal
> Cc: Stuart Martin , 'Steve Tuecke' ,
> 'Dan Fraser' , 'Alain Roy' , 'Ruth'
> Subject: Re: Globus Job Manager Scaling
>
> Sounds good. Dan, lemme know what your plan is.
>
> -Stu
>
> On May 12, 2009, at May 12, 4:24 PM, Chander Sehgal wrote:
>
> > Stu,
> >
> > Thanks for the update. I am going to pass this baton to Dan Fraser in
> > his OSG Production Coordinator role. He is trying to sort out how to
> > move LIGO E at H forward and this is part of that...
> >
> > Chander
> >
> > -----Original Message-----
> > From: Stuart Martin [mailto:smartin at mcs.anl.gov]
> > Sent: Tuesday, May 12, 2009 9:18 AM
> > To: Chander Sehgal
> > Cc: Stuart Martin; Steve Tuecke; Dan Fraser; Alain Roy
> > Subject: Re: Globus Job Manager Scaling
> >
> > Hi Chander,
> >
> > I just sent this to gram-dev as a general announcement:
> >
> > http://lists.globus.org/pipermail/gram-dev/2009-May/001427.html
> >
> > Yes - GRAM5 is designed to solve the scalability issues that
> > non-condor-g clients have using GRAM2. So I'd recommend this Alpha
> > being set up on a Nebraska test host and have E at H try it out. It
> > has looked good in our testing; we'd love to get some real world
> > deployments and testing to see how it holds up.
> >
> > Do you want to convene a telecon with the various people that will be
> > involved with this?
> >
> > Lemme know how I can help.
> >
> > -Stu

From foster at anl.gov Fri May 15 17:04:22 2009
From: foster at anl.gov (Ian Foster)
Date: Fri, 15 May 2009 17:04:22 -0500
Subject: [Swift-devel] Fwd: Beowulf cluster run on Amazon EC2
References:
Message-ID: <38F2209D-2186-4831-9170-1F6DF1DD2B47@anl.gov>

An interesting project, with some similarities to what we want to do.

Begin forwarded message:

> http://code.google.com/p/elasticwulf/

From aespinosa at cs.uchicago.edu Sat May 16 13:52:31 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Sat, 16 May 2009 13:52:31 -0500
Subject: [Swift-devel] [dsl-seminar] This week's seminar on Fault Tolerance Computing
Message-ID: <50b07b4b0905161152o2cd840a2x54254be53ff6d995@mail.gmail.com>

Hi all,

For this week's seminar, we invited Rinku Gupta from ANL to share their
group's work on fault-tolerant systems. Below is the information for the
talk.

Title: Moving towards a Coordinated Infrastructure for Fault Tolerant
Systems
Speaker: Rinku Gupta, Argonne MCS
Date: Thursday May 21, 2009
Time: 4:30pm
Venue: RI 405

Abstract:
The need for leadership class fault-tolerance has steadily increased and
continues to increase as emerging high performance systems move towards
offering petascale level performance. While most high-end systems do
provide mechanisms for detection, notification and perhaps handling of
hardware and software related faults, the individual components present
in the system perform these actions separately. Knowledge about occurring
faults is seldom shared between different programs and almost never on a
system-wide basis. A typical system contains numerous programs that could
benefit from such knowledge, including applications, middleware
libraries, job schedulers, file systems, math libraries, monitoring
software, operating systems, and checkpointing software. The Coordinated
Infrastructure for Fault Tolerant Systems (CIFTS) initiative provides the
foundation necessary to enable systems to adapt to faults in a holistic
manner. CIFTS achieves this through the Fault Tolerance Backplane (FTB),
providing a unified management and communication framework, which can be
used by any program to publish fault-related information. In this talk, I
will present some of the work done by the CIFTS group towards the
development of FTB and FTB-enabled components.
You can also check out their group's page at
http://www.mcs.anl.gov/research/cifts/

See you guys on Thursday,
-Allan

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

From bugzilla-daemon at mcs.anl.gov Sun May 17 07:06:34 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 17 May 2009 07:06:34 -0500 (CDT)
Subject: [Swift-devel] [Bug 201] Pass long lists of in/out files to _swiftwrapper in a file
In-Reply-To:
References:
Message-ID: <20090517120634.17B502CAF2@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=201

Ben Clifford changed:

           What      |Removed    |Added
----------------------------------------------------------------------------
           Status    |NEW        |RESOLVED
           Resolution|           |FIXED

--- Comment #1 from Ben Clifford 2009-05-17 07:06:33 ---
r2940 adds a configuration parameter wrapper.parameter.mode that can be set
either in swift.properties or in sites.xml to control argument passing
conventions.
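A usage sketch for the new parameter; the value "files" is an assumption
based on the bug title and the comment above (a mode that passes the
input/output file lists to _swiftwrapper via a file), not verified against
r2940. In swift.properties:

wrapper.parameter.mode=files

or, per site, in sites.xml:

<profile namespace="swift" key="wrapper.parameter.mode">files</profile>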
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.

From benc at hawaga.org.uk Sun May 17 12:38:31 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 17 May 2009 17:38:31 +0000 (GMT)
Subject: [Swift-devel] serialised execution of @function and operator arguments
Message-ID:

I noticed that execution of @function arguments and operator operands is
serialised.

One day I'm sure that this will confuse or upset someone, though it seems
to have been ok so far in production.

When evaluation of @function arguments or operator operands was trivial,
this did not really matter, but recently (in the past year) these can
contain non-trivial code so this is perhaps something to change. In this
paragraph, by trivial, I mean not involving any external application
execution, and being evaluated entirely within the Swift client.

--

From tfreeman at mcs.anl.gov Sun May 17 15:24:55 2009
From: tfreeman at mcs.anl.gov (Tim Freeman)
Date: Sun, 17 May 2009 15:24:55 -0500
Subject: [Swift-devel] Swift-issues (PBS+NFS Cluster)
In-Reply-To:
References: <49EE1156.3090107@mcs.anl.gov> <49F3D235.70507@cs.uchicago.edu>
 <49F5AB7C.8020705@mcs.anl.gov> <49F65C55.6020509@cs.uchicago.edu>
 <20090428090709.4b242cc5@sietch> <49F74027.4030108@cs.uchicago.edu>
 <20090428130628.7202797d@sietch> <20090428205852.796e1e95@sietch>
 <20090429071241.37e18f24@sietch> <49FB42AE.5000703@cs.uchicago.edu>
 <20090501195400.6abe9c7d@sietch> <4A022A2A.6070907@cs.uchicago.edu>
 <4A02512E.8060809@cs.uchicago.edu> <4A025A1C.4030505@mcs.anl.gov>
 <4A030ECC.2030303@cs.uchicago.edu> <20090507125435.62306f23@sietch>
 <4A0323B7.8080206@mcs.anl.gov> <4A08AFF4.608@cs.uchicago.edu>
 <1281ADF3-E3D8-483C-AF98-01C3A009083F@anl.gov>
Message-ID: <20090517152455.656c4eab@sietch>

On Wed, 13 May 2009 16:58:50 +0000 (GMT)
Ben Clifford wrote:

> OK.
>
> Specifically for that problem, then, something that makes the S3 storage
> available at the swift submit side would avoid transferring data across
> the $-boundary and would avoid having to mess too much with the data
> management architecture.
>
> That could be either a cog provider to pull files from s3 (the existing
> dcache provider might serve as an example - it seems conceptually quite
> similar); or a posix mount of s3 on the submit node (in which case the
> local filesystem cog provider would work).

This package looks good and is apache2 licensed:

http://code.google.com/p/jclouds/

Tim

> This would not transfer data from s3 directly to the worker nodes, but
> that is not the problem being addressed here.
>
> On Wed, 13 May 2009, foster at anl.gov wrote:
>
> > The reason for wanting to use S3 is that one pays to move data to/from
> > amazon. So, we want to allow analysis of data that sits on S3, and puts
> > its output on S3.
> >
> > Ian -- from mobile
> >
> > On May 13, 2009, at 12:25 PM, Ben Clifford wrote:
> >
> > > On Wed, 13 May 2009, Ben Clifford wrote:
> > >
> > > > > Since each Amazon virtual node only has limited storage space
> > > > > (Small Instance for 1.7GB, Large Instance for 7.5GB), we may need
> > > > > to use EBS (Elastic Block Store) to store temp files created by
> > > > > swift. The EBS behaved like a hard
> > >
> > > For many applications, 1.7G is a very large amount of data.
> > >
> > > 7.5G of storage for your shared filesystem and 1.7G for worker node
> > > temporary files seems like a lot to me.
> > >
> > > So I think it's important to clarify: do you want to use s3 because you
> > > think it solves a storage problem, and if so, does that problem really
> > > exist now? or do you want to use s3 because it is an interesting thing to
> > > play with? (either way is fine, but I think it is important to clarify)
> > >
> > > --
> >
>

From wilde at mcs.anl.gov Sun May 17 22:37:34 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 17 May 2009 22:37:34 -0500
Subject: [Swift-devel] serialised execution of @function and operator arguments
In-Reply-To: References: Message-ID: <4A10D7FE.3020903@mcs.anl.gov>

Ben, can you clarify?

By 'serialized', I assume you mean that if the @function has multiple args, they are evaluated one at a time? And the same for any binary (2-arg) operator?

And is *each* arg evaluation done serially, all the way down to the leaves of the evaluation?

By "confuse or upset" you mean it will confuse users who would expect all the args to be evaluated in parallel, when the args are expressions that could directly or indirectly involve an app() execution?

Is this easy, or hard, to fix? (Just curious; doesn't seem urgent till it causes problems for a real application.)

On 5/17/09 12:38 PM, Ben Clifford wrote:
> I noticed that execution of @function arguments and operator operands is
> serialised.
>
> One day I'm sure that this will confuse or upset someone, though it seems
> to have been ok so far in production.
>
> When evaluation of @function arguments or operator operands was trivial,
> this did not really matter, but recently (in the past year) these can
> contain non-trivial code so this is perhaps something to change. In this
> paragraph, by trivial, I mean not involving any external application
> execution, and being evaluated entirely within the Swift client.

From benc at hawaga.org.uk Mon May 18 02:27:29 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 18 May 2009 07:27:29 +0000 (GMT)
Subject: [Swift-devel] serialised execution of @function and operator arguments
In-Reply-To: <4A10D7FE.3020903@mcs.anl.gov> References: <4A10D7FE.3020903@mcs.anl.gov> Message-ID:

On Sun, 17 May 2009, Michael Wilde wrote:

> By 'serialized', I assume you mean that if the @function has multiple args,
> they are evaluated one at a time? And the same for any binary (2-arg)
> operator?

yes

> And is *each* arg evaluation done serially, all the way down to the leaves of
> the evaluation?

yes

> By "confuse or upset" you mean it will confuse users who would expect all the
> args to be evaluated in parallel, when the args are expressions that could
> directly or indirectly involve an app() execution?

yes

> Is this easy, or hard, to fix? (Just curious; doesn't seem urgent till it
> causes problems for a real application.)

seems pretty easy - likely that will happen as a side-effect of my provenance work (it seems to be working in my dev tree already). I commented mostly because I wasn't expecting it.

--

From hategan at mcs.anl.gov Mon May 18 11:36:14 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 18 May 2009 11:36:14 -0500
Subject: [Swift-devel] new coaster code
Message-ID: <1242664574.24334.5.camel@localhost>

In the development of the block allocation code, I had to make a decision not to keep backwards compatibility (i.e. have both the old code and the block allocation code simultaneously usable). This is due to:

1. The old code, while working, had problems that in the long run didn't seem like a good idea to have.
2.
It's easier, from a development perspective, to not have to abstract over two fairly different implementations of a sub-system. What this means is that once I commit the block allocation code, the old behavior is gone, and if you learned to rely on its peculiarities, that won't work. So I'm trying to ask folks how they would like to see this handled. Opinions, suggestions, etc. Mihael From zhaozhang at uchicago.edu Mon May 18 11:47:30 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Mon, 18 May 2009 11:47:30 -0500 Subject: [Swift-devel] new coaster code In-Reply-To: <1242664574.24334.5.camel@localhost> References: <1242664574.24334.5.camel@localhost> Message-ID: <4A119122.6050703@uchicago.edu> Hi, Mihael Does this mean that the "Block Allocation" and "Dynamic Allocation" could not coexist? By Dynamic Allocation, I mean the one is currently working in coaster. zhao Mihael Hategan wrote: > In the development of the block allocation code, I had to make a > decision not to keep backwards compatibility (i.e. have both the old > code and the block allocation code simultaneously useable). This is due > to: > > 1. The old code, while working, had problems that in the long run didn't > seem like a good idea to have. > 2. It's easier, from a development perspective, to not have to abstract > over two fairly different implementations of a sub-system. > > What this means is that once I commit the block allocation code, the old > behavior is gone, and if you learned to rely on its peculiarities, that > won't work. > > So I'm trying to ask folks how they would like to see this handled. > Opinions, suggestions, etc. > > Mihael > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Mon May 18 11:53:05 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 18 May 2009 11:53:05 -0500 Subject: [Swift-devel] new coaster code In-Reply-To: <4A119122.6050703@uchicago.edu> References: <1242664574.24334.5.camel@localhost> <4A119122.6050703@uchicago.edu> Message-ID: <1242665585.24700.3.camel@localhost> On Mon, 2009-05-18 at 11:47 -0500, Zhao Zhang wrote: > Hi, Mihael > > Does this mean that the "Block Allocation" and "Dynamic Allocation" > could not coexist? By Dynamic Allocation, I mean > the one is currently working in coaster. Don't use the term "Dynamic allocation". You can make the block allocation code behave in a manner similar to the old code by tweaking the settings. My point is not about intended behavior, but about actual code. The old code won't be there along with the new code. > > zhao > > Mihael Hategan wrote: > > In the development of the block allocation code, I had to make a > > decision not to keep backwards compatibility (i.e. have both the old > > code and the block allocation code simultaneously useable). This is due > > to: > > > > 1. The old code, while working, had problems that in the long run didn't > > seem like a good idea to have. > > 2. It's easier, from a development perspective, to not have to abstract > > over two fairly different implementations of a sub-system. > > > > What this means is that once I commit the block allocation code, the old > > behavior is gone, and if you learned to rely on its peculiarities, that > > won't work. > > > > So I'm trying to ask folks how they would like to see this handled. > > Opinions, suggestions, etc. 
> > > > Mihael > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From benc at hawaga.org.uk Mon May 18 12:12:49 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 18 May 2009 17:12:49 +0000 (GMT) Subject: [Swift-devel] new coaster code In-Reply-To: <1242664574.24334.5.camel@localhost> References: <1242664574.24334.5.camel@localhost> Message-ID: Gratuituous behaviour changing is to be avoided. However coasters are not at the point in stability where lack-of-behaviour-change is desirable in preference to significant-enhancement-in-functionality. You should at least go through the userguide and delete any coaster-related text that no longer applies, so that it contains nothing in preference to mistruth. -- From hategan at mcs.anl.gov Mon May 18 12:32:40 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 18 May 2009 12:32:40 -0500 Subject: [Swift-devel] new coaster code In-Reply-To: References: <1242664574.24334.5.camel@localhost> Message-ID: <1242667960.25303.2.camel@localhost> On Mon, 2009-05-18 at 17:12 +0000, Ben Clifford wrote: > Gratuituous behaviour changing is to be avoided. However coasters are not > at the point in stability where lack-of-behaviour-change is desirable in > preference to significant-enhancement-in-functionality. Right. I'd rather this didn't have to be, but it seems like having this feature sooner is better than having both things. > > You should at least go through the userguide and delete any > coaster-related text that no longer applies, so that it contains nothing > in preference to mistruth. > Noted. From zhaozhang at uchicago.edu Mon May 18 14:08:54 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Mon, 18 May 2009 14:08:54 -0500 Subject: [Swift-devel] Re: coaster features In-Reply-To: <1242319950.3216.2.camel@localhost> References: <4A0C3134.6060907@gmail.com> <1242318890.2365.14.camel@localhost> <4A0C487E.1010203@uchicago.edu> <1242319950.3216.2.camel@localhost> Message-ID: <4A11B246.4030804@uchicago.edu> Hi, All Mihael Hategan wrote: > On Thu, 2009-05-14 at 11:36 -0500, Zhao Zhang wrote: > >>> I'm not sure how you know it's working if you don't know what it's >>> supposed to do. >>> >>> >> Yes, there is a test case in Ben's test suite using the >> coasterInternalIP variable. Things I know is that the test returned >> successful. >> > > Did you try to remove that parameter and see if there is a difference? > > >>> Without this, workers use the external IP (the one the service is >>> started with) to contact the service. With it, one can force the workers >>> to contact a specific IP after starting. This is needed if the worker >>> nodes don't generally have access to the outside of the cluster. >>> >>> This has nothing to do with notification or status files. If this >>> setting is required (due to the network configuration of the cluster) >>> and it is not specified, the coasters won't work at all. >>> >>> >>> >>>> 4) coasterDataProvider >>>> This is working now. >>>> >>>> >>> I don't know what that is. >>> >>> >> This is the "> />" from http://www.ci.uchicago.edu/swift/guides/userguide/coasters.php. >> > > Right. The coaster fs provider. Glen found some data corruption with it > when transferring files larger than a few KB. So it's only somewhat > working. > Has this feature been fixed? I found it again in my data movement test on OSG site. 
zhao
> >

From hategan at mcs.anl.gov Mon May 18 14:39:31 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 18 May 2009 14:39:31 -0500
Subject: [Swift-devel] Re: coaster features
In-Reply-To: <4A11B246.4030804@uchicago.edu> References: <4A0C3134.6060907@gmail.com> <1242318890.2365.14.camel@localhost> <4A0C487E.1010203@uchicago.edu> <1242319950.3216.2.camel@localhost> <4A11B246.4030804@uchicago.edu> Message-ID: <1242675571.27345.2.camel@localhost>

On Mon, 2009-05-18 at 14:08 -0500, Zhao Zhang wrote:
> >>>
> >> This is the " >> />" from
> >> http://www.ci.uchicago.edu/swift/guides/userguide/coasters.php.
> >>
> >
> > Right. The coaster fs provider. Glen found some data corruption with it
> > when transferring files larger than a few KB. So it's only somewhat
> > working.
>
> Has this feature been fixed?

No.

> I found it again in my data movement test
> on OSG site.

Can you be more specific? In general "feature X doesn't work properly" is not very useful. I recommend reading this:

http://www.chiark.greenend.org.uk/~sgtatham/bugs.html

From zhaozhang at uchicago.edu Mon May 18 16:32:03 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 18 May 2009 16:32:03 -0500
Subject: [Swift-devel] Re: coaster features
In-Reply-To: <1242675571.27345.2.camel@localhost> References: <4A0C3134.6060907@gmail.com> <1242318890.2365.14.camel@localhost> <4A0C487E.1010203@uchicago.edu> <1242319950.3216.2.camel@localhost> <4A11B246.4030804@uchicago.edu> <1242675571.27345.2.camel@localhost> Message-ID: <4A11D3D3.9030501@uchicago.edu>

Thanks, Mihael. I am restating that bug in full below.

I wrote a swift script to test the coaster-data-provider.

[zzhang at communicado 1KB]$ cat data-test.swift
type Mol {}
type DOCKOut {}

app (DOCKOut ofile) cat (Mol molfile)
{
    cat @filename(molfile) stdout=@filename(ofile);
}

Mol texts[] ;

doall(Mol texts[])
{
    foreach p in texts {
        DOCKOut r <regexp_mapper; source=@p,
                   match="input/(.*)\\.mol2",
                   transform=@strcat("result/", "\\1", ".out")>;
        (r) = cat(p);
    }
}

// Main
doall(texts);

It copies a file "D00001.mol2", then redirects the file to "D00001.out"; "D00001.out" is taken as the output file.

I have two test cases on CI network,
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/swift_data_test/1KB and
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/swift_data_test/1MB
If you want to repeat the test, you will also need
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/tc.data and
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/coaster_test/AGLT2.xml

[zzhang at communicado 1MB]$ cat ../../coaster_test/AGLT2.xml
/atlas/data08/OSG/DATA/osg/tmp/AGLT2

So, I ran the test twice with those two cases.

[zzhang at communicado 1KB]$ swift -sites.file ../../coaster_test/AGLT2.xml -tc.file ../../tc.data data-test.swift
Swift svn swift-r2896 cog-r2392
RunID: 20090518-1355-dksv64l5
Progress:
Progress: Stage in:1
Progress: Submitting:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Submitted:1
Progress: Active:1
Final status: Finished successfully:1
Cleaning up...
Shutting down service at https://192.41.230.11:41507
Got channel MetaChannel: 7132795 -> GSSSChannel-null(1)
- Done
[zzhang at communicado input]$ cksum D00001.mol2
2023409842 1054 D00001.mol2
[zzhang at communicado input]$ cksum ../result/D00001.out
2023409842 1054 ../result/D00001.out

The 1KB test case is fine.
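A quick way to bracket where the corruption starts is to sweep the input size and compare checksums at each step; a sketch reusing the same site file and script (the size loop is an illustration, not part of Zhao's actual test):

  # hypothetical sweep: regenerate the input at growing sizes,
  # run the same one-job workflow, and compare checksums
  for kb in 1 10 100 1000; do
    dd if=/dev/urandom of=input/D00001.mol2 bs=1024 count=$kb
    swift -sites.file ../../coaster_test/AGLT2.xml -tc.file ../../tc.data data-test.swift
    cksum input/D00001.mol2 result/D00001.out
  done

The first size at which the two checksums diverge is the threshold worth reporting in the bug.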
Then I tried 1MB test case [zzhang at communicado 1MB]$ swift -sites.file ../../coaster_test/AGLT2.xml -tc.file ../../tc.data data-test.swift Swift svn swift-r2896 cog-r2392 RunID: 20090518-1357-ad024vu1 Progress: Progress: Stage in:1 Progress: Submitting:1 Progress: Submitted:1 Progress: Active:1 Progress: Finished successfully:1 Final status: Finished successfully:1 Cleaning up... Shutting down service at https://192.41.230.11:47431 Got channel MetaChannel: 8676252 -> GSSSChannel-null(1) - Done [zzhang at communicado 1MB]$ cksum input/D00001.mol2 1748450002 1000000 input/D00001.mol2 [zzhang at communicado 1MB]$ cksum result/D00001.out 4159559983 1000000 result/D00001.out as you could tell from the check sum, the input and output are not the same. I repeated the 1MB test case, the same thing happened [zzhang at communicado 1MB]$ swift -sites.file ../../coaster_test/AGLT2.xml -tc.file ../../tc.data data-test.swift Swift svn swift-r2896 cog-r2392 RunID: 20090518-1629-1gef4tm8 Progress: Progress: Stage in:1 Progress: Submitting:1 Progress: Submitted:1 Progress: Active:1 Progress: Finished successfully:1 Final status: Finished successfully:1 Cleaning up... Shutting down service at https://192.41.230.11:43400 Got channel MetaChannel: 31518612 -> GSSSChannel-null(1) - Done [zzhang at communicado 1MB]$ cksum input/D00001.mol2 1748450002 1000000 input/D00001.mol2 [zzhang at communicado 1MB]$ cksum result/D00001.out 2414944118 1000000 result/D00001.out zhao Mihael Hategan wrote: > On Mon, 2009-05-18 at 14:08 -0500, Zhao Zhang wrote: > >>>>> >>>>> >>>> This is the ">>> />" from http://www.ci.uchicago.edu/swift/guides/userguide/coasters.php. >>>> >>>> >>> Right. The coaster fs provider. Glen found some data corruption with it >>> when transferring files larger than a few KB. So it's only somewhat >>> working. >>> >>> >> Has this feature been fixed? >> > > No. > > >> I found it again in my data movement test >> on OSG site. >> > > Can you be more specific? > > In general "feature X doesn't work properly" is not very useful. I > recommend reading this: > > http://www.chiark.greenend.org.uk/~sgtatham/bugs.html > > > From hategan at mcs.anl.gov Mon May 18 16:39:56 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 18 May 2009 16:39:56 -0500 Subject: [Swift-devel] Re: coaster features In-Reply-To: <4A11D3D3.9030501@uchicago.edu> References: <4A0C3134.6060907@gmail.com> <1242318890.2365.14.camel@localhost> <4A0C487E.1010203@uchicago.edu> <1242319950.3216.2.camel@localhost> <4A11B246.4030804@uchicago.edu> <1242675571.27345.2.camel@localhost> <4A11D3D3.9030501@uchicago.edu> Message-ID: <1242682796.30481.6.camel@localhost> Excellent! Please put this in bugzilla, and refer to the bug number whenever necessary. Mihael On Mon, 2009-05-18 at 16:32 -0500, Zhao Zhang wrote: > Thanks, Mihael. I am stating that bug again. > > I wrote a swift script to test the coaster-data-provider. > [zzhang at communicado 1KB]$ cat data-test.swift > type Mol {} > type DOCKOut {} > > app (DOCKOut ofile) cat (Mol molfile) > { > cat @filename(molfile) stdout=@filename(ofile); > } > > Mol texts[] ; > > doall(Mol texts[]) > { > foreach p in texts { > DOCKOut r source=@p, > match="input/(.*)\\.mol2", > transform=@strcat("result/", "\\1", ".out") > >; > > (r) = cat(p); > } > } > > // Main > doall(texts); > > > It copies a file "D00001.mol2", then redirects the file to "D00001.out", > "D00001.out" is taken as the output file. 
> > I have two test cases on CI netowrk, > /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/swift_data_test/1KB > and > /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/swift_data_test/1MB > If you want to repeat the test, you will also need > /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/tc.data and > /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/coaster_test/AGLT2.xml > [zzhang at communicado 1MB]$ cat ../../coaster_test/AGLT2.xml > > > > > jobManager="gt2:gt2:condor"/> > /atlas/data08/OSG/DATA/osg/tmp/AGLT2 > > > > > So, I ran the test twice with those two cases. > [zzhang at communicado 1KB]$ swift -sites.file ../../coaster_test/AGLT2.xml > -tc.file ../../tc.data data-test.swift > Swift svn swift-r2896 cog-r2392 > > RunID: 20090518-1355-dksv64l5 > Progress: > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Submitted:1 > Progress: Submitted:1 > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://192.41.230.11:41507 > Got channel MetaChannel: 7132795 -> GSSSChannel-null(1) > - Done > [zzhang at communicado input]$ cksum D00001.mol2 > 2023409842 1054 D00001.mol2 > [zzhang at communicado input]$ cksum ../result/D00001.out > 2023409842 1054 ../result/D00001.out > > The 1KB test case is fine. Then I tried 1MB test case > > [zzhang at communicado 1MB]$ swift -sites.file ../../coaster_test/AGLT2.xml > -tc.file ../../tc.data data-test.swift > Swift svn swift-r2896 cog-r2392 > RunID: 20090518-1357-ad024vu1 > Progress: > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Submitted:1 > Progress: Active:1 > Progress: Finished successfully:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://192.41.230.11:47431 > Got channel MetaChannel: 8676252 -> GSSSChannel-null(1) > - Done > [zzhang at communicado 1MB]$ cksum input/D00001.mol2 > 1748450002 1000000 input/D00001.mol2 > [zzhang at communicado 1MB]$ cksum result/D00001.out > 4159559983 1000000 result/D00001.out > > as you could tell from the check sum, the input and output are not the > same. I repeated the 1MB test case, the same > thing happened > [zzhang at communicado 1MB]$ swift -sites.file ../../coaster_test/AGLT2.xml > -tc.file ../../tc.data data-test.swift > Swift svn swift-r2896 cog-r2392 > > RunID: 20090518-1629-1gef4tm8 > Progress: > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Submitted:1 > Progress: Active:1 > Progress: Finished successfully:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://192.41.230.11:43400 > Got channel MetaChannel: 31518612 -> GSSSChannel-null(1) > - Done > [zzhang at communicado 1MB]$ cksum input/D00001.mol2 > 1748450002 1000000 input/D00001.mol2 > [zzhang at communicado 1MB]$ cksum result/D00001.out > 2414944118 1000000 result/D00001.out > > > zhao > > > Mihael Hategan wrote: > > On Mon, 2009-05-18 at 14:08 -0500, Zhao Zhang wrote: > > > >>>>> > >>>>> > >>>> This is the " >>>> />" from http://www.ci.uchicago.edu/swift/guides/userguide/coasters.php. > >>>> > >>>> > >>> Right. The coaster fs provider. Glen found some data corruption with it > >>> when transferring files larger than a few KB. So it's only somewhat > >>> working. > >>> > >>> > >> Has this feature been fixed? > >> > > > > No. > > > > > >> I found it again in my data movement test > >> on OSG site. > >> > > > > Can you be more specific? > > > > In general "feature X doesn't work properly" is not very useful. 
I > > recommend reading this: > > > > http://www.chiark.greenend.org.uk/~sgtatham/bugs.html > > > > > > From bugzilla-daemon at mcs.anl.gov Mon May 18 19:20:23 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 18 May 2009 19:20:23 -0500 (CDT) Subject: [Swift-devel] [Bug 209] New: 1MB file corrupted when transferred using Coaster-Data-Provider Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=209 Summary: 1MB file corrupted when transferred using Coaster-Data-Provider Product: Swift Version: unspecified Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: hategan at mcs.anl.gov ReportedBy: zhaozhang at uchicago.edu [zzhang at communicado 1KB]$ cat data-test.swift type Mol {} type DOCKOut {} app (DOCKOut ofile) cat (Mol molfile) { cat @filename(molfile) stdout=@filename(ofile); } Mol texts[] ; doall(Mol texts[]) { foreach p in texts { DOCKOut r ; (r) = cat(p); } } // Main doall(texts); It copies a file "D00001.mol2", then redirects the file to "D00001.out", "D00001.out" is taken as the output file. I have two test cases on CI netowrk, /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/swift_data_test/1KB and /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/swift_data_test/1MB If you want to repeat the test, you will also need /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/tc.data and /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/coaster_test/AGLT2.xml [zzhang at communicado 1MB]$ cat ../../coaster_test/AGLT2.xml /atlas/data08/OSG/DATA/osg/tmp/AGLT2 So, I ran the test twice with those two cases. [zzhang at communicado 1KB]$ swift -sites.file ../../coaster_test/AGLT2.xml -tc.file ../../tc.data data-test.swift Swift svn swift-r2896 cog-r2392 RunID: 20090518-1355-dksv64l5 Progress: Progress: Stage in:1 Progress: Submitting:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Active:1 Final status: Finished successfully:1 Cleaning up... Shutting down service at https://192.41.230.11:41507 Got channel MetaChannel: 7132795 -> GSSSChannel-null(1) - Done [zzhang at communicado input]$ cksum D00001.mol2 2023409842 1054 D00001.mol2 [zzhang at communicado input]$ cksum ../result/D00001.out 2023409842 1054 ../result/D00001.out The 1KB test case is fine. Then I tried 1MB test case [zzhang at communicado 1MB]$ swift -sites.file ../../coaster_test/AGLT2.xml -tc.file ../../tc.data data-test.swift Swift svn swift-r2896 cog-r2392 RunID: 20090518-1357-ad024vu1 Progress: Progress: Stage in:1 Progress: Submitting:1 Progress: Submitted:1 Progress: Active:1 Progress: Finished successfully:1 Final status: Finished successfully:1 Cleaning up... Shutting down service at https://192.41.230.11:47431 Got channel MetaChannel: 8676252 -> GSSSChannel-null(1) - Done [zzhang at communicado 1MB]$ cksum input/D00001.mol2 1748450002 1000000 input/D00001.mol2 [zzhang at communicado 1MB]$ cksum result/D00001.out 4159559983 1000000 result/D00001.out as you could tell from the check sum, the input and output are not the same. I repeated the 1MB test case, the same thing happened [zzhang at communicado 1MB]$ swift -sites.file ../../coaster_test/AGLT2.xml -tc.file ../../tc.data data-test.swift Swift svn swift-r2896 cog-r2392 RunID: 20090518-1629-1gef4tm8 Progress: Progress: Stage in:1 Progress: Submitting:1 Progress: Submitted:1 Progress: Active:1 Progress: Finished successfully:1 Final status: Finished successfully:1 Cleaning up... 
Shutting down service at https://192.41.230.11:43400 Got channel MetaChannel: 31518612 -> GSSSChannel-null(1) - Done [zzhang at communicado 1MB]$ cksum input/D00001.mol2 1748450002 1000000 input/D00001.mol2 [zzhang at communicado 1MB]$ cksum result/D00001.out 2414944118 1000000 result/D00001.out -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. From wilde at mcs.anl.gov Tue May 19 10:28:37 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 May 2009 10:28:37 -0500 Subject: [Swift-devel] Can we run Swift on SGE cluster w/o Globus? Message-ID: <4A12D025.9000101@mcs.anl.gov> Chris Henry of MCS/Bio wants to run Swift on an local cluster of the SEED project, which is running SGE but not Globus. Is there any way to do that with the current code? (Just checking - this is low prio, as we're trying to get his apps compiled for execution elsewhere). - Mike From hategan at mcs.anl.gov Tue May 19 10:30:41 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 May 2009 10:30:41 -0500 Subject: [Swift-devel] Can we run Swift on SGE cluster w/o Globus? In-Reply-To: <4A12D025.9000101@mcs.anl.gov> References: <4A12D025.9000101@mcs.anl.gov> Message-ID: <1242747041.7948.0.camel@localhost> On Tue, 2009-05-19 at 10:28 -0500, Michael Wilde wrote: > Chris Henry of MCS/Bio wants to run Swift on an local cluster of the > SEED project, which is running SGE but not Globus. Is there any way to > do that with the current code? No. Swift would need some form of SGE provider in order to submit jobs directly to SGE. From aespinosa at cs.uchicago.edu Tue May 19 11:46:08 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 19 May 2009 11:46:08 -0500 Subject: [Swift-devel] Can we run Swift on SGE cluster w/o Globus? In-Reply-To: <4A12D025.9000101@mcs.anl.gov> References: <4A12D025.9000101@mcs.anl.gov> Message-ID: <50b07b4b0905190946q8337d93t8bbe5deeb704f7f2@mail.gmail.com> what we did in Ranger was create a Falkon block and swift submits to the Falkon service. -Allan 2009/5/19 Michael Wilde : > Chris Henry of MCS/Bio wants to run Swift on an local cluster of the SEED > project, which is running SGE but not Globus. Is there any way to do that > with the current code? > > (Just checking - this is low prio, as we're trying to get his apps compiled > for execution elsewhere). > > - Mike From wilde at mcs.anl.gov Tue May 19 12:03:46 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 May 2009 12:03:46 -0500 Subject: [Swift-devel] Can we run Swift on SGE cluster w/o Globus? In-Reply-To: <50b07b4b0905190946q8337d93t8bbe5deeb704f7f2@mail.gmail.com> References: <4A12D025.9000101@mcs.anl.gov> <50b07b4b0905190946q8337d93t8bbe5deeb704f7f2@mail.gmail.com> Message-ID: <4A12E672.6020007@mcs.anl.gov> So you have a script on Ranger that creates a static-sized job of Falkon workers? Seems like that's something I could readily duplicate on the SEED machine. - Mike On 5/19/09 11:46 AM, Allan Espinosa wrote: > what we did in Ranger was create a Falkon block and swift submits to > the Falkon service. > > -Allan > > 2009/5/19 Michael Wilde : >> Chris Henry of MCS/Bio wants to run Swift on an local cluster of the SEED >> project, which is running SGE but not Globus. Is there any way to do that >> with the current code? >> >> (Just checking - this is low prio, as we're trying to get his apps compiled >> for execution elsewhere). 
>> >> - Mike From benc at hawaga.org.uk Tue May 19 12:07:27 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 19 May 2009 17:07:27 +0000 (GMT) Subject: [Swift-devel] Can we run Swift on SGE cluster w/o Globus? In-Reply-To: <50b07b4b0905190946q8337d93t8bbe5deeb704f7f2@mail.gmail.com> References: <4A12D025.9000101@mcs.anl.gov> <50b07b4b0905190946q8337d93t8bbe5deeb704f7f2@mail.gmail.com> Message-ID: On Tue, 19 May 2009, Allan Espinosa wrote: > what we did in Ranger was create a Falkon block and swift submits to > the Falkon service. If you're running services, you could also run a GRAM2 personal gatekeeper. -- From aespinosa at cs.uchicago.edu Tue May 19 13:02:03 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 19 May 2009 13:02:03 -0500 Subject: [Swift-devel] Can we run Swift on SGE cluster w/o Globus? In-Reply-To: <4A12E672.6020007@mcs.anl.gov> References: <4A12D025.9000101@mcs.anl.gov> <50b07b4b0905190946q8337d93t8bbe5deeb704f7f2@mail.gmail.com> <4A12E672.6020007@mcs.anl.gov> Message-ID: <50b07b4b0905191102m51211a03ie49f525e8109e5b6@mail.gmail.com> yes. basically all falkon workers are started as an mpi process which in turn connects to the falkon service. i made a snapshot of my code in /share/home/01035/tg802895/falkonranger.tar -Allan 2009/5/19 Michael Wilde : > So you have a script on Ranger that creates a static-sized job of Falkon > workers? > > Seems like that's something I could readily duplicate on the SEED machine. > > - Mike > > > On 5/19/09 11:46 AM, Allan Espinosa wrote: >> >> what we did in Ranger was create a Falkon block and swift submits to >> the Falkon service. >> >> -Allan >> >> 2009/5/19 Michael Wilde : >>> >>> Chris Henry of MCS/Bio wants to run Swift on an local cluster of the SEED >>> project, which is running SGE but not Globus. Is there any way to do that >>> with the current code? >>> >>> (Just checking - this is low prio, as we're trying to get his apps >>> compiled >>> for execution elsewhere). >>> >>> - Mike From iraicu at cs.uchicago.edu Tue May 19 14:09:37 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 19 May 2009 12:09:37 -0700 Subject: [Swift-devel] Can we run Swift on SGE cluster w/o Globus? In-Reply-To: <50b07b4b0905191102m51211a03ie49f525e8109e5b6@mail.gmail.com> References: <4A12D025.9000101@mcs.anl.gov> <50b07b4b0905190946q8337d93t8bbe5deeb704f7f2@mail.gmail.com> <4A12E672.6020007@mcs.anl.gov> <50b07b4b0905191102m51211a03ie49f525e8109e5b6@mail.gmail.com> Message-ID: <4A1303F1.70508@cs.uchicago.edu> Mike, there are scripts in the Falkon SVN (probably different than Allan's code), that mimic the BG/P setup, but for Ranger (calling out to SGE, instead of Cobalt). The scripts are all called *ranger*.sh, and can be found in $FALKON_HOME/bin. Ioan Allan Espinosa wrote: > yes. basically all falkon workers are started as an mpi process which > in turn connects to the falkon service. > > i made a snapshot of my code in /share/home/01035/tg802895/falkonranger.tar > > -Allan > > 2009/5/19 Michael Wilde : > >> So you have a script on Ranger that creates a static-sized job of Falkon >> workers? >> >> Seems like that's something I could readily duplicate on the SEED machine. >> >> - Mike >> >> >> On 5/19/09 11:46 AM, Allan Espinosa wrote: >> >>> what we did in Ranger was create a Falkon block and swift submits to >>> the Falkon service. >>> >>> -Allan >>> >>> 2009/5/19 Michael Wilde : >>> >>>> Chris Henry of MCS/Bio wants to run Swift on an local cluster of the SEED >>>> project, which is running SGE but not Globus. 
Is there any way to do that >>>> with the current code? >>>> >>>> (Just checking - this is low prio, as we're trying to get his apps >>>> compiled >>>> for execution elsewhere). >>>> >>>> - Mike >>>> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From zhaozhang at uchicago.edu Wed May 20 11:51:18 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 20 May 2009 11:51:18 -0500 Subject: [Swift-devel] could not start coaster on Ranger Message-ID: <4A143506.6010809@uchicago.edu> Hi, I am trying to run a sanity test with coaster on ranger. Here is my sites.xml definition: [zzhang at communicado sites]$ cat coaster_test/tgranger-sge-gram2.xml TG-MCB080099N /work/00946/zzhang/work /tmp/zzhang/jobdir 16 development 00:40:00 31 50 10 Then it failed with the following errors. Is there any configuration error I put there in the sites.xml? or any other information you need, please let me know. zhao [zzhang at communicado sites]$ ./run-site coaster_test/tgranger-sge-gram2.xml testing site configuration: coaster_test/tgranger-sge-gram2.xml Removing files from previous runs Running test 061-cattwo at Wed May 20 11:47:23 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090520-1147-yov15zb6 Progress: uninitialized:1 Progress: Stage in:1 Progress: Submitting:1 Progress: Submitted:1 Progress: Active:1 Failed to transfer wrapper log from 061-cattwo-20090520-1147-yov15zb6/info/k on tgtacc Progress: Submitted:1 Progress: Active:1 Failed to transfer wrapper log from 061-cattwo-20090520-1147-yov15zb6/info/m on tgtacc Progress: Stage in:1 Progress: Active:1 Failed to transfer wrapper log from 061-cattwo-20090520-1147-yov15zb6/info/o on tgtacc Progress: Failed:1 Execution failed: Exception in cat: Arguments: [061-cattwo.1.in, 061-cattwo.2.in] Host: tgtacc Directory: 061-cattwo-20090520-1147-yov15zb6/jobs/o/cat-oa9l83bj stderr.txt: stdout.txt: ---- Caused by: Failed to start worker: null null org.globus.gram.GramException: The job manager detected an invalid script response at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) at org.globus.gram.GramJob.setStatus(GramJob.java:184) at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) at java.lang.Thread.run(Thread.java:619) Cleaning up... 
Shutting down service at https://129.114.50.163:60748 Got channel MetaChannel: 4609608 -> GSSSChannel-null(1) - Done SWIFT RETURN CODE NON-ZERO - test 061-cattwo From zhaozhang at uchicago.edu Wed May 20 11:54:29 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 20 May 2009 11:54:29 -0500 Subject: [Swift-devel] Re: could not start coaster on Ranger In-Reply-To: <4A143506.6010809@uchicago.edu> References: <4A143506.6010809@uchicago.edu> Message-ID: <4A1435C5.2080802@uchicago.edu> Also, all gram logs are here at CI network: /home/zzhang/ranger-gram-logs zhao Zhao Zhang wrote: > Hi, > > I am trying to run a sanity test with coaster on ranger. > Here is my sites.xml definition: > [zzhang at communicado sites]$ cat coaster_test/tgranger-sge-gram2.xml > > > > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > > TG-MCB080099N > /work/00946/zzhang/work > key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir > 16 > development > key="coasterWorkerMaxwalltime">00:40:00 > 31 > 50 > 10 > > > > > Then it failed with the following errors. Is there any configuration > error I put there in the sites.xml? or any > other information you need, please let me know. > > zhao > > [zzhang at communicado sites]$ ./run-site > coaster_test/tgranger-sge-gram2.xml > testing site configuration: coaster_test/tgranger-sge-gram2.xml > Removing files from previous runs > Running test 061-cattwo at Wed May 20 11:47:23 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090520-1147-yov15zb6 > Progress: uninitialized:1 > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from > 061-cattwo-20090520-1147-yov15zb6/info/k on tgtacc > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from > 061-cattwo-20090520-1147-yov15zb6/info/m on tgtacc > Progress: Stage in:1 > Progress: Active:1 > Failed to transfer wrapper log from > 061-cattwo-20090520-1147-yov15zb6/info/o on tgtacc > Progress: Failed:1 > Execution failed: > Exception in cat: > Arguments: [061-cattwo.1.in, 061-cattwo.2.in] > Host: tgtacc > Directory: 061-cattwo-20090520-1147-yov15zb6/jobs/o/cat-oa9l83bj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Failed to start worker: > null > null > org.globus.gram.GramException: The job manager detected an invalid > script response > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) > > at org.globus.gram.GramJob.setStatus(GramJob.java:184) > at > org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > at java.lang.Thread.run(Thread.java:619) > > Cleaning up... > Shutting down service at https://129.114.50.163:60748 > Got channel MetaChannel: 4609608 -> GSSSChannel-null(1) > - Done > SWIFT RETURN CODE NON-ZERO - test 061-cattwo > > From zhaozhang at uchicago.edu Wed May 20 11:58:00 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 20 May 2009 11:58:00 -0500 Subject: [Swift-devel] Re: could not start coaster on Ranger In-Reply-To: <4A143506.6010809@uchicago.edu> References: <4A143506.6010809@uchicago.edu> Message-ID: <4A143698.8010805@uchicago.edu> got it, I used the wrong project name. zhao Zhao Zhang wrote: > Hi, > > I am trying to run a sanity test with coaster on ranger. 
> Here is my sites.xml definition: > [zzhang at communicado sites]$ cat coaster_test/tgranger-sge-gram2.xml > > > > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > > TG-MCB080099N > /work/00946/zzhang/work > key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir > 16 > development > key="coasterWorkerMaxwalltime">00:40:00 > 31 > 50 > 10 > > > > > Then it failed with the following errors. Is there any configuration > error I put there in the sites.xml? or any > other information you need, please let me know. > > zhao > > [zzhang at communicado sites]$ ./run-site > coaster_test/tgranger-sge-gram2.xml > testing site configuration: coaster_test/tgranger-sge-gram2.xml > Removing files from previous runs > Running test 061-cattwo at Wed May 20 11:47:23 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090520-1147-yov15zb6 > Progress: uninitialized:1 > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from > 061-cattwo-20090520-1147-yov15zb6/info/k on tgtacc > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from > 061-cattwo-20090520-1147-yov15zb6/info/m on tgtacc > Progress: Stage in:1 > Progress: Active:1 > Failed to transfer wrapper log from > 061-cattwo-20090520-1147-yov15zb6/info/o on tgtacc > Progress: Failed:1 > Execution failed: > Exception in cat: > Arguments: [061-cattwo.1.in, 061-cattwo.2.in] > Host: tgtacc > Directory: 061-cattwo-20090520-1147-yov15zb6/jobs/o/cat-oa9l83bj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Failed to start worker: > null > null > org.globus.gram.GramException: The job manager detected an invalid > script response > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) > > at org.globus.gram.GramJob.setStatus(GramJob.java:184) > at > org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > at java.lang.Thread.run(Thread.java:619) > > Cleaning up... > Shutting down service at https://129.114.50.163:60748 > Got channel MetaChannel: 4609608 -> GSSSChannel-null(1) > - Done > SWIFT RETURN CODE NON-ZERO - test 061-cattwo > > From hategan at mcs.anl.gov Wed May 20 12:03:02 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 20 May 2009 12:03:02 -0500 Subject: [Swift-devel] could not start coaster on Ranger In-Reply-To: <4A143506.6010809@uchicago.edu> References: <4A143506.6010809@uchicago.edu> Message-ID: <1242838982.862.0.camel@localhost> Did you try a simple globusrun job? On Wed, 2009-05-20 at 11:51 -0500, Zhao Zhang wrote: > Hi, > > I am trying to run a sanity test with coaster on ranger. > Here is my sites.xml definition: > [zzhang at communicado sites]$ cat coaster_test/tgranger-sge-gram2.xml > > > > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > > TG-MCB080099N > /work/00946/zzhang/work > key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir > 16 > development > key="coasterWorkerMaxwalltime">00:40:00 > 31 > 50 > 10 > > > > > Then it failed with the following errors. Is there any configuration > error I put there in the sites.xml? or any > other information you need, please let me know. 
> > zhao > > [zzhang at communicado sites]$ ./run-site coaster_test/tgranger-sge-gram2.xml > testing site configuration: coaster_test/tgranger-sge-gram2.xml > Removing files from previous runs > Running test 061-cattwo at Wed May 20 11:47:23 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090520-1147-yov15zb6 > Progress: uninitialized:1 > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from > 061-cattwo-20090520-1147-yov15zb6/info/k on tgtacc > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from > 061-cattwo-20090520-1147-yov15zb6/info/m on tgtacc > Progress: Stage in:1 > Progress: Active:1 > Failed to transfer wrapper log from > 061-cattwo-20090520-1147-yov15zb6/info/o on tgtacc > Progress: Failed:1 > Execution failed: > Exception in cat: > Arguments: [061-cattwo.1.in, 061-cattwo.2.in] > Host: tgtacc > Directory: 061-cattwo-20090520-1147-yov15zb6/jobs/o/cat-oa9l83bj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Failed to start worker: > null > null > org.globus.gram.GramException: The job manager detected an invalid > script response > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) > at org.globus.gram.GramJob.setStatus(GramJob.java:184) > at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > at java.lang.Thread.run(Thread.java:619) > > Cleaning up... > Shutting down service at https://129.114.50.163:60748 > Got channel MetaChannel: 4609608 -> GSSSChannel-null(1) > - Done > SWIFT RETURN CODE NON-ZERO - test 061-cattwo > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Wed May 20 12:40:02 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 20 May 2009 12:40:02 -0500 Subject: [Swift-devel] How to determine TG project code validity? Message-ID: <4A144072.1070801@mcs.anl.gov> Zhao, please post this as a swift enhancement req if its *not* already in bugzilla: -- One of the problems we run into using Swift on TeraGrid is that when the project code is invalid, its hard to determine this from pre-WS-GRAM results and present a nice message to the user. The error is seen on the client side as a GRAM response, which (from CoG) says: org.globus.gram.GramException: The job manager detected an invalid script response I think this can occur from: - invalid project - invalid time request for a specific queue - other similar things Swift should try to diagnose this, but if thats not possible, should print a message with the known causes; or we should have an error message document in which the user can lookup the message and find the known causes. From wilde at mcs.anl.gov Wed May 20 12:59:39 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 20 May 2009 12:59:39 -0500 Subject: [Swift-devel] Heads-up on new Swift user - running scip for genomics Message-ID: <4A14450B.6010304@mcs.anl.gov> We are starting to work with Chris Henry, a postdoc (I think) of Rick Stevens who is running "scip" integer programming jobs for genomics studies. His jobs will run (eg on Ranger) for 1-4 hours, and he needs about 1000 jobs per study. I suspect the times will cluster under 2 hours, and he sets 4 hours as the max. The jobs each have 1 input and 1 output file, O(1MB) each. 
We'll be trying this on Ranger and then other TG systems as we get the code more widely compiled. It's running now on Ranger. SiCortex and BGP will also be of interest, as will Kraken.

I've asked Zhao to adopt this as a user test case; more on that approach later. (Basically, for steady tests that consume a lot of resources, I'm asking Zhao to explore ways to run tests from lists of scientifically useful results. That's in addition to basic sanity tests of simple cases that are easier to troubleshoot.)

- Mike

From wilde at mcs.anl.gov Thu May 21 13:14:34 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 21 May 2009 13:14:34 -0500
Subject: [Swift-devel] RE: [Swift-user] Execution error]]
In-Reply-To: <4A1591E7.1060601@uchicago.edu> References: <4A142677.6090705@mcs.anl.gov> <4A1591E7.1060601@uchicago.edu> Message-ID: <4A159A0A.8050406@mcs.anl.gov>

On 5/21/09 12:39 PM, Zhao Zhang wrote:
> Hi, Mike
>
> I did repeat this bug on ranger with 100 jobs. The log is at
> /home/zzhang/scip/scip_100_fail_log
>
> Then I set
> <profile namespace="globus" key="coasterMaxJobs">5</profile>
> The same workflow ran perfectly.

OK, that's good to hear. I thought the limit on Ranger was 50, though. Can you repeat the test at coasterMaxJobs 50 (or whatever the documented limit is)?

> I ran the scip workflow with 16, 64, 128, 256, 512, 1024 jobs.
> One thing I'm concerned about is that there might be noise in the tests;
> the end-to-end time may not be the real running time
> for the workflow.

I don't understand what you mean? That the numbers below are too fast to be believable?

- Mike

> Here is the summary for running time:
> 16    173 seconds
> 64    132 seconds
> 128   166 seconds
> 256   202 seconds
> 512   221 seconds
> 1024  400 seconds
>
> The logs are at
> /home/zzhang/scip/scip_16_log
> /home/zzhang/scip/scip_64_log
> /home/zzhang/scip/scip_128_log
> /home/zzhang/scip/scip_256_log
> /home/zzhang/scip/scip_512_log
> /home/zzhang/scip/scip_1024_log
>
> zhao
>
> Michael Wilde wrote:
>> Zhao, here is the latest Coaster bug fix I am aware of, to test.
>>
>> This was a showstopper bug for Glen, as once he exceeded the job
>> limit, it *seemed* as I understand that Ranger killed all his jobs.
>> >> On Thu, 2009-04-30 at 18:54 -0500, Mihael Hategan wrote: >>> Mystery solved: >>> >>> Thu Apr 30 18:19:13 2009 JM_SCRIPT: ERROR: job submission failed: >>> Thu Apr 30 18:19:13 2009 JM_SCRIPT: >>> ------------------------------------------------------------------------ >>> Welcome to TACC's Ranger System, an NSF TeraGrid Resource >>> ------------------------------------------------------------------------ >>> >>> --> Submitting 16 tasks... >>> --> Submitting 16 tasks/host... >>> --> Submitting exclusive job to 1 hosts... >>> --> Verifying HOME file-system availability... >>> --> Verifying WORK file-system availability... >>> --> Verifying SCRATCH file-system availability... >>> --> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits... >>> --> Requesting valid memory configuration (mt=31.3G)... >>> --> Checking ssh keys... >>> --> Checking file existence and permissions for passwordless ssh... >>> --> Verifying accounting... >>> ---------------------------------------------------------------- >>> ERROR: You have exceeded the max submitted job count. >>> Maximum allowed is 50 jobs. >>> >>> Please contact TACC Consulting if you believe you have >>> received this message in error. >>> ---------------------------------------------------------------- >>> Job aborted by esub. >>> >>> I'll add a limit for the number of jobs allowed to the current coaster >>> code. >>> >>> >>> On Thu, 2009-04-30 at 16:13 -0500, Glen Hocky wrote: >>> > I have the identical response on ranger. It started yesterday >>> evening. > Possibly a problem that the TACC folks need to fix? >>> > > Glen >>> > > Yue, Chen - BMD wrote: >>> > > Hi Michael, >>> > > > > Thank you for the advices. I tested ranger with 1 job and >>> new > > specifications of maxwalltime. It shows the following error >>> message. I > > don't know if there is other problem with my setup. >>> Thank you! 
>>> > > > > ///////////////////////////////////////////////// >>> > > [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift >>> -sites.file > > sites.xml -tc.file tc.data >>> > > Swift 0.9rc2 swift-r2860 cog-r2388 >>> > > RunID: 20090430-1559-2vi6x811 >>> > > Progress: >>> > > Progress: Stage in:1 >>> > > Progress: Submitting:1 >>> > > Progress: Submitting:1 >>> > > Progress: Submitted:1 >>> > > Progress: Active:1 >>> > > Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1559-2vi6x811/info/i on ranger >>> > > Progress: Active:1 >>> > > Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1559-2vi6x811/info/k on ranger >>> > > Progress: Stage in:1 >>> > > Progress: Active:1 >>> > > Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1559-2vi6x811/info/m on ranger >>> > > Progress: Failed:1 >>> > > Execution failed: >>> > > Exception in PTMap2: >>> > > Arguments: [e04.mzXML, ./seqs-ecolik12/fasta01, inputs-unmod.txt, >>> > > parameters.txt] >>> > > Host: ranger >>> > > Directory: >>> PTMap2-unmod-20090430-1559-2vi6x811/jobs/m/PTMap2-mbe6m5aj >>> > > stderr.txt: >>> > > stdout.txt: >>> > > ---- >>> > > Caused by: >>> > > Failed to start worker: >>> > > null >>> > > null >>> > > org.globus.gram.GramException: The job manager detected an >>> invalid > > script response >>> > > at > > >>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) >>> >>> > > at org.globus.gram.GramJob.setStatus(GramJob.java:184) >>> > > at > > >>> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) >>> > > at java.lang.Thread.run(Thread.java:619) >>> > > Cleaning up... >>> > > Shutting down service at https://129.114.50.163:45562 > > >>> >>> > > Got channel MetaChannel: 20903429 -> GSSSChannel-null(1) >>> > > - Done >>> > > [yuechen at communicado PTMap2]$ >>> > > /////////////////////////////////////////////////////////// >>> > > > > Chen, Yue >>> > > > > >>> > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >>> > > *Sent:* Thu 4/30/2009 3:02 PM >>> > > *To:* Yue, Chen - BMD; swift-devel >>> > > *Subject:* Re: [Swift-user] Execution error >>> > > >>> > > Back on list here (I only went off-list to discuss accounts, etc) >>> > > >>> > > The problem in the run below is this: >>> > > >>> > > 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 >>> APPLICATION_EXCEPTION >>> > > jobid=PTMap2-abeii5aj - Application exception: Job cannot be run >>> with >>> > > the given max walltime worker constraint (task: 3000, \ >>> > > maxwalltime: 2400s) >>> > > >>> > > You have this on the ptmap app in your tc.data: >>> > > >>> > > globus::maxwalltime=50 >>> > > >>> > > But you only gave coasters 40 mins per coaster worker. So its >>> > > complaining that it cant run a 50 minute job in a 40 minute (max) >>> > > coaster worker. ;) >>> > > >>> > > I mentioned in a prior mail that you need to set the two time >>> vals in >>> > > your sites.xml entry; thats what you need to do next, now. >>> > > >>> > > change the coaster time in your sites.xml to: >>> > > key="coasterWorkerMaxwalltime">00:51:00 >>> > > >>> > > If you have more info on the variability of your ptmap run times, >>> send >>> > > that to the list, and we can discuss how to handle. >>> > > >>> > > >>> > > (NOTE: doing grp -i of the log for "except" or scanning for "except" >>> > > with an editor will often locate the first "exception" that your job >>> > > encountered. Thats how I found the error above). 
>>> > > >>> > > Also, Yue, for testing new sites, or for validating that old >>> sites still >>> > > work, you should create the smallest possible ptmap workflow - 1 >>> job if >>> > > that is possible - and verify that this works. Then say 10 jobs >>> to make >>> > > sure scheduling etc is sane. Then, send in your huge jobs. >>> > > >>> > > With only 1 job, its easier to spot the errors in the log file. >>> > > >>> > > - Mike >>> > > >>> > > >>> > > On 4/30/09 2:34 PM, Yue, Chen - BMD wrote: >>> > > > Hi Michael, >>> > > > > > > I run into the same messages again when I use Ranger: >>> > > > > > > Progress: Selecting site:146 Stage in:25 >>> Submitting:15 Submitted:821 >>> > > > Failed but can retry:16 >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger >>> > > > Progress: Selecting site:146 Stage in:3 Submitting:1 >>> Submitted:857 >>> > > > Failed but can retry:16 >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger >>> > > > Failed to transfer wrapper log from >>> > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger >>> > > > The log for the search is at : > > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log >>> > > > > > > The sites.xml I have is: >>> > > > > > > >>> > > > >> > > > url="gatekeeper.ranger.tacc.teragrid.org" >>> > > > jobManager="gt2:gt2:SGE"/> >>> > > > >> url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" /> >>> > > > >> > > > >>> key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir >>> > > > >> key="project">TG-CCR080022N >>> > > > >> key="coastersPerNode">16 >>> > > > development >>> > > > >> > > > key="coasterWorkerMaxwalltime">00:40:00 >>> > > > 31 >>> > > > 50 >>> > > > 10 >>> > > > /work/01164/yuechen/swiftwork >>> > > > >>> > > > The tc.data I have is: >>> > > > > > > ranger PTMap2 > > > >>> /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED > > > >>> INTEL32::LINUX globus::maxwalltime=50 >>> > > > >>> > > > I'm using swift 0.9 rc2 >>> > > > >>> > > > Thank you very much for help! 
>>> > > > >>> > > > Chen, Yue >>> > > > >>> > > > > > > >>> > > > >>> ------------------------------------------------------------------------ >>> > > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >>> > > > *Sent:* Thu 4/30/2009 2:05 PM >>> > > > *To:* Yue, Chen - BMD >>> > > > *Subject:* Re: [Swift-user] Execution error >>> > > > >>> > > > >>> > > > >>> > > > On 4/30/09 1:51 PM, Yue, Chen - BMD wrote: >>> > > > > Hi Michael, >>> > > > > >>> > > > > When I tried to activate my account, I encountered the >>> following > > error: >>> > > > > >>> > > > > "Sorry, this account is in an invalid state. You may not >>> activate > > your >>> > > > > at this time." >>> > > > > >>> > > > > I used the username and password from TG-CDA070002T. Should >>> I use a >>> > > > > different password? >>> > > > >>> > > > If you can already login to Ranger, then you are all set - you >>> must have >>> > > > done this previously. >>> > > > >>> > > > I thought you had *not*, because when I looked up your login on >>> ranger >>> > > > ("finger yuechen") it said "never logged in". But seems like >>> that info >>> > > > is incorrect. >>> > > > >>> > > > If you have ptmap compiled, seems like you are almost all set. >>> > > > >>> > > > Let me know if it works. >>> > > > >>> > > > - Mike >>> > > > >>> > > > > Thanks! >>> > > > > >>> > > > > Chen, Yue >>> > > > > >>> > > > > >>> > > > > > > >>> ------------------------------------------------------------------------ >>> > > > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >>> > > > > *Sent:* Thu 4/30/2009 1:07 PM >>> > > > > *To:* Yue, Chen - BMD >>> > > > > *Cc:* swift user >>> > > > > *Subject:* Re: [Swift-user] Execution error >>> > > > > >>> > > > > Yue, use this XML pool element to access ranger: >>> > > > > >>> > > > > >>> > > > > >> > > > > url="gatekeeper.ranger.tacc.teragrid.org" >>> > > > > jobManager="gt2:gt2:SGE"/> >>> > > > > > >>> url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" /> >>> > > > > >> > > > > >>> key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir >>> > > > > > >>> key="project">TG-CCR080022N >>> > > > > >> key="coastersPerNode">16 >>> > > > > >> key="queue">development >>> > > > > >> > > > > key="coasterWorkerMaxwalltime">00:40:00 >>> > > > > 31 >>> > > > > >> key="initialScore">50 >>> > > > > >> key="jobThrottle">10 >>> > > > > >>> /work/00306/tg455797/swiftwork >>> > > > > >>> > > > > >>> > > > > >>> > > > > You will need to also do these steps: >>> > > > > >>> > > > > Go to this web page to enable your Ranger account: >>> > > > > >>> > > > > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx >>> > > > > >>> > > > > Then login to Ranger via the TeraGrid portal and put your >>> ssh keys in >>> > > > > place (assuming you use ssh keys, which you should) >>> > > > > >>> > > > > While on Ranger, do this: >>> > > > > >>> > > > > echo $WORK >>> > > > > mkdir $work/swiftwork >>> > > > > >>> > > > > and put the full path of your $WORK/swiftwork directory in the >>> > > > > element above. (My login is tg455etc, yours >>> is > > yuechen) >>> > > > > >>> > > > > Then scp your code to Ranger and compile it. >>> > > > > >>> > > > > Then create a tc.data entry for your ptmap app >>> > > > > >>> > > > > Next, set your time values in the sites.xml entry above to >>> suitable >>> > > > > values for Ranger. You'll need to measure times, but I think >>> you will >>> > > > > find Ranger about twice as fast as Mercury for CPU-bound jobs. >>> > > > > >>> > > > > The values above were set for one app job per coaster. 
>>> > > > > I think you can probably do more.
>>> > > > >
>>> > > > > If you estimate a run time of 5 minutes, use:
>>> > > > >
>>> > > > > <profile namespace="globus" key="coasterWorkerMaxwalltime">00:30:00</profile>
>>> > > > > <profile namespace="globus" key="maxwalltime">5</profile>
>>> > > > >
>>> > > > > Other people on the list - please sanity check what I suggest here.
>>> > > > >
>>> > > > > - Mike
>>> > > > >
>>> > > > > On 4/30/09 12:40 PM, Michael Wilde wrote:
>>> > > > > > I just checked - TG-CDA070002T has indeed expired.
>>> > > > > >
>>> > > > > > The best for now is to move to use (only) Ranger, under this account:
>>> > > > > > TG-CCR080022N
>>> > > > > >
>>> > > > > > I will locate and send you a sites.xml entry in a moment.
>>> > > > > >
>>> > > > > > You need to go to a web page to activate your Ranger login.
>>> > > > > >
>>> > > > > > Best to contact me in IM and we can work this out.
>>> > > > > >
>>> > > > > > - Mike
>>> > > > > >
>>> > > > > > On 4/30/09 12:23 PM, Michael Wilde wrote:
>>> > > > > >> Also, what account are you running under? We may need to change you to
>>> > > > > >> a new account - as the OSG Training account expires today.
>>> > > > > >> If that happened at Noon, it *might* be the problem.
>>> > > > > >>
>>> > > > > >> - Mike
>>> > > > > >>
>>> > > > > >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote:
>>> > > > > >>> Hi,
>>> > > > > >>>
>>> > > > > >>> I came back to re-run my application on NCSA Mercury, which was tested
>>> > > > > >>> successfully last week after I just set up coasters with swift 0.9,
>>> > > > > >>> but I got many messages like the following:
>>> > > > > >>>
>>> > > > > >>> Progress: Stage in:219  Submitting:803  Submitted:1
>>> > > > > >>> Progress: Stage in:129  Submitting:703  Submitted:190  Failed but can retry:1
>>> > > > > >>> Progress: Stage in:38  Submitting:425  Submitted:556  Failed but can retry:4
>>> > > > > >>> Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY
>>> > > > > >>> Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY
>>> > > > > >>> Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY
>>> > > > > >>> Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY
>>> > > > > >>> Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY
>>> > > > > >>> Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY
>>> > > > > >>> Progress: Stage in:1  Submitted:1013  Active:1  Failed but can retry:8
>>> > > > > >>> Progress: Submitted:1011  Active:1  Failed but can retry:11
>>> > > > > >>>
>>> > > > > >>> The log file for the successful run last week is:
>>> > > > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log
>>> > > > > >>>
>>> > > > > >>> The log file for the failed run is:
>>> > > > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log
>>> > > > > >>>
>>> > > > > >>> I don't think I did anything different, so I don't know why this time
>>> > > > > >>> they failed.
>>> > > > > >>> The sites.xml for Mercury is:
>>> > > > > >>>
>>> > > > > >>> <pool handle="NCSA_MERCURY">
>>> > > > > >>>   <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>>> > > > > >>>   <execution provider="coaster" url="grid-hg.ncsa.teragrid.org"
>>> > > > > >>>              jobManager="gt2:PBS"/>
>>> > > > > >>>   <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
>>> > > > > >>>   <profile namespace="globus" key="queue">debug</profile>
>>> > > > > >>> </pool>
>>> > > > > >>>
>>> > > > > >>> Thank you for help!
>>> > > > > >>>
>>> > > > > >>> Chen, Yue
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From zhaozhang at uchicago.edu  Thu May 21 13:18:12 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 21 May 2009 13:18:12 -0500
Subject: [Swift-devel] Re: [Swift-user] Execution error]]
In-Reply-To: <4A159A0A.8050406@mcs.anl.gov>
References: <4A142677.6090705@mcs.anl.gov> <4A1591E7.1060601@uchicago.edu> <4A159A0A.8050406@mcs.anl.gov>
Message-ID: <4A159AE4.9060705@uchicago.edu>

Michael Wilde wrote:
>
> On 5/21/09 12:39 PM, Zhao Zhang wrote:
>> Hi, Mike
>>
>> I did repeat this bug on ranger with 100 jobs. The log is at
>> /home/zzhang/scip/scip_100_fail_log
>>
>> Then I set
>> <profile namespace="globus" key="coasterMaxJobs">5</profile>
>> The same workflow ran perfectly.
>
> OK, that's good to hear.
> I thought the limit on Ranger was 50, though.
> Can you repeat the test at coasterMaxJobs 50 (or whatever the
> documented limit is)?

yes, the limit is 50, I am trying it now.

>
>> I ran the scip workflow with 16, 64, 128, 256, 512, 1024 jobs.
>> One thing I'm concerned about is that there might be noise in the
>> tests; the end-to-end time may not be the real running time for the
>> workflow.
>
> I don't understand what you mean? That the numbers below are too fast
> to be believable?

No. What I mean is that if we expect a linear increase in running time
from 512 jobs to 1024 jobs, then because of noise in the system we
can't assume the 1024-job run will take 200 seconds just because 512
jobs took 100 seconds.

zhao

>
> - Mike
>
>> Here is the summary for running time:
>>   16   173 seconds
>>   64   132 seconds
>>  128   166 seconds
>>  256   202 seconds
>>  512   221 seconds
>> 1024   400 seconds
>>
>> The logs are at
>> /home/zzhang/scip/scip_16_log
>> /home/zzhang/scip/scip_64_log
>> /home/zzhang/scip/scip_128_log
>> /home/zzhang/scip/scip_256_log
>> /home/zzhang/scip/scip_512_log
>> /home/zzhang/scip/scip_1024_log
>>
>> zhao
>>
>> Michael Wilde wrote:
>>> Zhao, here is the latest Coaster bug fix I am aware of, to test.
>>>
>>> This was a showstopper bug for Glen, as once he exceeded the job
>>> limit, it *seemed*, as I understand it, that Ranger killed all his jobs.
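>>>
>>> For reference, the fix is driven by the globus coasterMaxJobs profile.
>>> As a sketch (untested by me), I believe it sits in the ranger pool
>>> element next to the other coaster settings, with the value matching
>>> the site's documented cap:
>>>
>>>   <profile namespace="globus" key="coasterMaxJobs">50</profile>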
>>>
>>> - Mike
>>>
>>>
>>> -------- Original Message --------
>>> Subject: [Fwd: Re: [Swift-devel] RE: [Swift-user] Execution error]
>>> Date: Tue, 12 May 2009 15:38:52 -0500
>>> From: Michael Wilde
>>> To: Glen Hocky , Mihael Hategan
>>>
>>> Glen, the bug you were asking about yesterday was fixed on May 1 -
>>> here's the email from Mihael.
>>>
>>> Did you try this fix, or were you unaware of this particular message in
>>> the thread on that problem?
>>>
>>> - Mike
>>>
>>>
>>> -------- Original Message --------
>>> Subject: Re: [Swift-devel] RE: [Swift-user] Execution error
>>> Date: Fri, 01 May 2009 15:22:02 -0500
>>> From: Mihael Hategan
>>> To: Glen Hocky
>>> CC: swift-devel , "Yue, Chen - BMD"
>>> References: <49F9DE8F.1070404 at mcs.anl.gov> <49F9E298.8030801 at mcs.anl.gov>
>>>  <49F9E8FB.9020500 at mcs.anl.gov> <49F9F680.6040503 at mcs.anl.gov>
>>>  <49FA03EA.7080807 at mcs.anl.gov> <49FA147E.6070205 at uchicago.edu>
>>>  <1241135666.3603.1.camel at localhost>
>>>
>>> Fix in cog 2394.
>>>
>>> Use globus:coasterMaxJobs profile.
>>>
>>> On Thu, 2009-04-30 at 18:54 -0500, Mihael Hategan wrote:
>>>> Mystery solved:
>>>>
>>>> Thu Apr 30 18:19:13 2009 JM_SCRIPT: ERROR: job submission failed:
>>>> Thu Apr 30 18:19:13 2009 JM_SCRIPT:
>>>> ------------------------------------------------------------------------
>>>> Welcome to TACC's Ranger System, an NSF TeraGrid Resource
>>>> ------------------------------------------------------------------------
>>>>
>>>> --> Submitting 16 tasks...
>>>> --> Submitting 16 tasks/host...
>>>> --> Submitting exclusive job to 1 hosts...
>>>> --> Verifying HOME file-system availability...
>>>> --> Verifying WORK file-system availability...
>>>> --> Verifying SCRATCH file-system availability...
>>>> --> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
>>>> --> Requesting valid memory configuration (mt=31.3G)...
>>>> --> Checking ssh keys...
>>>> --> Checking file existence and permissions for passwordless ssh...
>>>> --> Verifying accounting...
>>>> ----------------------------------------------------------------
>>>> ERROR: You have exceeded the max submitted job count.
>>>> Maximum allowed is 50 jobs.
>>>>
>>>> Please contact TACC Consulting if you believe you have
>>>> received this message in error.
>>>> ----------------------------------------------------------------
>>>> Job aborted by esub.
>>>>
>>>> I'll add a limit for the number of jobs allowed to the current coaster
>>>> code.
>>>>
>>>> On Thu, 2009-04-30 at 16:13 -0500, Glen Hocky wrote:
>>>> > I have the identical response on ranger. It started yesterday evening.
>>>> > Possibly a problem that the TACC folks need to fix?
>>>> >
>>>> > Glen
>>>> >
>>>> > Yue, Chen - BMD wrote:
>>>> > > Hi Michael,
>>>> > >
>>>> > > Thank you for the advice. I tested ranger with 1 job and new
>>>> > > specifications of maxwalltime. It shows the following error message.
>>>> > > I don't know if there is another problem with my setup. Thank you!
>>>> > >
>>>> > > /////////////////////////////////////////////////
>>>> > > [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file
>>>> > > sites.xml -tc.file tc.data
>>>> > > Swift 0.9rc2 swift-r2860 cog-r2388
>>>> > > RunID: 20090430-1559-2vi6x811
>>>> > > Progress:
>>>> > > Progress: Stage in:1
>>>> > > Progress: Submitting:1
>>>> > > Progress: Submitting:1
>>>> > > Progress: Submitted:1
>>>> > > Progress: Active:1
>>>> > > Failed to transfer wrapper log from PTMap2-unmod-20090430-1559-2vi6x811/info/i on ranger
>>>> > > Progress: Active:1
>>>> > > Failed to transfer wrapper log from PTMap2-unmod-20090430-1559-2vi6x811/info/k on ranger
>>>> > > Progress: Stage in:1
>>>> > > Progress: Active:1
>>>> > > Failed to transfer wrapper log from PTMap2-unmod-20090430-1559-2vi6x811/info/m on ranger
>>>> > > Progress: Failed:1
>>>> > > Execution failed:
>>>> > > Exception in PTMap2:
>>>> > > Arguments: [e04.mzXML, ./seqs-ecolik12/fasta01, inputs-unmod.txt, parameters.txt]
>>>> > > Host: ranger
>>>> > > Directory: PTMap2-unmod-20090430-1559-2vi6x811/jobs/m/PTMap2-mbe6m5aj
>>>> > > stderr.txt:
>>>> > > stdout.txt:
>>>> > > ----
>>>> > > Caused by:
>>>> > > Failed to start worker:
>>>> > > null
>>>> > > null
>>>> > > org.globus.gram.GramException: The job manager detected an invalid script response
>>>> > >   at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530)
>>>> > >   at org.globus.gram.GramJob.setStatus(GramJob.java:184)
>>>> > >   at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
>>>> > >   at java.lang.Thread.run(Thread.java:619)
>>>> > > Cleaning up...
>>>> > > Shutting down service at https://129.114.50.163:45562
>>>> > > Got channel MetaChannel: 20903429 -> GSSSChannel-null(1)
>>>> > > - Done
>>>> > > [yuechen at communicado PTMap2]$
>>>> > > ///////////////////////////////////////////////////////////
>>>> > >
>>>> > > Chen, Yue
>>>> > >
>>>> > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov]
>>>> > > *Sent:* Thu 4/30/2009 3:02 PM
>>>> > > *To:* Yue, Chen - BMD; swift-devel
>>>> > > *Subject:* Re: [Swift-user] Execution error
>>>> > >
>>>> > > Back on list here (I only went off-list to discuss accounts, etc)
>>>> > >
>>>> > > The problem in the run below is this:
>>>> > >
>>>> > > 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
>>>> > > jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with
>>>> > > the given max walltime worker constraint (task: 3000, \
>>>> > > maxwalltime: 2400s)
>>>> > >
>>>> > > You have this on the ptmap app in your tc.data:
>>>> > >
>>>> > > globus::maxwalltime=50
>>>> > >
>>>> > > But you only gave coasters 40 mins per coaster worker. So it's
>>>> > > complaining that it can't run a 50 minute job in a 40 minute (max)
>>>> > > coaster worker. ;)
>>>> > >
>>>> > > I mentioned in a prior mail that you need to set the two time vals in
>>>> > > your sites.xml entry; that's what you need to do next, now.
>>>> > >
>>>> > > Change the coaster time in your sites.xml to:
>>>> > >
>>>> > > <profile namespace="globus" key="coasterWorkerMaxwalltime">00:51:00</profile>
>>>> > >
>>>> > > If you have more info on the variability of your ptmap run times, send
>>>> > > that to the list, and we can discuss how to handle it.
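>>>> > >
>>>> > > (To spell out the arithmetic in that exception: maxwalltime=50 is a
>>>> > > 50-minute = 3000 second task, and a 00:40:00 worker is 2400 seconds -
>>>> > > hence "task: 3000, maxwalltime: 2400s". As a rule-of-thumb sketch,
>>>> > > keep the worker time above your longest tc.data maxwalltime:
>>>> > >
>>>> > >   <!-- tc.data: ... globus::maxwalltime=50   (50-minute app jobs) -->
>>>> > >   <profile namespace="globus" key="coasterWorkerMaxwalltime">00:51:00</profile>
>>>> > >
>>>> > > so each worker can host at least one full-length job.)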
>>>> > >
>>>> > > (NOTE: doing grep -i of the log for "except" or scanning for "except"
>>>> > > with an editor will often locate the first "exception" that your job
>>>> > > encountered. That's how I found the error above).
>>>> > >
>>>> > > [...]
From zhaozhang at uchicago.edu  Thu May 21 13:28:34 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 21 May 2009 13:28:34 -0500
Subject: [Swift-devel] Re: [Swift-user] Execution error]]
In-Reply-To: <4A159AE4.9060705@uchicago.edu>
References: <4A142677.6090705@mcs.anl.gov> <4A1591E7.1060601@uchicago.edu> <4A159A0A.8050406@mcs.anl.gov> <4A159AE4.9060705@uchicago.edu>
Message-ID: <4A159D52.9000701@uchicago.edu>

Hi, Mike

Zhao Zhang wrote:
>
> Michael Wilde wrote:
>>
>> On 5/21/09 12:39 PM, Zhao Zhang wrote:
>>> Hi, Mike
>>>
>>> I did repeat this bug on ranger with 100 jobs. The log is at
>>> /home/zzhang/scip/scip_100_fail_log
>>>
>>> Then I set
>>> <profile namespace="globus" key="coasterMaxJobs">5</profile>
>>> The same workflow ran perfectly.
>>
>> OK, that's good to hear.
>> I thought the limit on Ranger was 50, though.
>> Can you repeat the test at coasterMaxJobs 50 (or whatever the
>> documented limit is)?
> yes, the limit is 50, I am trying it now.

I ran the 1024 job workflow again with limit 50; again it took 393
seconds to finish. No big difference from the "5" case.

zhao

> [...]
From wilde at mcs.anl.gov  Thu May 21 13:38:59 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 21 May 2009 13:38:59 -0500
Subject: [Swift-devel] Logs of Swift execution for NSF report
Message-ID: <4A159FC3.1000309@mcs.anl.gov>

I'm targeting a report to the NSF on Swift progress for July 1 - earlier
if possible. This report will be critical for seeking an extension of
Swift funding.

Can you all make sure that any users you've worked with have either
pushed their swift logs to the Swift log repository (when GPFS is back
online) or collected them?

Ben will be doing the stats breakdowns and plots of overall usage.

Sarah, Glen, can you gather or point out the location of all logs from
CNARI and OOPS?

Ben, do you have logs from the UMD work on surgical planning?

- Mike

From benc at hawaga.org.uk  Thu May 21 13:41:10 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 21 May 2009 18:41:10 +0000 (GMT)
Subject: [Swift-devel] Logs of Swift execution for NSF report
In-Reply-To: <4A159FC3.1000309@mcs.anl.gov>
References: <4A159FC3.1000309@mcs.anl.gov>
Message-ID: 

On Thu, 21 May 2009, Michael Wilde wrote:

> Ben, do you have logs from the UMD work on surgical planning?

From Andriy? Maybe. I talked about it with him at one time, but I am not
sure if I actually got them.

--

From wilde at mcs.anl.gov  Thu May 21 18:50:50 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 21 May 2009 18:50:50 -0500
Subject: [Swift-devel] Usage stats more urgent now
Message-ID: <4A15E8DA.4080202@mcs.anl.gov>

Hi All,

It looks like we may need to submit an extension proposal for Swift to
NSF as early as June 1.

Please continue to round up usage logs and post their location to
swift-devel. Ben will collate.

Ian will be organizing the NSF proposal while I complete the NIH one.

- Mike

From hategan at mcs.anl.gov  Sun May 24 15:57:48 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 24 May 2009 15:57:48 -0500
Subject: [Swift-devel] block allocation
Message-ID: <1243198668.30428.12.camel@localhost>

I committed the block allocation code I had. Here's a quick status:

- I tested it locally, alone, and with swift.
- Not tested on a cluster (my jobs just sat in the queue on TP and TGUC,
  so I got bored and killed them).
- I updated the documentation on the profile entries, but there are
  permission problems when I try to deploy it.
- I don't think jobs are cleaned properly if an early abort is done, so
  be sure to check the queue and manually clean up if needed.
- Exercise caution in general. It's new stuff, and things may go bad. So please keep an eye on the resources used and report misbehavior.
- There is no mechanism to specify a static allocation. This is due to me giving it a low priority for this particular iteration of the code, and not because I hate globus.
Mihael
PS: On the topic of static allocations, in people's experience, how does one specify reservations on various resources?
From benc at hawaga.org.uk Mon May 25 02:16:19 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 25 May 2009 07:16:19 +0000 (GMT) Subject: [Swift-devel] block allocation In-Reply-To: <1243198668.30428.12.camel@localhost> References: <1243198668.30428.12.camel@localhost> Message-ID:
On Sun, 24 May 2009, Mihael Hategan wrote: > PS: On the topic of static allocations, in people's experience, how does > one specify reservations on various resources?
I've only done that by sending email to help at teragrid.org, which is a relatively high-latency and expensive operation. --
From iraicu at cs.uchicago.edu Mon May 25 07:57:25 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 25 May 2009 05:57:25 -0700 Subject: [Swift-devel] block allocation In-Reply-To: References: <1243198668.30428.12.camel@localhost> Message-ID: <4A1A95B5.3080402@cs.uchicago.edu>
If you want a reservation, you write an email describing the requirements and deadlines, and the reply comes back with a schedule: times, number of processors, a time allocation, and a queue name. It is then up to the respective individual to submit jobs to the right queue, which will place the jobs in a hold state if it's prior to the reservation start time, and will release them only when the reservation time is reached. Job walltime needs to be set within the bounds of the reservation: if the reservation is for 60 minutes, a job walltime cannot be 70 minutes, and halfway through a reservation with 30 minutes left, one cannot submit a 40-minute job, even though a 40-minute job would have been accepted earlier in the reservation.
Now, many people using Falkon and block allocation on the BG/P run without any reservations. They simply submit their jobs to the appropriate queue (not specific to the user, but specific to the job requirements), and once the jobs go through, the blocks get instantiated, and the user uses the block allocation until it runs out and de-allocates itself. At that point, if the workload was not complete, the user would manually invoke the block allocation again and let the workload continue where it left off earlier. And the process would continue as long as there was work.
Ioan
Ben Clifford wrote: > On Sun, 24 May 2009, Mihael Hategan wrote: > >> PS: On the topic of static allocations, in people's experience, how does >> one specify reservations on various resources? >> > > I've only done that by sending email to help at teragrid.org, which is a > relatively high latency and expensive operation. > > --
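To make the queue mechanics concrete, here is a minimal sketch of what a submission into such a reservation queue looks like with Cobalt on the BG/P. The queue name comes from the example above; the node count, walltime, and application path are illustrative assumptions, not values from this thread:

    # Submit into the admin-created reservation queue (hypothetical values;
    # only users attached to the reservation may use the queue):
    qsub -q R.Falkon-8K -n 64 -t 50 /path/to/my_app
    # -q names the reservation queue, -n the node count, -t the walltime in
    # minutes. The job holds until the reservation window opens, and -t must
    # fit within the time remaining in the window, per the constraints above.

The no-reservation case described above follows the same pattern: submit to the ordinary queue matching the job requirements and let the blocks instantiate once the jobs go through.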
From benc at hawaga.org.uk Mon May 25 08:30:17 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 25 May 2009 13:30:17 +0000 (GMT) Subject: [Swift-devel] block allocation In-Reply-To: <1243198668.30428.12.camel@localhost> References: <1243198668.30428.12.camel@localhost> Message-ID:
On Sun, 24 May 2009, Mihael Hategan wrote: > - I updated the documentation on the profile entries, but there are > permission problems when I try to deploy it.
hmm. The docs should be updatable by anyone in the vdl2-svn unix group, which you are in.
Please can you make a trivial change and then send me the error message? --
From bugzilla-daemon at mcs.anl.gov Mon May 25 09:06:01 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 25 May 2009 09:06:01 -0500 (CDT) Subject: [Swift-devel] [Bug 208] SWIFT_JOBDIR_PATH env not passed on remote sites (pbs and deef) In-Reply-To: References: Message-ID: <20090525140601.4432F2CD36@wind.mcs.anl.gov>
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=208
--- Comment #1 from Ben Clifford 2009-05-25 09:06:00 --- The test case tests/misc/path-prefix.sh will detect this if run against appropriate sites. However, misc tests are not run against non-local site definitions regularly.
From hategan at mcs.anl.gov Mon May 25 15:44:39 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 25 May 2009 15:44:39 -0500 Subject: [Swift-devel] block allocation In-Reply-To: References: <1243198668.30428.12.camel@localhost> Message-ID: <1243284283.3264.1.camel@localhost>
On Mon, 2009-05-25 at 07:16 +0000, Ben Clifford wrote: > On Sun, 24 May 2009, Mihael Hategan wrote: > > > PS: On the topic of static allocations, in people's experience, how does > > one specify reservations on various resources? > > I've only done that by sending email to help at teragrid.org, which is a > relatively high latency and expensive operation. >
Right. But once you have the reservation, I assume you have a mechanism of specifying which of your jobs should make use of it and which shouldn't. Like a reservation id. That mechanism is what I'm interested in.
From iraicu at cs.uchicago.edu Mon May 25 10:51:42 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 25 May 2009 10:51:42 -0500 Subject: [Swift-devel] block allocation In-Reply-To: <1243284283.3264.1.camel@localhost> References: <1243198668.30428.12.camel@localhost> <1243284283.3264.1.camel@localhost> Message-ID: <4A1ABE8E.1080208@cs.uchicago.edu>
At least on the BG/P using Cobalt, the reservations are based on user IDs and queues. Once a reservation is in place, and it has certain users associated with that reservation, those users can simply submit jobs to the specific queue (unique to that reservation, and is usually named by the sys admin that sets up the reservation, e.g. R.Falkon-8K), and the jobs will run under that reservation.
Other systems such as PBS/SGE/Condor could be different, I don't know.
Ioan
Mihael Hategan wrote: > On Mon, 2009-05-25 at 07:16 +0000, Ben Clifford wrote: >> On Sun, 24 May 2009, Mihael Hategan wrote: >>> PS: On the topic of static allocations, in people's experience, how does >>> one specify reservations on various resources? >> I've only done that by sending email to help at teragrid.org, which is a >> relatively high latency and expensive operation. > Right. But once you have the reservation, I assume you have a mechanism > of specifying which of your jobs should make use of it and which > shouldn't. Like a reservation id. That mechanism is what I'm interested > in.
From hategan at mcs.anl.gov Mon May 25 10:54:52 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 25 May 2009 10:54:52 -0500 Subject: [Swift-devel] block allocation In-Reply-To: References: <1243198668.30428.12.camel@localhost> Message-ID: <1243266892.3683.1.camel@localhost>
On Mon, 2009-05-25 at 13:30 +0000, Ben Clifford wrote: > On Sun, 24 May 2009, Mihael Hategan wrote: > > > - I updated the documentation on the profile entries, but there are > > permission problems when I try to deploy it. > > hmm. The docs should be updatable by anyone in the vdl2-svn unix group, > which you are in. > > Please can you make a trivial change and then send me the error message? >
I see that the single-page is updated. The problem is with the multi-page.
[hategan at login swift]$ ./update.sh
--------- Updating www... ----------
At revision 2943.
--------- Updating docs... ----------
At revision 2943.
make: Nothing to be done for `all'.
./build-chunked-userguide.sh
rm: cannot remove `appmodel.php': Permission denied
rm: cannot remove `buildoptions.php': Permission denied
rm: cannot remove `clustering.php': Permission denied
rm: cannot remove `coasters.php': Permission denied
rm: cannot remove `commands.php': Permission denied
rm: cannot remove `engineconfiguration.php': Permission denied
rm: cannot remove `extending.php': Permission denied
rm: cannot remove `functions.php': Permission denied
rm: cannot remove `index.php': Permission denied
rm: cannot remove `kickstart.php': Permission denied
rm: cannot remove `language.php': Permission denied
rm: cannot remove `localhowtos.php': Permission denied
rm: cannot remove `mappers.php': Permission denied
rm: cannot remove `overview.php': Permission denied
rm: cannot remove `procedures.php': Permission denied
rm: cannot remove `profiles.php': Permission denied
rm: cannot remove `reliability.php': Permission denied
rm: cannot remove `sitecatalog.php': Permission denied
rm: cannot remove `techoverview.php': Permission denied
rm: cannot remove `transformationcatalog.php': Permission denied
Computing chunks...
Writing overview.php for section(overview) ...
From hategan at mcs.anl.gov Mon May 25 10:55:31 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 25 May 2009 10:55:31 -0500 Subject: [Swift-devel] block allocation In-Reply-To: <4A1ABE8E.1080208@cs.uchicago.edu> References: <1243198668.30428.12.camel@localhost> <1243284283.3264.1.camel@localhost> <4A1ABE8E.1080208@cs.uchicago.edu> Message-ID: <1243266931.3683.3.camel@localhost>
Great. Thanks.
On Mon, 2009-05-25 at 10:51 -0500, Ioan Raicu wrote: > At least on the BG/P using Cobalt, the reservations are based on user > IDs and queues. Once a reservation is in place, and it has certain > users associated with that reservation, those users can simply submit > jobs to the specific queue (unique to that reservation, and is usually > named by the sys admin that sets up the reservation, e.g. > R.Falkon-8K), and the jobs will run under that reservation. Other > systems such as PBS/SGE/Condor could be different, I don't know. > > Ioan
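Referring back to the ./update.sh failures above, a plausible fix, assuming the generated pages should be owned and writable by the vdl2-svn group (this matches the later diagnosis in this thread that the owning group did not own the destination directory; the path is hypothetical):

    # Hand ownership of the generated userguide pages to the vdl2-svn group
    # and make them group-writable so any member can regenerate them:
    chgrp -R vdl2-svn /www/userguide    # hypothetical destination directory
    chmod -R g+w /www/userguide
    # setgid on the directory keeps newly generated files group-owned:
    chmod g+s /www/userguide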
From hategan at mcs.anl.gov Mon May 25 11:01:22 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 25 May 2009 11:01:22 -0500 Subject: [Swift-devel] block allocation In-Reply-To: <1243266892.3683.1.camel@localhost> References: <1243198668.30428.12.camel@localhost> <1243266892.3683.1.camel@localhost> Message-ID: <1243267282.4037.1.camel@localhost>
On Mon, 2009-05-25 at 10:54 -0500, Mihael Hategan wrote: > On Mon, 2009-05-25 at 13:30 +0000, Ben Clifford wrote: > > On Sun, 24 May 2009, Mihael Hategan wrote: > > > > > - I updated the documentation on the profile entries, but there are > > > permission problems when I try to deploy it. > > > > hmm. The docs should be updatable by anyone in the vdl2-svn unix group, > > which you are in. > > > > Please can you make a trivial change and then send me the error message?
> > I see that the single-page is updated. The problem is with the > multi-page.
Actually, I'm still unsure about that. Aren't the changes supposed to be visible immediately after the ./update.sh? If they are, then that, at least as far as the userguide is concerned (in particular the single-page), does not work.
From benc at hawaga.org.uk Tue May 26 02:26:02 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 26 May 2009 07:26:02 +0000 (GMT) Subject: [Swift-devel] block allocation In-Reply-To: <1243267282.4037.1.camel@localhost> References: <1243198668.30428.12.camel@localhost> <1243266892.3683.1.camel@localhost> <1243267282.4037.1.camel@localhost> Message-ID:
On Mon, 25 May 2009, Mihael Hategan wrote: > > I see that the single-page is updated. The problem is with the > > multi-page. > > Actually, I'm still unsure about that. Aren't the changes supposed to be > visible immediately after the ./update.sh? If they are, then that, at > least as the userguide is concerned (in particular the single-page), > does not work.
I just fixed a permissions problem with the multipage guide where the owning group did not own the destination directory.
It's not clear what the problem is with the single page - the output you gave doesn't show any errors (but also doesn't show any update being made to the PHP file at all). Next time you make a change, capture the output again and send it.
The web server serves directly from that directory, so it will start serving updates as soon as they are generated. If you're checking whether updates are actually working, look at file times using ls, though, rather than relying on your browser not to cache. --
From benc at hawaga.org.uk Tue May 26 03:30:08 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 26 May 2009 08:30:08 +0000 (GMT) Subject: [Swift-devel] infinity * n Message-ID:
In the coasters section of the user guide now:
> highOverallocation
> How many times larger than the job walltime should a block's walltime be > if all jobs are infinitely long
What does it mean for a job to be infinitely long there? That the job has infinite walltime, in which case what does it mean to make any multiple of that? --
From benc at hawaga.org.uk Tue May 26 10:14:37 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 26 May 2009 15:14:37 +0000 (GMT) Subject: [Swift-devel] provenance presentation at UC Message-ID:
I want to present the provenance work that I've done recently at UC either next weds or next thurs (3rd or 4th). Who is interested and who has preferences? --
From foster at anl.gov Tue May 26 10:44:49 2009 From: foster at anl.gov (foster at anl.gov) Date: Tue, 26 May 2009 10:44:49 -0500 (CDT) Subject: [Swift-devel] provenance presentation at UC In-Reply-To: References: Message-ID: <667EF28F-B439-4473-A399-BAD0E3CE29DE@anl.gov>
Could we do it in the Friday workshop?
Ian -- from mobile
On May 26, 2009, at 10:15 AM, Ben Clifford wrote: > > I want to present the provenance work that I've done recently at UC > either > next weds or next thurs (3rd or 4th). Who is interested and who has > preferences?
From benc at hawaga.org.uk Tue May 26 10:48:05 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 26 May 2009 15:48:05 +0000 (GMT) Subject: [Swift-devel] provenance presentation at UC In-Reply-To: <667EF28F-B439-4473-A399-BAD0E3CE29DE@anl.gov> References: <667EF28F-B439-4473-A399-BAD0E3CE29DE@anl.gov> Message-ID:
Has friday turned into a workshop? (or was it all along?) I was planning on going through provenance things with Tanu on that day, but I'd been expecting it to be more informa
On Tue, 26 May 2009, foster at anl.gov wrote: > Could we do it in the Friday workshop? > > Ian -- from mobile > > On May 26, 2009, at 10:15 AM, Ben Clifford wrote: > > > > > I want to present the provenance work that I've done recently at UC either > > next weds or next thurs (3rd or 4th). Who is interested and who has > > preferences?
From benc at hawaga.org.uk Tue May 26 10:49:55 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 26 May 2009 15:49:55 +0000 (GMT) Subject: [Swift-devel] provenance presentation at UC In-Reply-To: References: <667EF28F-B439-4473-A399-BAD0E3CE29DE@anl.gov> Message-ID:
On Tue, 26 May 2009, Ben Clifford wrote: > Has friday turned into a workshop? (or was it all along?) I was planning > on going through provenance things with Tanu on that day, but I'd been > expecting it to be more informa
umm the end of that line disappeared it seems. expecting it to be more informal, but friday is fine for me. --
From skenny at uchicago.edu Tue May 26 13:44:12 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Tue, 26 May 2009 13:44:12 -0500 (CDT) Subject: [Swift-devel] provenance presentation at UC In-Reply-To: References: Message-ID: <20090526134412.BYE43624@m4500-02.uchicago.edu>
i'd be interested...thursday slightly less-preferable. i'm copying a couple people from hnl who might be interested too. ~sk
---- Original message ---- >Date: Tue, 26 May 2009 15:14:37 +0000 (GMT) >From: Ben Clifford >Subject: [Swift-devel] provenance presentation at UC >To: swift-devel at ci.uchicago.edu >Cc: childers at mcs.anl.gov > > >I want to present the provenance work that I've done recently at UC either >next weds or next thurs (3rd or 4th). Who is interested and who has >preferences?
From zhaozhang at uchicago.edu Tue May 26 14:05:00 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 26 May 2009 14:05:00 -0500 Subject: [Swift-devel] block allocation In-Reply-To: <1243198668.30428.12.camel@localhost> References: <1243198668.30428.12.camel@localhost> Message-ID: <4A1C3D5C.5080003@uchicago.edu>
Hi, Mihael
I read through the new content about coasters in the user guide. Here I have several questions:
Given the following case: we have 100 jobs, 5 worker nodes, and each worker node has 4 cores, which means we have 4 workers on each node.
The profile key "Slot" is defined as "How many maximum LRM jobs/worker blocks are allowed". Correct me if my understanding is incorrect.
In the above case, 100 jobs / 5 nodes / 4 workers = 5 jobs per worker. So if we set "Slot" to 4, the above test case won't finish, because 5 exceeds the bound of 4. Is this correct?
Also, could you point me to a sites.xml definition you used for the local test? That will be helpful for me to get the ideas. Thanks.
best wishes
zhao
Mihael Hategan wrote: > I committed the block allocation code I had. Here's a quick status: > > - I tested it locally, alone, and with swift. > - Not tested on a cluster (my jobs just sat in the queue on TP and TGUC, > so I got bored and killed it). > - I updated the documentation on the profile entries, but there are > permission problems when I try to deploy it. > - I don't think jobs are cleaned properly if an early abort is done, so > be sure to check the queue and manually clean up if needed. > - Exercise caution in general. It's new stuff, and things may go bad. So > please keep an eye on the resources used and report misbehavior. > - There is no mechanism to specify a static allocation. This is due to > me giving it a low priority for this particular iteration of the code, > and not because I hate globus. > > Mihael > > PS: On the topic of static allocations, in people's experience, how does > one specify reservations on various resources?
From hategan at mcs.anl.gov Tue May 26 15:34:42 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 26 May 2009 15:34:42 -0500 Subject: [Swift-devel] Re: infinity * n In-Reply-To: References: Message-ID: <1243370082.3794.1.camel@localhost>
On Tue, 2009-05-26 at 08:30 +0000, Ben Clifford wrote: > In the coasters section of the user guide now: > > > highOverallocation > > > How many times larger than the job walltime should a block's walltime be > > if all jobs are infinitely long > > What does it mean for a job to be infinitely long there? That the job has > infinite walltime, in which case what does it mean to make any multiple > of that? >
lim_(n->+inf) (overallocation(n) / n) = highOverallocation
From hategan at mcs.anl.gov Tue May 26 15:37:27 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 26 May 2009 15:37:27 -0500 Subject: [Swift-devel] block allocation In-Reply-To: <4A1C3D5C.5080003@uchicago.edu> References: <1243198668.30428.12.camel@localhost> <4A1C3D5C.5080003@uchicago.edu> Message-ID: <1243370247.3794.4.camel@localhost>
On Tue, 2009-05-26 at 14:05 -0500, Zhao Zhang wrote: > Hi, Mihael > > I read through the new content about coaster in user guide. Here I got > several questions: > > Given the following case: we have 100 jobs, 5 worker nodes, and each > worker node has 4 cores, which means > we have 4 workers on each node. > > The profile key "Slot", it is defined as "How many maximum LRM > jobs/worker blocks are allowed". > Correct me if my understanding is incorrect. In the above case, 100 > jobs/5 nodes/4 workers=5 jobs/worker > > So if we set "Slot" to 4,
Set the slots to the maximum number of concurrent jobs the LRM allows you to have. On Ranger that would be 50 (as far as I remember), and on BGP, that would be 6.
> this above test case won't finish because 5 is > going out of the bound 4. Is this correct?
No. Once old blocks are done, new blocks are created.
> > Also, could you point me to a sites.xml definition you used for the > local test? That will be helpful for me to get > the ideas. Thanks.
/var/tmp 4 2 5 1 2 true
From foster at anl.gov Wed May 27 10:45:50 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 27 May 2009 10:45:50 -0500 Subject: [Swift-devel] EC2 plugin Message-ID:
here is something comparable in a different context: http://wiki.hudson-ci.org/display/HUDSON/Amazon+EC2+Plugin
From iraicu at cs.uchicago.edu Wed May 27 11:41:44 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 27 May 2009 11:41:44 -0500 Subject: [Swift-devel] provenance presentation at UC In-Reply-To: References: <667EF28F-B439-4473-A399-BAD0E3CE29DE@anl.gov> Message-ID: <4A1D6D48.6060105@cs.uchicago.edu>
Yes, Friday, June 5th, will be the DSL Workshop 2009. http://dsl-wiki.cs.uchicago.edu/index.php/DSL_Workshop_2009 If some of you haven't heard about it, it's because I haven't sent out the announcement yet.
Ben, I have added you to the schedule. The only question is, do you want a 30 min slot or a 45 min slot?
Ioan
Ben Clifford wrote: > On Tue, 26 May 2009, Ben Clifford wrote: > >> Has friday turned into a workshop? (or was it all along?) I was planning >> on going through provenance things with Tanu on that day, but I'd been >> expecting it to be more informa >> > > umm the end of that line disappeared it seems. > > expecting it to be more informal, but friday is fine for me. > > --
From benc at hawaga.org.uk Wed May 27 11:55:33 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 27 May 2009 16:55:33 +0000 (GMT) Subject: [Swift-devel] provenance presentation at UC In-Reply-To: <4A1D6D48.6060105@cs.uchicago.edu> References: <667EF28F-B439-4473-A399-BAD0E3CE29DE@anl.gov> <4A1D6D48.6060105@cs.uchicago.edu> Message-ID:
On Wed, 27 May 2009, Ioan Raicu wrote: > Ben, I have added you to the schedule. The only question is, do you want a 30 > min slot, or a 45 min slot?
I can do either - whichever fits in the schedule better. Let me know beforehand though. --
From iraicu at cs.uchicago.edu Wed May 27 12:24:11 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 27 May 2009 12:24:11 -0500 Subject: [Swift-devel] provenance presentation at UC In-Reply-To: References: <667EF28F-B439-4473-A399-BAD0E3CE29DE@anl.gov> <4A1D6D48.6060105@cs.uchicago.edu> Message-ID: <4A1D773B.2060700@cs.uchicago.edu>
OK, let's plan for 30 min then. I scheduled it for 1:30PM - 2PM. http://dsl-wiki.cs.uchicago.edu/index.php/DSL_Workshop_2009
Ioan
Ben Clifford wrote: > On Wed, 27 May 2009, Ioan Raicu wrote: > >> Ben, I have added you to the schedule. The only question is, do you want a 30 >> min slot, or a 45 min slot? >> > > I can do either - whichever fits in the schedule better. Let me know > beforehand though. > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E.
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From zhaozhang at uchicago.edu Wed May 27 13:42:02 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 27 May 2009 13:42:02 -0500 Subject: [Swift-devel] block allocation In-Reply-To: <1243370247.3794.4.camel@localhost> References: <1243198668.30428.12.camel@localhost> <4A1C3D5C.5080003@uchicago.edu> <1243370247.3794.4.camel@localhost> Message-ID: <4A1D897A.3050209@uchicago.edu> Hi, Mihael I have been running the language-behavior and application test with coaster up to date from SVN. Could you help double check I am using the right version of cog? Swift svn swift-r2949 cog-r2406 Also, all my tests returned successful, are there any run-time logs that I could see how many workers were running on each site and monitoring their status? Like how many registered, how many are idle how many are busy and etc. I am also attaching two sites.xml definition for uc-teragrid and ranger. best zhao [zzhang at communicado scip]$ cat /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/coaster_new/tgranger-sge-gram2.xml TG-CCR080022N /work/00946/zzhang/work /tmp/zzhang/jobdir 16 development 50 10 4 2 5 1 2 false [zzhang at communicado scip]$ cat /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/coaster_new/tguc-pbs-gram2.xml TG-DBS080005N /home/zzhang/work 4 2 5 1 2 false Mihael Hategan wrote: > On Tue, 2009-05-26 at 14:05 -0500, Zhao Zhang wrote: > >> Hi, Mihael >> >> I read through the new content about coaster in user guide. Here I got >> several questions: >> >> Given the following case: we have 100 jobs, 5 worker nodes, and each >> worker node has 4 cores, which means >> we have 4 workers on each node. >> >> The profile key "Slot", it is defined as "How many maximum LRM >> jobs/worker blocks are allowed". >> Correct me if my understanding is incorrect. In the above case, 100 >> jobs/5 nodes/4 workers=5 jobs/worker >> >> So if we set "Slot" to 4, >> > > Set the slots to the maximum number of concurrent jobs the LRM allows > you to have. On Ranger that would be 50 (as far as I remember), and on > BGP, that would be 6. > > >> this above test case won't finish because 5 is >> going out of the bound 4. Is this correct? >> > > No. Once old blocks are done, new blocks are created. > > >> Also, could you point me to a sites.xml definition you used for the >> local test? That will be helpful for me to get >> the ideas. Thanks. >> > > > > > > url="localhost" /> > /var/tmp > 4 > 2 > 5 > 1 > 2 > key="remoteMonitorEnabled">true > > > > > > From hategan at mcs.anl.gov Wed May 27 14:23:10 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 27 May 2009 14:23:10 -0500 Subject: [Swift-devel] block allocation In-Reply-To: <4A1D897A.3050209@uchicago.edu> References: <1243198668.30428.12.camel@localhost> <4A1C3D5C.5080003@uchicago.edu> <1243370247.3794.4.camel@localhost> <4A1D897A.3050209@uchicago.edu> Message-ID: <1243452190.26671.18.camel@localhost> On Wed, 2009-05-27 at 13:42 -0500, Zhao Zhang wrote: > Hi, Mihael > > I have been running the language-behavior and application test with > coaster up to date from SVN. 
> Could you help double check I am using the right version of cog? > Swift svn swift-r2949 cog-r2406 Looks right. > > Also, all my tests returned successful, That is a bit intriguing. > are there any run-time logs > that I could see how many workers were > running on each site and monitoring their status? You'll find that information in the usual place: ~/.globus/coasters/coaster(s).log. In there, every time the schedule is (re)calculated, something like this is printed: BlockQueueProcessor Committed 0 new jobs * that's the number of new jobs that were added since the last planning step. BlockQueueProcessor Cleaned 0 done blocks * blocks that finished since the last planning step. BlockQueueProcessor Updated allocsize: 10332 * that's how much time is available in all the blocks (running or to be started). It's the sum of all the remaining walltimes of all the workers. BlockQueueProcessor Queued 0 jobs to existing blocks * for how many jobs it was deemed that existing blocks have enough space to run them (this is really just a 2d box packing algorithm). BlockQueueProcessor allocsize = 10332, queuedsize = 9600, qsz = 16 * queuedsize is the sum of walltimes of all jobs that are deemed to have some space in some block; qsz is the number of such jobs. BlockQueueProcessor Requeued 0 non-fitting jobs * When the plan doesn't go according to plan, it may be required that some jobs that were thought to fit in existing blocks won't really fit in existing blocks. This shows the number of such jobs. BlockQueueProcessor Required size: 0 for 0 jobs * That tells you, for all jobs that don't have block space, the sum of their walltimes and their number. There are a few more coming from the same class when blocks are created. I'll let you discover those. > Like how many > registered, how many are idle > how many are busy and etc. I am also attaching two sites.xml definition > for uc-teragrid and ranger. > > best > zhao > > [zzhang at communicado scip]$ cat > /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/coaster_new/tgranger-sge-gram2.xml > > > > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > TG-CCR080022N > /work/00946/zzhang/work > key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir > 16 > development > 50 > 10 > 4 You should probably up that to the number of safe gt2 jobs (20). > 2 You don't really need to specify a node granularity other than 1, since you can request any number of nodes on ranger. > 5 > 1 > 2 You probably want to have a larger maxnodes. You probably don't event want to specify it, because there's nothing imposing a limit on that. > false > You also don't need to specify the default. From iraicu at cs.uchicago.edu Wed May 27 15:09:26 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 27 May 2009 15:09:26 -0500 Subject: [Swift-devel] [Fwd: Re: [Fwd: CFP - 1st International Workshop on High Performance Distributed Data Management]] Message-ID: <4A1D9DF6.2030808@cs.uchicago.edu> Hi all, Here is an interesting workshop on data management in distributed systems. 
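Putting these comments together, here is a sketch of what the adjusted Ranger pool could look like. The archive stripped the XML tags from the postings above, so the element and key names below (pool, execution, workdirectory, and the globus-namespace profile keys) are reconstructed assumptions, not text from the thread; the values come from the discussion itself (16 workers per node, the development queue, slots raised toward the 20 safe gt2 jobs, node granularity of 1, and no maxNodes or remoteMonitorEnabled entries):

    # Write out a hypothetical reconstruction of the sites.xml pool entry:
    cat > ranger-pool-sketch.xml <<'EOF'
    <pool handle="ranger">
      <execution provider="coaster" url="gatekeeper.ranger.tacc.teragrid.org"
                 jobManager="gt2:gt2:SGE"/>
      <profile namespace="globus" key="project">TG-CCR080022N</profile>
      <workdirectory>/work/00946/zzhang/work</workdirectory>
      <profile namespace="env" key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir</profile>
      <profile namespace="globus" key="workersPerNode">16</profile>
      <profile namespace="globus" key="queue">development</profile>
      <profile namespace="globus" key="slots">20</profile>
      <profile namespace="globus" key="nodeGranularity">1</profile>
    </pool>
    EOF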
Cheers, Ioan > > > -------- Original Message -------- > Subject: [Dbworld] CFP - 1st International Workshop on High > Performance Distributed Data Management > Date: Sat, 23 May 2009 01:25:39 -0500 > From: Sandro Fiore > Reply-To: dbworld_owner at yahoo.com > To: dbworld at cs.wisc.edu > > > > [Please accept our apologies if you receive multiple copies of this email] > > --------------------------------------------------------------------------- > CALL FOR PAPERS > --------------------------------------------------------------------------- > > 1st International Workshop on High Performance Distributed Data Management (HPDDM09) > > held in conjunction with > > 4th International Conference for Internet Technology and Secured Transactions (ICITST-2009) > November 9-12, 2009, London, UK > > Technical Co-Sponsored by IEEE > > Workshop website: http://grelc.unile.it/HPDDM09 > > Call for Papers > Distributed Data Management addresses a lot of problems related to large data sets manipulation, massive data replication, distributed metadata management, efficient data transfer over wide-area networks using Internet infrastructure as the basis for building transparent, robust and efficient data services. High performance distributed data management is continuosly gaining much interest in several contexts (finance, climate change, high henergy physics, astronomy, etc.) and plays a key role in the current and future eInfrastructures and middleware solutions. Several challenges concerning distributed data management and related to heterogeneity, security, dependability, high autonomy, P2P and large-scale distribution of data resources still need to be faced. > The main purpose of this workshop is to bring researchers and practitioners to identify and explore open issues as well as discuss and propose high performance data management solutions for distributed environments. Authors are invited to submit regular papers. 
> Topics of Interest
> High Performance data management
> Data Grids
> Data and Information Quality Management
> Data-intensive Grid applications
> Digital Libraries
> Data Life Cycle in Products and Processes
> Distributed Database Management
> High performance Data Mining
> Data Models for Production Systems and Services
> P2P data integration
> Accessing legacy databases via semantic services
> Semantic distributed query processing
> Ontology-based integration of data and services
> WSRF-based Data Grid services
> Grid Database Access, Management and Integration Services
> Data security and privacy
> Data Grid Portals
> Distributed Metadata Management for eScience
> Data warehousing
> Workflow/Dataflow management
> High performance Storage management
> Replication, Indexing, caching and load balancing in distributed environments
> Data Clouds & Data Virtualization
> Real cases and application scenarios of integration of data in distributed environments
>
> Important Dates
> Paper submission deadline: June 30, 2009
> Author notification: July 31, 2009
> Camera-ready version due: September 1, 2009
> Conference: November 9-12, 2009
>
> For more information please visit:
> - the conference website: http://www.icitst.org/
> - the workshop website: http://grelc.unile.it/HPDDM09/
>
> Workshop Organizers
> Giovanni Aloisio - Euro-Mediterranean Centre for Climate Change (CMCC) & University of Salento - Italy
> Sandro Fiore - Euro-Mediterranean Centre for Climate Change (CMCC) & University of Salento - Italy
> Peter Fox - Rensselaer Polytechnic Institute (RPI) and National Center for Atmospheric Research (NCAR), USA
>
> Program Committee (provisional)
> Massimo Cafaro - University of Salento - Italy
> Claudia Lucía Jiménez Guarín - Universidad de Los Andes - Bogotá, Colombia
> Al Kellie - NCAR - USA
> Craig Lee - AeroSpace - USA
> Almerico Murli - Univ. of Napoli Federico II & SPACI Consortium - Italy
> Kristin Stock - University of Nottingham - UK
> Osamu Tatebe - Univ. of Tsukuba - Japan
>
> Submission procedure
> Papers reporting original and unpublished research results and experience are solicited. Papers will be selected based on their originality, timeliness, significance, relevance, and clarity of presentation. Accepted and presented papers will be published in the conference proceedings.
> Submissions should include an abstract, keywords, and the e-mail address of the corresponding author, and must not exceed 10 pages in single-column format, including tables and figures.
> Submissions imply the willingness of at least one author to register, attend the conference, and present the paper. Workshop participants must pay the ICITST-2009 conference registration fee. Each paper will be refereed by at least three independent reviewers. Paper submission indicates the intention of the author to present the paper at the HPDDM09 workshop at ICITST-2009.
> Your final version must be in IEEE's required format (maximum of 6 pages in length). Additional submission guidelines will be available on the conference web site.
> Please submit your paper by e-mail to sandro.fiore [AT] unile.it (make sure that the subject of the e-mail says "HPDDM09 Submission").
From zhaozhang at uchicago.edu Wed May 27 16:31:13 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 27 May 2009 16:31:13 -0500 Subject: [Swift-devel] block allocation In-Reply-To: <1243452190.26671.18.camel@localhost> References: <1243198668.30428.12.camel@localhost> <4A1C3D5C.5080003@uchicago.edu> <1243370247.3794.4.camel@localhost> <4A1D897A.3050209@uchicago.edu> <1243452190.26671.18.camel@localhost> Message-ID: <4A1DB120.4090402@uchicago.edu>
Hi, Mihael
I did a clean run of 100 scip jobs on ranger. (scip is a new application from MCS). The log is at /home/zzhang/scip/ranger-logs/coasters.log
Mihael Hategan wrote: > On Wed, 2009-05-27 at 13:42 -0500, Zhao Zhang wrote: > >> Hi, Mihael >> >> I have been running the language-behavior and application test with >> coaster up to date from SVN. >> Could you help double check I am using the right version of cog? >> Swift svn swift-r2949 cog-r2406 > > Looks right. > > >> Also, all my tests returned successful, > > That is a bit intriguing. > > >> are there any run-time logs >> that I could see how many workers were >> running on each site and monitoring their status? > > You'll find that information in the usual place: > ~/.globus/coasters/coaster(s).log. > > In there, every time the schedule is (re)calculated, something like this > is printed: > BlockQueueProcessor Committed 0 new jobs > > * that's the number of new jobs that were added since the last planning > step.
> BlockQueueProcessor Updated allocsize: 10332 > > * that's how much time is available in all the blocks (running or to be > started). It's the sum of all the remaining walltimes of all the > workers. > > BlockQueueProcessor Queued 0 jobs to existing blocks > > * for how many jobs it was deemed that existing blocks have enough space > to run them (this is really just a 2d box packing algorithm). > > BlockQueueProcessor allocsize = 10332, queuedsize = 9600, qsz = 16 > > * queuedsize is the sum of walltimes of all jobs that are deemed to have > some space in some block; qsz is the number of such jobs. > > BlockQueueProcessor Requeued 0 non-fitting jobs > > * When the plan doesn't go according to plan, it may be required that > some jobs that were thought to fit in existing blocks won't really fit > in existing blocks. This shows the number of such jobs. > > BlockQueueProcessor Required size: 0 for 0 jobs > > * That tells you, for all jobs that don't have block space, the sum of > their walltimes and their number. > Based on the above values, I could tell that coaster schedules jobs on job-wall-time. So before we run an application, we need to have an estimated running time, is this correct? How about if we don't know about the running time, or the running time for each job varies a lot? > There are a few more coming from the same class when blocks are created. > I'll let you discover those. > Yes, I am reading through the logs and trying to find more fun from it. zhao > >> Like how many >> registered, how many are idle >> how many are busy and etc. I am also attaching two sites.xml definition >> for uc-teragrid and ranger. >> >> best >> zhao >> >> [zzhang at communicado scip]$ cat >> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/coaster_new/tgranger-sge-gram2.xml >> >> >> >> > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> >> TG-CCR080022N >> /work/00946/zzhang/work >> > key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir >> 16 >> development >> 50 >> 10 >> 4 >> > > You should probably up that to the number of safe gt2 jobs (20). > > >> 2 >> > > You don't really need to specify a node granularity other than 1, since > you can request any number of nodes on ranger. > > >> 5 >> 1 >> 2 >> > > You probably want to have a larger maxnodes. You probably don't event > want to specify it, because there's nothing imposing a limit on that. > > >> false >> >> > > You also don't need to specify the default. > > > > From hategan at mcs.anl.gov Wed May 27 16:40:58 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 27 May 2009 16:40:58 -0500 Subject: [Swift-devel] block allocation In-Reply-To: <4A1DB120.4090402@uchicago.edu> References: <1243198668.30428.12.camel@localhost> <4A1C3D5C.5080003@uchicago.edu> <1243370247.3794.4.camel@localhost> <4A1D897A.3050209@uchicago.edu> <1243452190.26671.18.camel@localhost> <4A1DB120.4090402@uchicago.edu> Message-ID: <1243460458.29393.0.camel@localhost> On Wed, 2009-05-27 at 16:31 -0500, Zhao Zhang wrote: > Hi, Mihael > > I did a clean run of 100 scip jobs on ranger. (scip is a new application > from MCS). And? 
> The log is at /home/zzhang/scip/ranger-logs/coasters.log [hategan at communicado ~]$ less /home/zzhang/scip/ranger-logs/coasters.log /home/zzhang/scip/ranger-logs/coasters.log: Permission denied From zhaozhang at uchicago.edu Wed May 27 16:49:57 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 27 May 2009 16:49:57 -0500 Subject: [Swift-devel] block allocation In-Reply-To: <1243460458.29393.0.camel@localhost> References: <1243198668.30428.12.camel@localhost> <4A1C3D5C.5080003@uchicago.edu> <1243370247.3794.4.camel@localhost> <4A1D897A.3050209@uchicago.edu> <1243452190.26671.18.camel@localhost> <4A1DB120.4090402@uchicago.edu> <1243460458.29393.0.camel@localhost> Message-ID: <4A1DB585.30501@uchicago.edu> Mihael Hategan wrote: > On Wed, 2009-05-27 at 16:31 -0500, Zhao Zhang wrote: > >> Hi, Mihael >> >> I did a clean run of 100 scip jobs on ranger. (scip is a new application >> from MCS). >> > > And? > I need your help to make sure coaster is working as expected. > >> The log is at /home/zzhang/scip/ranger-logs/coasters.log >> > > [hategan at communicado ~]$ less /home/zzhang/scip/ranger-logs/coasters.log > /home/zzhang/scip/ranger-logs/coasters.log: Permission denied > try now. > > > From hategan at mcs.anl.gov Wed May 27 17:30:04 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 27 May 2009 17:30:04 -0500 Subject: [Swift-devel] block allocation In-Reply-To: <4A1DB585.30501@uchicago.edu> References: <1243198668.30428.12.camel@localhost> <4A1C3D5C.5080003@uchicago.edu> <1243370247.3794.4.camel@localhost> <4A1D897A.3050209@uchicago.edu> <1243452190.26671.18.camel@localhost> <4A1DB120.4090402@uchicago.edu> <1243460458.29393.0.camel@localhost> <4A1DB585.30501@uchicago.edu> Message-ID: <1243463404.30223.3.camel@localhost> On Wed, 2009-05-27 at 16:49 -0500, Zhao Zhang wrote: > > Mihael Hategan wrote: > > On Wed, 2009-05-27 at 16:31 -0500, Zhao Zhang wrote: > > > >> Hi, Mihael > >> > >> I did a clean run of 100 scip jobs on ranger. (scip is a new application > >> from MCS). > >> > > > > And? > > > I need your help to make sure coaster is working as expected. Did the workflow finish successfully? I'll assume "yes", since you didn't mention any errors, but it would be useful to state such things from the start. As far as it working as expected, I have more confidence in the workings of the algorithm itself than in the ancillary stuff. In other words, if you don't see errors/restarted jobs or stack traces in the logs, it's likely that things are fine. > > > >> The log is at /home/zzhang/scip/ranger-logs/coasters.log > >> > > > > [hategan at communicado ~]$ less /home/zzhang/scip/ranger-logs/coasters.log > > /home/zzhang/scip/ranger-logs/coasters.log: Permission denied > > > try now. 
[hategan at communicado ~]$ date Wed May 27 17:26:51 CDT 2009 [hategan at communicado ~]$ less /home/zzhang/scip/ranger-logs/coasters.log /home/zzhang/scip/ranger-logs/coasters.log: Permission denied From zhaozhang at uchicago.edu Wed May 27 17:33:45 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 27 May 2009 17:33:45 -0500 Subject: [Swift-devel] block allocation In-Reply-To: <1243463404.30223.3.camel@localhost> References: <1243198668.30428.12.camel@localhost> <4A1C3D5C.5080003@uchicago.edu> <1243370247.3794.4.camel@localhost> <4A1D897A.3050209@uchicago.edu> <1243452190.26671.18.camel@localhost> <4A1DB120.4090402@uchicago.edu> <1243460458.29393.0.camel@localhost> <4A1DB585.30501@uchicago.edu> <1243463404.30223.3.camel@localhost> Message-ID: <4A1DBFC9.7040500@uchicago.edu> Mihael Hategan wrote: > On Wed, 2009-05-27 at 16:49 -0500, Zhao Zhang wrote: > >> Mihael Hategan wrote: >> >>> On Wed, 2009-05-27 at 16:31 -0500, Zhao Zhang wrote: >>> >>> >>>> Hi, Mihael >>>> >>>> I did a clean run of 100 scip jobs on ranger. (scip is a new application >>>> from MCS). >>>> >>>> >>> And? >>> >>> >> I need your help to make sure coaster is working as expected. >> > > Did the workflow finish successfully? I'll assume "yes", since you > didn't mention any errors, but it would be useful to state such things > from the start. > > Yes, it was successful. > As far as it working as expected, I have more confidence in the workings > of the algorithm itself than in the ancillary stuff. In other words, if > you don't see errors/restarted jobs or stack traces in the logs, it's > likely that things are fine. > > >>> >>> >>>> The log is at /home/zzhang/scip/ranger-logs/coasters.log >>>> >>>> >>> [hategan at communicado ~]$ less /home/zzhang/scip/ranger-logs/coasters.log >>> /home/zzhang/scip/ranger-logs/coasters.log: Permission denied >>> >>> >> try now. >> > > [hategan at communicado ~]$ date > Wed May 27 17:26:51 CDT 2009 > [hategan at communicado ~]$ less /home/zzhang/scip/ranger-logs/coasters.log > /home/zzhang/scip/ranger-logs/coasters.log: Permission denied > > sorry, try again. zhao > > From hategan at mcs.anl.gov Wed May 27 17:43:30 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 27 May 2009 17:43:30 -0500 Subject: [Swift-devel] block allocation In-Reply-To: <4A1DBFC9.7040500@uchicago.edu> References: <1243198668.30428.12.camel@localhost> <4A1C3D5C.5080003@uchicago.edu> <1243370247.3794.4.camel@localhost> <4A1D897A.3050209@uchicago.edu> <1243452190.26671.18.camel@localhost> <4A1DB120.4090402@uchicago.edu> <1243460458.29393.0.camel@localhost> <4A1DB585.30501@uchicago.edu> <1243463404.30223.3.camel@localhost> <4A1DBFC9.7040500@uchicago.edu> Message-ID: <1243464210.30425.1.camel@localhost> On Wed, 2009-05-27 at 17:33 -0500, Zhao Zhang wrote: > > Mihael Hategan wrote: > > On Wed, 2009-05-27 at 16:49 -0500, Zhao Zhang wrote: > > > >> Mihael Hategan wrote: > >> > >>> On Wed, 2009-05-27 at 16:31 -0500, Zhao Zhang wrote: > >>> > >>> > >>>> Hi, Mihael > >>>> > >>>> I did a clean run of 100 scip jobs on ranger. (scip is a new application > >>>> from MCS). > >>>> > >>>> > >>> And? > >>> > >>> > >> I need your help to make sure coaster is working as expected. > >> > > > > Did the workflow finish successfully? I'll assume "yes", since you > > didn't mention any errors, but it would be useful to state such things > > from the start. > > > > > Yes, it was successful. That's not a bad sign. 
> > As far as it working as expected, I have more confidence in the workings > > of the algorithm itself than in the ancillary stuff. In other words, if > > you don't see errors/restarted jobs or stack traces in the logs, it's > > likely that things are fine. > > > > > >>> > >>> > >>>> The log is at /home/zzhang/scip/ranger-logs/coasters.log > >>>> > >>>> > >>> [hategan at communicado ~]$ less /home/zzhang/scip/ranger-logs/coasters.log > >>> /home/zzhang/scip/ranger-logs/coasters.log: Permission denied > >>> > >>> > >> try now. > >> > > > > [hategan at communicado ~]$ date > > Wed May 27 17:26:51 CDT 2009 > > [hategan at communicado ~]$ less /home/zzhang/scip/ranger-logs/coasters.log > > /home/zzhang/scip/ranger-logs/coasters.log: Permission denied > > > > > sorry, try again. Looks ok. There are some things that should probably be adjusted there, but it looks reasonable. Where the queued/running jobs removed from the queue when the workflow finished? From zhaozhang at uchicago.edu Wed May 27 18:10:21 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 27 May 2009 18:10:21 -0500 Subject: [Swift-devel] block allocation In-Reply-To: <1243464210.30425.1.camel@localhost> References: <1243198668.30428.12.camel@localhost> <4A1C3D5C.5080003@uchicago.edu> <1243370247.3794.4.camel@localhost> <4A1D897A.3050209@uchicago.edu> <1243452190.26671.18.camel@localhost> <4A1DB120.4090402@uchicago.edu> <1243460458.29393.0.camel@localhost> <4A1DB585.30501@uchicago.edu> <1243463404.30223.3.camel@localhost> <4A1DBFC9.7040500@uchicago.edu> <1243464210.30425.1.camel@localhost> Message-ID: <4A1DC85D.5060701@uchicago.edu> Sorry, Mihael, I didn't get your last question. zhao Mihael Hategan wrote: > On Wed, 2009-05-27 at 17:33 -0500, Zhao Zhang wrote: > >> Mihael Hategan wrote: >> >>> On Wed, 2009-05-27 at 16:49 -0500, Zhao Zhang wrote: >>> >>> >>>> Mihael Hategan wrote: >>>> >>>> >>>>> On Wed, 2009-05-27 at 16:31 -0500, Zhao Zhang wrote: >>>>> >>>>> >>>>> >>>>>> Hi, Mihael >>>>>> >>>>>> I did a clean run of 100 scip jobs on ranger. (scip is a new application >>>>>> from MCS). >>>>>> >>>>>> >>>>>> >>>>> And? >>>>> >>>>> >>>>> >>>> I need your help to make sure coaster is working as expected. >>>> >>>> >>> Did the workflow finish successfully? I'll assume "yes", since you >>> didn't mention any errors, but it would be useful to state such things >>> from the start. >>> >>> >>> >> Yes, it was successful. >> > > That's not a bad sign. > > >>> As far as it working as expected, I have more confidence in the workings >>> of the algorithm itself than in the ancillary stuff. In other words, if >>> you don't see errors/restarted jobs or stack traces in the logs, it's >>> likely that things are fine. >>> >>> >>> >>>>> >>>>> >>>>> >>>>>> The log is at /home/zzhang/scip/ranger-logs/coasters.log >>>>>> >>>>>> >>>>>> >>>>> [hategan at communicado ~]$ less /home/zzhang/scip/ranger-logs/coasters.log >>>>> /home/zzhang/scip/ranger-logs/coasters.log: Permission denied >>>>> >>>>> >>>>> >>>> try now. >>>> >>>> >>> [hategan at communicado ~]$ date >>> Wed May 27 17:26:51 CDT 2009 >>> [hategan at communicado ~]$ less /home/zzhang/scip/ranger-logs/coasters.log >>> /home/zzhang/scip/ranger-logs/coasters.log: Permission denied >>> >>> >>> >> sorry, try again. >> > > Looks ok. There are some things that should probably be adjusted there, > but it looks reasonable. Where the queued/running jobs removed from the > queue when the workflow finished? 
From hategan at mcs.anl.gov Thu May 28 02:21:42 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 28 May 2009 02:21:42 -0500
Subject: [Swift-devel] block allocation
In-Reply-To: <4A1DC85D.5060701@uchicago.edu>
References: <1243198668.30428.12.camel@localhost> <4A1C3D5C.5080003@uchicago.edu> <1243370247.3794.4.camel@localhost> <4A1D897A.3050209@uchicago.edu> <1243452190.26671.18.camel@localhost> <4A1DB120.4090402@uchicago.edu> <1243460458.29393.0.camel@localhost> <4A1DB585.30501@uchicago.edu> <1243463404.30223.3.camel@localhost> <4A1DBFC9.7040500@uchicago.edu> <1243464210.30425.1.camel@localhost> <4A1DC85D.5060701@uchicago.edu>
Message-ID: <1243495302.31029.0.camel@localhost>

You should check whether blocks are properly de-allocated when all the work is done.

On Wed, 2009-05-27 at 18:10 -0500, Zhao Zhang wrote:
> Sorry, Mihael, I didn't get your last question.
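A minimal way to run that check, assuming Moab's showq (as used later in this thread) and the coasters.log path above:

watch -n 60 'showq | grep zzhang'   # the coaster blocks should drain from the queue after the run
grep 'Shutdown sequence completed' /home/zzhang/scip/ranger-logs/coasters.log   # the service should report a clean shutdown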
From bugzilla-daemon at mcs.anl.gov Thu May 28 12:05:32 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 28 May 2009 12:05:32 -0500 (CDT)
Subject: [Swift-devel] [Bug 210] New: job exceeding wallclock limit -- error is not reported by swift
Message-ID: 

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=210

Summary: job exceeding wallclock limit -- error is not reported by swift
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Windows
Status: NEW
Severity: normal
Priority: P2
Component: General
AssignedTo: hategan at mcs.anl.gov
ReportedBy: skenny at uchicago.edu
CC: benc at hawaga.org.uk

-- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching someone on the CC list of the bug.

From zhaozhang at uchicago.edu Thu May 28 12:21:38 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 28 May 2009 12:21:38 -0500
Subject: [Swift-devel] block allocation
In-Reply-To: <1243495302.31029.0.camel@localhost>
References: <1243198668.30428.12.camel@localhost> <4A1C3D5C.5080003@uchicago.edu> <1243370247.3794.4.camel@localhost> <4A1D897A.3050209@uchicago.edu> <1243452190.26671.18.camel@localhost> <4A1DB120.4090402@uchicago.edu> <1243460458.29393.0.camel@localhost> <4A1DB585.30501@uchicago.edu> <1243463404.30223.3.camel@localhost> <4A1DBFC9.7040500@uchicago.edu> <1243464210.30425.1.camel@localhost> <4A1DC85D.5060701@uchicago.edu> <1243495302.31029.0.camel@localhost>
Message-ID: <4A1EC822.2030500@uchicago.edu>

Hi, Mihael

Mihael Hategan wrote:
> You should check whether blocks are properly de-allocated when all the work is done.

I made a clean run of 100 scip jobs. Here is a description of the procedure:

1. Workflow started on the submit host. Queue status:

login3% showq | grep zzhang
745500 data zzhang Waiting 16 00:50:00 Thu May 28 11:59:50
745501 data zzhang Waiting 16 00:50:00 Thu May 28 11:59:52
745502 data zzhang Waiting 16 00:40:00 Thu May 28 11:59:56

2. Workflow running on Ranger. Queue status:

login3% showq | grep zzhang
745500 data zzhang Running 16 00:49:53 Thu May 28 12:00:57
745501 data zzhang Running 16 00:49:53 Thu May 28 12:00:57
745502 data zzhang Running 16 00:39:53 Thu May 28 12:00:57
745504 data zzhang Running 16 00:39:53 Thu May 28 12:00:57

login3% showq | grep zzhang
745500 data zzhang Running 16 00:49:36 Thu May 28 12:00:57
745501 data zzhang Running 16 00:49:36 Thu May 28 12:00:57
745502 data zzhang Running 16 00:39:36 Thu May 28 12:00:57
745504 data zzhang Running 16 00:39:36 Thu May 28 12:00:57

login3% showq | grep zzhang
745500 data zzhang Running 16 00:49:30 Thu May 28 12:00:57
745501 data zzhang Running 16 00:49:30 Thu May 28 12:00:57
745502 data zzhang Running 16 00:39:30 Thu May 28 12:00:57
745504 data zzhang Running 16 00:39:30 Thu May 28 12:00:57
3. Workflow finished. Queue status:

login3% showq | grep zzhang
745500 data zzhang Running 16 00:48:06 Thu May 28 12:00:57
745501 data zzhang Running 16 00:48:06 Thu May 28 12:00:57
745502 data zzhang Running 16 00:38:06 Thu May 28 12:00:57
745504 data zzhang Running 16 00:38:06 Thu May 28 12:00:57
745511 data zzhang Waiting 16 00:10:00 Thu May 28 12:02:39

As you can tell, there is one more job, with a 10-minute walltime, coming into the queue. I checked the last 20 lines of coasters.log; it doesn't say much about this:

2009-05-28 12:02:38,246-0500 INFO AbstractKarajanChannel GSSCChannel-https://128.135.125.17:50004(1) REQ: Handler(SHUTDOWNSERVICE)
2009-05-28 12:02:38,249-0500 INFO CoasterService Shutdown sequence completed
2009-05-28 12:02:39,212-0500 INFO Cpu 0528-591140-000000:1 pullLater
2009-05-28 12:02:39,213-0500 INFO Cpu 0528-591140-000001:0 pull
2009-05-28 12:02:40,223-0500 INFO Cpu 0528-591140-000001:0 pullLater
2009-05-28 12:02:40,223-0500 INFO Cpu 0528-591140-000000:0 pull
2009-05-28 12:02:41,232-0500 INFO Cpu 0528-591140-000000:0 pullLater
2009-05-28 12:02:41,239-0500 INFO Cpu 0528-591140-000003:1 pull
2009-05-28 12:02:42,252-0500 INFO Cpu 0528-591140-000003:1 pullLater
2009-05-28 12:02:42,253-0500 INFO Cpu 0528-591140-000003:0 pull
2009-05-28 12:02:43,262-0500 INFO Cpu 0528-591140-000003:0 pullLater
2009-05-28 12:02:43,263-0500 INFO Cpu 0528-591140-000002:0 pull
2009-05-28 12:02:44,273-0500 INFO Cpu 0528-591140-000002:0 pullLater
2009-05-28 12:02:44,273-0500 INFO Cpu 0528-591140-000001:1 pull
2009-05-28 12:02:45,283-0500 INFO Cpu 0528-591140-000001:1 pullLater
2009-05-28 12:02:45,283-0500 INFO Cpu 0528-591140-000002:1 pull
2009-05-28 12:02:46,292-0500 INFO Cpu 0528-591140-000002:1 pullLater
2009-05-28 12:02:46,293-0500 INFO Cpu 0528-591140-000000:1 pull
2009-05-28 12:02:46,992-0500 INFO CoasterService Idle time: 0
2009-05-28 12:02:47,713 117968031 Exit code: 0

Within ~5 minutes of the workflow finishing, the first 4 allocations' status changed to "Unsched":

login3% showq | grep zzhang
745524 data zzhang Unsched 16 00:50:00 Thu May 28 12:11:31
745525 data zzhang Unsched 16 00:50:00 Thu May 28 12:11:33
745526 data zzhang Unsched 16 00:50:00 Thu May 28 12:11:38
745527 data zzhang Unsched 16 00:50:00 Thu May 28 12:11:42

Within ~2 minutes, those allocations were released.

zhao
From zhaozhang at uchicago.edu Thu May 28 17:11:25 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 28 May 2009 17:11:25 -0500
Subject: [Swift-devel] block allocation
In-Reply-To: <1243495302.31029.0.camel@localhost>
References: <1243198668.30428.12.camel@localhost> <4A1C3D5C.5080003@uchicago.edu> <1243370247.3794.4.camel@localhost> <4A1D897A.3050209@uchicago.edu> <1243452190.26671.18.camel@localhost> <4A1DB120.4090402@uchicago.edu> <1243460458.29393.0.camel@localhost> <4A1DB585.30501@uchicago.edu> <1243463404.30223.3.camel@localhost> <4A1DBFC9.7040500@uchicago.edu> <1243464210.30425.1.camel@localhost> <4A1DC85D.5060701@uchicago.edu> <1243495302.31029.0.camel@localhost>
Message-ID: <4A1F0C0D.8000207@uchicago.edu>

Hi, Mihael

Here comes another question, about the definition of "block allocation". From my understanding, "block allocation" means that we grab a bunch of resources and then run workflows on them without requesting the resources again. Say we request 1K compute nodes from a cluster; we run a dock6 workflow on them, and after that we run a blast workflow on the same allocation. Is my understanding correct?

In the current coaster design, coaster seems to deallocate the resources after a single workflow is done. So the next time I try to run another workflow, I have to run it on a new allocation instead of the one we already have in hand. Or are there options we could set in coaster so that we could have both approaches? Thanks.

best
zhao

Mihael Hategan wrote:
> You should check whether blocks are properly de-allocated when all the work is done.
From hategan at mcs.anl.gov Thu May 28 21:15:36 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 28 May 2009 21:15:36 -0500
Subject: [Swift-devel] block allocation
In-Reply-To: <4A1F0C0D.8000207@uchicago.edu>
References: <1243198668.30428.12.camel@localhost> <4A1C3D5C.5080003@uchicago.edu> <1243370247.3794.4.camel@localhost> <4A1D897A.3050209@uchicago.edu> <1243452190.26671.18.camel@localhost> <4A1DB120.4090402@uchicago.edu> <1243460458.29393.0.camel@localhost> <4A1DB585.30501@uchicago.edu> <1243463404.30223.3.camel@localhost> <4A1DBFC9.7040500@uchicago.edu> <1243464210.30425.1.camel@localhost> <4A1DC85D.5060701@uchicago.edu> <1243495302.31029.0.camel@localhost> <4A1F0C0D.8000207@uchicago.edu>
Message-ID: <1243563336.13446.7.camel@localhost>

You are correct in your assumptions. There is one coaster service instance per machine per swift JVM.

On Thu, 2009-05-28 at 17:11 -0500, Zhao Zhang wrote:
> In the current coaster design, coaster seems to deallocate the resources after a single workflow is done. So the next time I try to run another workflow, I have to run it on a new allocation instead of the one we already have in hand. Or are there options we could set in coaster so that we could have both approaches?
From yizhu at cs.uchicago.edu Fri May 29 02:13:22 2009
From: yizhu at cs.uchicago.edu (yizhu)
Date: Fri, 29 May 2009 02:13:22 -0500
Subject: [Swift-devel] Issues when running swift on fuse
Message-ID: <4A1F8B12.8050206@cs.uchicago.edu>

Hi,

I tried to run Swift in a directory mounted via fuse (s3fs), but got the following error (-debug output attached).

Since I execute swift in /mnt/s3, the output file should be written there, and I have rw permission on /mnt/s3. The job seems to run correctly (the output file hello.txt is there); there is just an issue with the log file.

Is there a way to turn off the logging, or any ideas to solve this problem?

Thank you very much!
-Yi Zhu

[torqueuser at ip-10-251-89-208 ~]$ cd /mnt/s3
[torqueuser at ip-10-251-89-208 s3]$ swift /home/torqueuser/swift-0.9/examples/swift/first.swift
log4j:ERROR Failed to flush writer,
java.io.IOException: File too large
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:260)
    at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java:336)
    at sun.nio.cs.StreamEncoder$CharsetSE.implFlushBuffer(StreamEncoder.java:404)
    at sun.nio.cs.StreamEncoder$CharsetSE.implFlush(StreamEncoder.java:408)
    at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:152)
    at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:213)
    at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:49)
    at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:306)
    at org.apache.log4j.WriterAppender.append(WriterAppender.java:150)
    at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:221)
    at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:57)
    at org.apache.log4j.Category.callAppenders(Category.java:187)
    at org.apache.log4j.Category.forcedLog(Category.java:372)
    at org.apache.log4j.Category.debug(Category.java:241)
    at org.griphyn.vdl.karajan.Loader.main(Loader.java:75)
Swift 0.9 swift-r2860 cog-r2388

RunID: 20090529-0204-55gjsnhc
Progress: uninitialized:1
Progress: Submitted:1
Progress: Active:1
Ex098
org.globus.cog.karajan.workflow.KarajanRuntimeException: Exception caught while writing to log file
    at org.globus.cog.karajan.workflow.nodes.restartLog.LogVargOperator.update(LogVargOperator.java:40)
    at org.globus.cog.karajan.workflow.nodes.functions.VariableArgumentsOperator.append(VariableArgumentsOperator.java:38)
    at org.globus.cog.karajan.workflow.nodes.functions.VariableArgumentsOperator.appendAll(VariableArgumentsOperator.java:44)
    at org.globus.cog.karajan.workflow.nodes.functions.VariableArgumentsOperator.merge(VariableArgumentsOperator.java:34)
    at org.globus.cog.karajan.workflow.nodes.SequentialChoice.commitBuffers(SequentialChoice.java:52)
    at org.globus.cog.karajan.workflow.nodes.SequentialChoice.childCompleted(SequentialChoice.java:41)
    at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
    at org.globus.cog.karajan.workflow.nodes.SequentialChoice.notificationEvent(SequentialChoice.java:66)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
    at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
    at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
    at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
    at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
    at org.globus.cog.karajan.workflow.nodes.Sequential.childCompleted(Sequential.java:45)
    at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
    at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
    at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
    at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
    at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46)
    at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
    at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
    at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
    at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296)
    at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:37)
    at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
    at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
    at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
    at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
    at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
    at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
Caused by: java.io.IOException: File too large
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:260)
    at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java:336)
    at sun.nio.cs.StreamEncoder$CharsetSE.implFlushBuffer(StreamEncoder.java:404)
    at sun.nio.cs.StreamEncoder$CharsetSE.implFlush(StreamEncoder.java:408)
    at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:152)
    at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:213)
    at org.globus.cog.karajan.workflow.nodes.restartLog.FlushableLockedFileWriter.flush(FlushableLockedFileWriter.java:39)
    at org.globus.cog.karajan.workflow.nodes.restartLog.LogVargOperator.update(LogVargOperator.java:37)
    ... 40 more
Execution failed:
    Exception caught while writing to log file
Caused by:
    File too large
[torqueuser at ip-10-251-89-208 s3]$

-------------- next part --------------
A non-text attachment was scrubbed...
Name: error.log
Type: text/x-log
Size: 35601 bytes
Desc: not available
URL: 

From hategan at mcs.anl.gov Fri May 29 04:02:54 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 29 May 2009 04:02:54 -0500
Subject: [Swift-devel] Issues when running swift on fuse
In-Reply-To: <4A1F8B12.8050206@cs.uchicago.edu>
References: <4A1F8B12.8050206@cs.uchicago.edu>
Message-ID: <1243587774.15582.7.camel@localhost>

That's strange. You are getting multiple errors, both for the log file and the restart log. But the restart log should be small (and for first.swift, so should be the execution log). So I suspect the error message is misleading.

Can you provide more details? Such as:
- the actual size of the files in question at the time the error(s) occurred
- the motivation behind having the swift working directory (as opposed to just the data) on S3
- any relevant quotas or limitations of the S3 fs (output of 'quota' if applicable and 'df').
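For reference, a sketch of standard commands that would gather those details, assuming the /mnt/s3 mount from the transcript above:

ls -l /mnt/s3      # actual sizes of the .log and .rlog files
df -h /mnt/s3      # capacity and usage of the s3fs mount
quota -s           # per-user quotas, if any are enforced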
On Fri, 2009-05-29 at 02:13 -0500, yizhu wrote:
> I tried to run Swift in a directory mounted via fuse (s3fs), but got the following error (-debug output attached).
> [...]
> Is there a way to turn off the logging, or any ideas to solve this problem?
From yizhu at cs.uchicago.edu Fri May 29 13:09:13 2009
From: yizhu at cs.uchicago.edu (yizhu)
Date: Fri, 29 May 2009 13:09:13 -0500
Subject: [Swift-devel] Issues when running swift on fuse
In-Reply-To: <1243587774.15582.7.camel@localhost>
References: <4A1F8B12.8050206@cs.uchicago.edu> <1243587774.15582.7.camel@localhost>
Message-ID: <4A2024C9.6010705@cs.uchicago.edu>

Mihael Hategan wrote:
> That's strange. You are getting multiple errors, both for the log file and the restart log. But the restart log should be small (and for first.swift, so should be the execution log). So I suspect the error message is misleading.

Yes, I think you are right. I noticed that swift is trying to generate a log file with a size of more than 18446744073 GB, which causes the "file too large" problem.

> Can you provide more details? Such as:
> - the actual size of the files in question at the time the error(s) occurred

A normal execution of the first.swift code generates the following files:

[torqueuser at ip-10-251-89-208 test]$ pwd
/mnt/ebs/test
[torqueuser at ip-10-251-89-208 test]$ ls -l
-rw-rw-r-- 1 torqueuser torqueuser 25141 May 29 12:33 first-20090529-1233-nj1nazbf.log
-rw-rw-r-- 1 torqueuser torqueuser 25406 May 29 12:34 first-20090529-1234-bgzd13z4.log
-rw-rw-r-- 1 torqueuser torqueuser 14 May 29 12:34 hello.txt
-rw-rw-r-- 1 torqueuser torqueuser 114 May 29 12:34 swift.log

but when I try to run it on s3 (mounted via s3fs), the error occurs and the following files are generated:

[torqueuser at ip-10-251-89-208 abc]$ pwd
/mnt/s3/abc
[torqueuser at ip-10-251-89-208 abc]$ ls -l
total 2
-rw-rw-r-- 1 torqueuser torqueuser 18446744073709551615 May 29 12:35 first-20090529-1235-xjyfu9r4.0.rlog
-rw-rw-r-- 1 torqueuser torqueuser 18446744073709551615 May 29 12:35 first-20090529-1235-xjyfu9r4.log
-rw-rw-r-- 1 torqueuser torqueuser 14 May 29 12:35 hello.txt
-rw-rw-r-- 1 torqueuser torqueuser 18446744073709551615 May 29 12:35 swift.log
[torqueuser at ip-10-251-89-208 abc]$

> - the motivation behind having the swift working directory (as opposed to just the data) on S3
> - any relevant quotas or limitations of the S3 fs (output of 'quota' if applicable and 'df').

'quota' produces no output; 'df' gives (/mnt/s3 is the s3 directory):

[torqueuser at ip-10-251-89-208 s3]$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 4128448 2744188 1174548 71% /
none 873880 0 873880 0% /dev/shm
/dev/sda2 153899044 1999124 144082296 2% /mnt
/dev/sdf 10321208 290468 9506452 3% /mnt/ebs
fuse 274877906944 0 274877906944 0% /mnt/s3

Normally we can create a single file of around 1 GB there with no problem; it seems something goes wrong when swift tries to generate its log on s3fs.
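For what it's worth, 18446744073709551615 bytes is exactly 2**64 - 1, i.e. a size of -1 reported through an unsigned 64-bit field, which suggests the fuse layer is returning an error value as the file size rather than Swift actually writing that much. A quick check:

python -c 'print 2**64 - 1'
18446744073709551615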
(I am using swift-0.9.)

I then tried to run swift locally but with the output file written to the s3 directory, and that works correctly: the log file is stored in the local directory where I run swift, and the output file is in the s3 directory.
From yizhu at cs.uchicago.edu Fri May 29 13:23:33 2009
From: yizhu at cs.uchicago.edu (yizhu)
Date: Fri, 29 May 2009 13:23:33 -0500
Subject: [Swift-devel] swift syntax problem (please ignore the previous one)
Message-ID: <4A202825.3000809@cs.uchicago.edu>

Hi,

In order to let swift write output to a directory other than the current running location, we can use simple_mapper with its 'location' parameter; but if I want to use regexp_mapper, there is no 'location' parameter available for regexp_mapper (http://www.ci.uchicago.edu/swift/guides/userguide.php#mapper.simple_mapper), so how can I sort it out?

For example: in this case, every file with prefix "data" and suffix ".txt" at location "/mnt/testdata/1024Kdir" will be mapped to the inputfiles[] array, and for each run of the loop the regexp_mapper will rewrite the ".txt" suffix; so how can I specify an output location other than the default/current one?

Many Thanks.

-YI

From hategan at mcs.anl.gov Fri May 29 16:30:00 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 29 May 2009 16:30:00 -0500
Subject: [Swift-devel] Issues when running swift on fuse
In-Reply-To: <4A2024C9.6010705@cs.uchicago.edu>
References: <4A1F8B12.8050206@cs.uchicago.edu> <1243587774.15582.7.camel@localhost> <4A2024C9.6010705@cs.uchicago.edu>
Message-ID: <1243632600.30923.3.camel@localhost>

On Fri, 2009-05-29 at 13:09 -0500, yizhu wrote:
> Mihael Hategan wrote:
> > That's strange. You are getting multiple errors, both for the log file and the restart log. But the restart log should be small (and for first.swift, so should be the execution log). So I suspect the error message is misleading.
> Yes, I think you are right. I noticed that swift is trying to generate a log file with a size of more than 18446744073 GB, which causes the "file too large" problem.

Swift doesn't generate empty logs of a pre-set size. And the size you mention is insanely large. So I think this is a bug in the s3 driver. I'm not sure why swift triggers it.

From benc at hawaga.org.uk Fri May 29 18:32:06 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 29 May 2009 23:32:06 +0000 (GMT)
Subject: [Swift-devel] Issues when running swift on fuse
In-Reply-To: <1243632600.30923.3.camel@localhost>
References: <4A1F8B12.8050206@cs.uchicago.edu> <1243587774.15582.7.camel@localhost> <4A2024C9.6010705@cs.uchicago.edu> <1243632600.30923.3.camel@localhost>
Message-ID: 

Probably an s3-backed filesystem is bad for storing run logs and restart logs; something more local is probably better.
--

From benc at hawaga.org.uk Fri May 29 18:41:58 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 29 May 2009 23:41:58 +0000 (GMT)
Subject: [Swift-devel] Re: [Swift-user] swift syntax problem (please ignore the previous one)
In-Reply-To: <4A202825.3000809@cs.uchicago.edu>
References: <4A202825.3000809@cs.uchicago.edu>
Message-ID: 

You can use some code like the below. This uses the simple mapper to generate a new filename for otherfile, based on substituting "foo" for "hello".

type messagefile;

app (messagefile t) greeting() {
    echo "Hello, world!" stdout=@filename(t);
}

messagefile outfile <"/tmp/hello.txt">;
messagefile otherfile <simple_mapper; prefix=@regexp(@filename(outfile), "hello", "foo")>;

otherfile = greeting();

From yizhu at uchicago.edu Fri May 29 13:19:27 2009
From: yizhu at uchicago.edu (Yi Zhu)
Date: Fri, 29 May 2009 13:19:27 -0500
Subject: [Swift-devel] swift syntax question
Message-ID: <4A20272F.9020203@uchicago.edu>

Hi,

In order to let swift write output to a directory other than the current running location, we can use simple_mapper with its 'location' parameter; but if I want to use regexp_mapper, there is no 'location' parameter available for regexp_mapper (http://www.ci.uchicago.edu/swift/guides/userguide.php#mapper.simple_mapper), so how can I sort it out?

For example: every file with prefix "data" and suffix ".txt" at location "/mnt/testdata/1024Kdir" will be mapped to the inputfiles[] array, and for each run of the loop the regexp_mapper should rewrite the ".txt" suffix; so how can I specify an output location other than the default/current one?

(catfile t) catwords (messagefile f) {
    app {
        cat @filename(f) stdout=@filename(t);
    }
}

messagefile inputfiles[] <filesys_mapper; location="/mnt/testdata/1024Kdir", prefix="data", suffix=".txt">;

foreach f in inputfiles {
    catfile c;
    c = catwords(f);
}

Many Thanks!

-Yi Zhu
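A self-contained sketch of the regexp_mapper variant being asked about, assuming the filesys_mapper parameters (location, prefix, suffix) and regexp_mapper parameters (source, match, transform) as documented in the user guide; the output directory /mnt/testout and the .out suffix here are made up for illustration. Because the transform string is an ordinary rewrite pattern, it can carry a directory prefix, which stands in for the 'location' parameter that regexp_mapper lacks:

type messagefile;
type catfile;

(catfile t) catwords (messagefile f) {
    app {
        cat @filename(f) stdout=@filename(t);
    }
}

// input files: data*.txt under /mnt/testdata/1024Kdir, as in the question above
messagefile inputfiles[] <filesys_mapper; location="/mnt/testdata/1024Kdir", prefix="data", suffix=".txt">;

foreach f in inputfiles {
    // hypothetical target directory embedded directly in the transform pattern
    catfile c <regexp_mapper; source=@f, match=".*/(data.*)\\.txt", transform="/mnt/testout/\\1.out">;
    c = catwords(f);
}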