From benc at hawaga.org.uk Thu Mar 1 04:49:25 2012
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 1 Mar 2012 11:49:25 +0100
Subject: [Swift-devel] visualize your code as it executes
In-Reply-To:
References:
Message-ID:

On Mar 1, 2012, at 5:19 AM, Ketan Maheshwari wrote:

> This is a nice page showing visualize as you run code:
>
> http://people.csail.mit.edu/pgbovine/python/tutor.html#mode=edit
>
> Relevant to the try Swift online venture.
>
> (from google+ python stream)

I've pondered about this before, but more from the perspective of visualising performance on large execution runs, which I think works differently from a "try some simple code here" visualisation.

One thing that is very different in Swift is all the parallelism. It doesn't make sense for a large run, but for a small run where there are (for example) not too many branches in a foreach loop, then using that style of interface to graph the progression of a DAG might be interesting. There was some DAG generation stuff long ago - I have no idea what state it is in now. It started to run into trouble when there were non-trivial data structures.

For stepping through code, you can (I think) pretty much always safely run swift code as a single thread, which would be much more amenable to single stepping like in that example. But I don't really know how harmful that would be to understanding the parallelness, compared to the benefit of being able to see what's going on step by step.

--

From wilde at mcs.anl.gov Thu Mar 1 11:14:24 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 1 Mar 2012 11:14:24 -0600 (CST)
Subject: [Swift-devel] camelCase issues
In-Reply-To:
Message-ID: <431948357.50048.1330622064238.JavaMail.root@zimbra.anl.gov>

I got this from a user who was trying the ParameterSweep example:

----- Forwarded Message -----
To run on my MacBook, I had to change @toInt to @toint in your .swift file.
-----

I wrote the code using toInt because I tested on trunk, and wanted to avoid the deprecation message. But I see now that I should have coded to 0.93 standards.

So 2 questions:

- are all built-in functions now accepted in a case-insensitive manner, or just specific ones, in specific cases (like toint and toInt)?

- can we remove the case-related deprecation messages from trunk until we sort out where we are heading on this issue? (In the last thread Mihael suggested we decide for 0.94, which I think is good to do).

Mike

From jonmon at mcs.anl.gov Thu Mar 1 19:16:47 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 1 Mar 2012 19:16:47 -0600
Subject: [Swift-devel] coasters-hosts.pl script
Message-ID: <520EEDBE-3C38-44DD-B9C1-97AC8E1BDF19@mcs.anl.gov>

Justin,

So I have been trying to help Emalayan get the host list file for the worker-init.pl script. It seems the cps log file is not providing the IP addresses for the coasters-hosts.pl script. I thought this was maybe because we did not have the correct log4j setting set, but we have the Coaster service Cpu set to DEBUG. So for some reason the workers are not connecting to the service. When I comment out the export WORKER_ENVIRONMENT="?" line in the coaster-service.conf file I see the workers connect and the cps log file shows their IP addresses. However, when setting this line it seems they are not connecting.

Emalayan thought there might be some sort of circular dependency going on with the host-list file and the worker. The worker requires the host-list file so that it can run the worker-init.pl script and then connect, but the host-list file cannot be generated because the workers cannot connect.

I noticed in your swift-test directory the cps files did have the IP addresses set and coasters-hosts.pl found the IP addresses and reported them. Did you try that test with setting the WORKER_ENVIRONMENT variable in the coaster-service.conf file? Any idea what may be happening?
The job is running when looking under cqstat.

A side note: At the mosaswift site, your example talks about running the coasters-hosts.pl on the cps log but the example you provide runs it on logs/coasters.log. This may need to be changed. Also, it should provide the log4j setting that is required to generate the Cpu line with the worker IP address, just to clarify that this line should be set for this script to work. For reference, this line:

log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG

From iraicu at cs.iit.edu Thu Mar 1 20:02:15 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Thu, 01 Mar 2012 20:02:15 -0600
Subject: [Swift-devel] Call for Papers: eScience 2012 and associated workshops and tutorial
Message-ID: <4F502A27.6010805@cs.iit.edu>

*Call for Papers: eScience 2012 and associated workshops*

In addition to the eScience conference itself, described below, there are six associated workshops and one tutorial (http://www.ci.uchicago.edu/escience2012/workshops.php):

* Extending High-Performance Computing Beyond its Traditional User Communities, http://www.psc.edu/events/escience-2012-workshop/
* 2nd International Workshop on Analyzing and Improving Collaborative eScience with Social Networks (eSoN 12), http://www.ci.uchicago.edu/eson2012/
* Advances in eHealth, http://www.scalalife.eu/content/advances-ehealth-2012-workshop
* Maintainable Software Practices in e-Science, http://software.ac.uk/maintainable-software-practice-workshop
* eScience Meets the Instrument, https://confluence-vre.its.monash.edu.au/display/escience2012/eScience+Meets+the+Instrument
* Collaborative research using eScience infrastructure and high speed networks, http://www.surfnet.nl/en/Hybride_netwerk/SURFlichtpaden/Pages/CollaborativeresearchusingeScienceinfrastructureandhighspeednetworks.aspx
* Tutorial: Big Data Processing: Lessons from Industry and Applications in Science, http://www.ci.uchicago.edu/escience2012/tutorial.php

CALL FOR PAPERS

8th IEEE International Conference on eScience
http://www.ci.uchicago.edu/escience2012/
October 8-12, 2012
Chicago, IL, USA

Researchers in all disciplines are increasingly adopting digital tools, techniques and practices, often in communities and projects that span disciplines, laboratories, organizations, and national boundaries. The eScience 2012 conference is designed to bring together leading international and interdisciplinary research communities, developers, and users of eScience applications and enabling IT technologies. The conference serves as a forum to present the results of the latest applications research and product/tool developments and to highlight related activities from around the world. Also, we are now entering the second decade of eScience and the 2012 conference gives an opportunity to take stock of what has been achieved so far and look forward to the challenges and opportunities the next decade will bring.

A special emphasis of the 2012 conference is on advances in the application of technology in a particular discipline. Accordingly, significant advances in applications science and technology will be considered as important as the development of new technologies themselves. Further, we welcome contributions in educational activities under any of these disciplines.

As a result, the conference will be structured around two e-Science tracks:

* eScience Algorithms and Applications
  * eScience application areas, including:
    * Physical sciences
    * Biomedical sciences
    * Social sciences and humanities
  * Data-oriented approaches and applications
  * Compute-oriented approaches and applications
  * Extreme scale approaches and applications
* Cyberinfrastructure to support eScience
  * Novel hardware
  * Novel uses of production infrastructure
  * Software and services
  * Tools

The conference proceedings will be published by the IEEE Computer Society Press, USA and will be made available online through the IEEE Digital Library.
Selected papers will be invited to submit extended versions to a special issue of the Future Generation Computer Systems (FGCS) journal.

SUBMISSION PROCESS

Authors are invited to submit papers with unpublished, original work of not more than 8 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per IEEE 8.5 x 11 manuscript guidelines. (Up to 2 additional pages may be purchased for US$150/page.) Templates are available from http://www.ieee.org/conferences_events/conferences/publishing/templates.html.

Authors should submit a PDF file that will print on a PostScript printer to https://www.easychair.org/conferences/?conf=escience2012 (Note that paper submitters also must submit an abstract in advance of the paper deadline. This should be done through the same site where papers are submitted.)

It is a requirement that at least one author of each accepted paper attend the conference.

IMPORTANT DATES

Abstract submission (required): 4 July 2012
Paper submission: 11 July 2012
Paper author notification: 22 August 2012
Camera-ready papers due: 10 September 2012
Conference: 8-12 October 2012

CONFERENCE ORGANIZATION

General Chair
* Ian Foster, University of Chicago & Argonne National Laboratory, USA

Program Co-Chairs
* Daniel S. Katz, University of Chicago & Argonne National Laboratory, USA
* Heinz Stockinger, SIB Swiss Institute of Bioinformatics, Switzerland

Program Vice Co-Chairs
* eScience Algorithms and Applications Track
  * David Abramson, Monash University, Australia
  * Gabrielle Allen, Louisiana State University, USA
* Cyberinfrastructure to support eScience Track
  * Rosa M. Badia, Barcelona Supercomputing Center / CSIC, Spain
  * Geoffrey Fox, Indiana University, USA

Early Results and Works-in-Progress Posters Chair
* Roger Barga, Microsoft, USA

Workshops Chair
* Ruth Pordes, FNAL, USA

Sponsorship Chair
* Charlie Catlett, Argonne National Laboratory, USA

Conference Manager and Finance Chair
* Julie Wulf-Knoerzer, University of Chicago & Argonne National Laboratory, USA

Publicity Chairs
* Kento Aida, National Institute of Informatics, Japan
* Ioan Raicu, Illinois Institute of Technology, USA
* David Wallom, Oxford e-Research Centre, UK

Local Organizing Committee
* Ninfa Mayorga, University of Chicago, USA
* Evelyn Rayburn, University of Chicago, USA
* Lynn Valentini, Argonne National Laboratory, USA

Program Committee
* eScience Algorithms and Applications Track
  * Srinivas Aluru, Iowa State University, USA
  * Ashiq Anjum, University of Derby, UK
  * David A. Bader, Georgia Institute of Technology, USA
  * Jon Blower, University of Reading, UK
  * Paul Bonnington, Monash University, Australia
  * Simon Cox, University of Southampton, UK
  * David De Roure, Oxford e-Research Centre, UK
  * George Djorgovski, California Institute of Technology, USA
  * Anshu Dubey, University of Chicago & Argonne National Laboratory, USA
  * Yuri Estrin, Monash University, Australia
  * Dan Fay, Microsoft, USA
  * Jeremy Frey, University of Southampton, UK
  * Wolfgang Gentzsch, HPC Consultant, Germany
  * Lutz Gross, The University of Queensland, Australia
  * Sverker Holmgren, Uppsala University, Sweden
  * Bill Howe, University of Washington, USA
  * Marina Jirotka, University of Oxford, UK
  * Timoleon Kipouros, University of Cambridge, UK
  * Kerstin Kleese van Dam, Pacific Northwest National Laboratory, USA
  * Arun S. Konagurthu, Monash University, Australia
  * Peter Kunszt, SystemsX.ch, Switzerland
  * Alexey Lastovetsky, University College Dublin, Ireland
  * Andrew Lewis, Griffith University, Australia
  * Sergio Maffioletti, University of Zurich, Switzerland
  * Amitava Majumdar, San Diego Supercomputer Center, University of California at San Diego, USA
  * Rui Mao, Shenzhen University, China
  * Madhav V. Marathe, Virginia Tech, USA
  * Maryann Martone, University of California at San Diego, USA
  * Louis Moresi, Monash University, Australia
  * Riccardo Murri, University of Zurich, Switzerland
  * Silvia D. Olabarriaga, Academic Medical Center of the University of Amsterdam, Netherlands
  * Enrique S. Quintana-Ortí, Universidad Jaume I, Spain
  * Abani Patra, University at Buffalo, USA
  * Rob Pennington, NSF, USA
  * Andrew Perry, Monash University, Australia
  * Beth Plale, Indiana University, USA
  * Michael Resch, University of Stuttgart, Germany
  * Adrian Sandu, Virginia Tech, USA
  * Mark Savill, Cranfield University, UK
  * Erik Schnetter, Perimeter Institute for Theoretical Physics, Canada
  * Edward Seidel, Louisiana State University, USA
  * Suzanne M. Shontz, The Pennsylvania State University, USA
  * David Skinner, Lawrence Berkeley National Laboratory, USA
  * Alan Sussman, University of Maryland, USA
  * Alex Szalay, Johns Hopkins University, USA
  * Domenico Talia, ICAR-CNR & University of Calabria, Italy
  * Jian Tao, Louisiana State University, USA
  * David Wallom, Oxford e-Research Centre, UK
  * Shaowen Wang, University of Illinois at Urbana-Champaign, USA
  * Michael Wilde, Argonne National Laboratory & University of Chicago, USA
  * Nancy Wilkins-Diehr, San Diego Supercomputer Center, University of California at San Diego, USA
  * Wu Zhang, Shanghai University, China
  * Yunquan Zhang, Chinese Academy of Sciences, China
* Cyberinfrastructure to support eScience Track
  * Deb Agarwal, Lawrence Berkeley National Laboratory, USA
  * Ilkay Altintas, San Diego Supercomputer Center, University of California at San Diego, USA
  * Henri Bal, Vrije Universiteit, Netherlands
  * Roger Barga, Microsoft, USA
  * Martin Berzins, University of Utah, USA
  * John Brooke, University of Manchester, UK
  * Thomas Fahringer, University of Innsbruck, Austria
  * Gilles Fedak, INRIA, France
  * José A. B. Fortes, University of Florida, USA
  * Yolanda Gil, ISI/USC, USA
  * Madhusudhan Govindaraju, SUNY Binghamton, USA
  * Thomas Hacker, Purdue University, USA
  * Ken Hawick, Massey University, New Zealand
  * Marty Humphrey, University of Virginia, USA
  * Hai Jin, Huazhong University of Science and Technology, China
  * Thilo Kielmann, Vrije Universiteit, Netherlands
  * Scott Klasky, Oak Ridge National Laboratory, USA
  * Isao Kojima, AIST, Japan
  * Tevfik Kosar, University at Buffalo, USA
  * Dieter Kranzlmueller, LMU & LRZ Munich, Germany
  * Erwin Laure, KTH, Sweden
  * Jysoo Lee, KISTI, Korea
  * Li Xiaoming, Peking University, China
  * Bertram Ludäscher, University of California, Davis, USA
  * Andrew Lumsdaine, Indiana University, USA
  * Tanu Malik, University of Chicago, USA
  * Satoshi Matsuoka, Tokyo Institute of Technology, Japan
  * Reagan Moore, University of North Carolina at Chapel Hill, USA
  * Shirley Moore, University of Kentucky, USA
  * Steven Newhouse, EGI, Netherlands
  * Dhabaleswar K. (DK) Panda, The Ohio State University, USA
  * Manish Parashar, Rutgers University, USA
  * Ron Perrott, University of Oxford, UK
  * Depei Qian, Beihang University, China
  * Judy Qiu, Indiana University, USA
  * Ioan Raicu, Illinois Institute of Technology, USA
  * Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory, USA
  * Omer Rana, Cardiff University, UK
  * Paul Roe, Queensland University of Technology, Australia
  * Bruno Schulze, LNCC, Brazil
  * Marc Snir, Argonne National Laboratory & University of Illinois at Urbana-Champaign, USA
  * Xian-He Sun, Illinois Institute of Technology, USA
  * Yoshio Tanaka, AIST, Japan
  * Michela Taufer, University of Delaware, USA
  * Kerry Taylor, CSIRO, Australia
  * Douglas Thain, University of Notre Dame, USA
  * Paul Watson, Newcastle University, UK
  * Jian Zhang, Northern Illinois University, USA
  * Jun Zhao, University of Oxford, UK

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================
=================================================================

From davidk at ci.uchicago.edu Fri Mar 2 02:05:36 2012
From: davidk at ci.uchicago.edu (David Kelly)
Date: Fri, 2 Mar 2012 02:05:36 -0600 (CST)
Subject: [Swift-devel] Question about retry behavior
In-Reply-To: <2056534556.145019.1330669441101.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1101089446.145100.1330675536970.JavaMail.root@zimbra-mb2.anl.gov>

Consider the case of one John Q. Swifterson.

Mr. Swifterson is working late one night performing science. He has written a very important program to simulate the effects of cocaine on honeybee dance behavior.

John is using persistent coasters and has 100 nodes available. Each node performs only 1 task at a time. In an abundance of caution, he sets execution.retries=50.

John then submits 100,000 jobs. 99 jobs start immediately and start working as expected. But 1 job fails due to a corrupted binary on that node. What should happen next?

The swift user guide says this:
---
If an application procedure execution fails, Swift will attempt that execution again repeatedly until it succeeds, up until the limit defined in the execution.retries configuration property.

Site selection will occur for retried jobs in the same way that it happens for new jobs. Retried jobs may run on the same site or may run on a different site.

If the retry limit execution.retries is reached for an application procedure, then that application procedure will fail. This will cause the entire run to fail - either immediately (if the lazy.errors property is false) or after all other possible work has been attempted (if the lazy.errors property is true).
---

Since 99 of the 100 nodes are in use, all 50 retries will occur on the same problematic node. This causes the entire run to fail. Is this correct? Is there any way to change this behavior?

One possibility is to set a job throttle lower than the number of sites actually available. That might increase the chances of success a bit.

Is there any way to force retries to happen on a different node? And also, optionally, to disconnect nodes which experience high failure rates?

Thanks,
David

From benc at hawaga.org.uk Fri Mar 2 02:30:26 2012
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 2 Mar 2012 09:30:26 +0100
Subject: [Swift-devel] Question about retry behavior
In-Reply-To: <1101089446.145100.1330675536970.JavaMail.root@zimbra-mb2.anl.gov>
References: <1101089446.145100.1330675536970.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID:

The below was a problem with grid sites that "failed fast" on OSG; but there, there was/is a site scoring mechanism to try to slow down submissions to that site. Plus ça change, plus c'est la même chose.

On Mar 2, 2012, at 9:05 AM, David Kelly wrote:

> Consider the case of one John Q. Swifterson.
>
> Mr. Swifterson is working late one night performing science. He has written a very important program to simulate the effects of cocaine on honeybee dance behavior.
>
> John is using persistent coasters and has 100 nodes available. Each node performs only 1 task at a time. In an abundance of caution, he sets execution.retries=50.
>
> John then submits 100,000 jobs. 99 jobs start immediately and start working as expected. But 1 job fails due to a corrupted binary on that node. What should happen next?
>
> The swift user guide says this:
> ---
> If an application procedure execution fails, Swift will attempt that execution again repeatedly until it succeeds, up until the limit defined in the execution.retries configuration property.
>
> Site selection will occur for retried jobs in the same way that it happens for new jobs. Retried jobs may run on the same site or may run on a different site.
>
> If the retry limit execution.retries is reached for an application procedure, then that application procedure will fail. This will cause the entire run to fail - either immediately (if the lazy.errors property is false) or after all other possible work has been attempted (if the lazy.errors property is true).
> ---
>
> Since 99 of the 100 nodes are in use, all 50 retries will occur on the same problematic node. This causes the entire run to fail. Is this correct? Is there any way to change this behavior?
>
> One possibility is to set a job throttle lower than the number of sites actually available. That might increase the chances of success a bit.
>
> Is there any way to force retries to happen on a different node? And also, optionally, to disconnect nodes which experience high failure rates?
>
> Thanks,
> David

From wilde at mcs.anl.gov Fri Mar 2 09:37:14 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 2 Mar 2012 09:37:14 -0600 (CST)
Subject: [Swift-devel] Question about retry behavior
In-Reply-To:
Message-ID: <817973900.53911.1330702634980.JavaMail.root@zimbra.anl.gov>

I think the problem here is that in David's case there is only one site, "OSG" (using the Glidein workload mgmt system GWMS), so he has no control of where his coaster workers start.

Jobs are failing because he has not yet told Condor to avoid launching workers on sites where his app is not correctly installed.
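
David's scenario can be reproduced with a small simulation. The sketch below is plain Python, not Swift or coaster code; the function name `run_with_retries`, the round-based scheduling model, and the `max_node_failures` threshold are hypothetical and exist only for illustration. It also assumes that execution.retries=50 allows one initial attempt plus 50 retries, which the thread does not confirm. Without any per-node failure tracking, every attempt lands on the one idle (bad) node; with a simple quarantine threshold of the kind David asks about, the job waits for a healthy node instead.

```python
def run_with_retries(free_nodes_by_round, bad_nodes, retries, max_node_failures=None):
    """Simulate one job being retried across scheduling rounds.

    free_nodes_by_round: list of lists; the nodes idle in each round (assumed model).
    bad_nodes: set of nodes on which the job always fails.
    retries: retry limit; total attempts allowed = retries + 1 (assumption).
    max_node_failures: if set, a node with this many failures is skipped
        (hypothetical quarantine rule, not an existing Swift setting).
    Returns (succeeded, attempts_used).
    """
    failures = {}            # node -> number of failed attempts seen on it
    attempts = 0
    max_attempts = retries + 1
    for free_nodes in free_nodes_by_round:
        if attempts >= max_attempts:
            break
        # Drop quarantined nodes from the candidate list.
        usable = [n for n in free_nodes
                  if max_node_failures is None
                  or failures.get(n, 0) < max_node_failures]
        if not usable:
            continue         # nothing usable is idle; wait for the next round
        node = usable[0]     # naive choice: first idle node
        attempts += 1
        if node not in bad_nodes:
            return True, attempts
        failures[node] = failures.get(node, 0) + 1
    return False, attempts


BAD = 99
# For 60 rounds only the fast-failing node is idle; then healthy node 7 frees up.
rounds = [[BAD]] * 60 + [[BAD, 7]]

print(run_with_retries(rounds, {BAD}, retries=50))                       # (False, 51)
print(run_with_retries(rounds, {BAD}, retries=50, max_node_failures=3))  # (True, 4)
```

With the threshold set, the bad node is skipped after three failures and the job succeeds on its fourth attempt once node 7 becomes idle. How such a threshold should be chosen is not something the thread settles.
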

If this were a more general case where apps fail on specific nodes, we'd want to try to both prevent Condor from running workers on that node, and prevent the coaster worker from taking jobs for that node. In one mode we could train the worker to "freeze" on the node after any job on the node fails in a certain way. That way we'd take the node out of service and stop it from failing future jobs.

I think for now we have simple workarounds for this kind of problem but moving forward we should look at increasingly more robust solutions.

- Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From iraicu at cs.iit.edu Fri Mar 2 09:56:52 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Fri, 2 Mar 2012 09:56:52 -0600
Subject: [Swift-devel] Question about retry behavior
In-Reply-To: <817973900.53911.1330702634980.JavaMail.root@zimbra.anl.gov>
References: <817973900.53911.1330702634980.JavaMail.root@zimbra.anl.gov>
Message-ID: <3F89B2AC-BFE6-4B16-9F9B-C8E0722BDAF9@cs.iit.edu>

In Falkon, we used to keep track of failures at the workers; after too many repeated failures with certain exit codes or keywords in the output stream, we would suspend the faulty worker for some period of time. This worked great for intermittent shared file system problems due to load, as backing off for some time usually fixed the problem. For other things, such as apps not installed or missing data, this only slowed down the failure rate, but at least it was controllable based on the way we configured the worker logic.

Ioan

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor
=================================================================
Computer Science Department
Illinois Institute of Technology
10 W. 31st Street
Chicago, IL 60616
=================================================================
Cel: 1-847-722-0876
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
=================================================================
=================================================================

From iraicu at cs.iit.edu Fri Mar 2 09:59:11 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Fri, 2 Mar 2012 09:59:11 -0600
Subject: [Swift-devel] Question about retry behavior
In-Reply-To: <817973900.53911.1330702634980.JavaMail.root@zimbra.anl.gov>
References: <817973900.53911.1330702634980.JavaMail.root@zimbra.anl.gov>
Message-ID:

And BTW, the logic was part of the worker, and each worker was making a separate independent decision. I think a central place to have this logic might also be useful, and perhaps might perform better, as it might be possible to differentiate between failures due to a specific node and system-wide failures that would cause some job to fail no matter where it was submitted.

Ioan

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor ================================================================= Computer Science Department Illinois Institute of Technology 10 W. 31st Street Chicago, IL 60616 ================================================================= Cel: 1-847-722-0876 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= On Mar 2, 2012, at 9:37 AM, Michael Wilde wrote: > I think the problem here is that in David's case there is only one site, "OSG", (using the Glidein workload mgmt system GWMS), so he has no control of where his coaster workers start. > > Jobs are failing because he has not yet told Condor to avoid launching workers on sites where his app is not correctly installed. > > If this were a more general case where apps fail on specific nodes, we'd want to try to both prevent Condor from running workers on that node, and prevent the coaster worker form taking jobs for that node. In one mode we could train the worker to "freeze" on the node after any job on the node fails in a certain way. That way we'd take the node out of service and stop it form failing future jobs. > > I think for now we have simple workarounds for this kind of problem but moving forward we should look at increasingly more robust solutions. > > - Mike > > > ----- Original Message ----- >> From: "Ben Clifford" >> To: "David Kelly" >> Cc: "Swift Devel" >> Sent: Friday, March 2, 2012 2:30:26 AM >> Subject: Re: [Swift-devel] Question about retry behavior >> The below was a problem with grid sites that "failed fast" on OSG; but >> there, there was/is a site scoring mechanism to try to slow down >> submissions to that site. Plus ?a change, plus c'est la m?me chose. >> >> On Mar 2, 2012, at 9:05 AM, David Kelly wrote: >> >>> >>> Consider the case of one John Q. Swifterson. >>> >>> Mr. Swifterson is working late one night performing science. 
He has
>>> written a very important program to simulate the effects of cocaine
>>> on honeybee dance behavior.
>>>
>>> John is using persistent coasters and has 100 nodes available. Each
>>> node performs only 1 task at a time. In an abundance of caution, he
>>> sets execution.retries=50.
>>>
>>> John then submits 100,000 jobs. 99 jobs start immediately and start
>>> working as expected. But 1 job fails due to a corrupted binary on
>>> that node. What should happen next?
>>>
>>> The swift user guide says this:
>>> ---
>>> If an application procedure execution fails, Swift will attempt that
>>> execution again repeatedly until it succeeds, up until the limit
>>> defined in the execution.retries configuration property.
>>>
>>> Site selection will occur for retried jobs in the same way that it
>>> happens for new jobs. Retried jobs may run on the same site or may
>>> run on a different site.
>>>
>>> If the retry limit execution.retries is reached for an application
>>> procedure, then that application procedure will fail. This will
>>> cause the entire run to fail - either immediately (if the
>>> lazy.errors property is false) or after all other possible work has
>>> been attempted (if the lazy.errors property is true).
>>> ---
>>>
>>> Since 99/100 nodes are in use, all 50 retries will occur on the
>>> same problematic node. This causes the entire run to fail. Is this
>>> correct? Is there any way to change this behavior?
>>>
>>> One possibility is to set a job throttle lower than the number of
>>> sites actually available. That might increase the chances of success
>>> a bit.
>>>
>>> Is there any way to force retries to happen on a different node? And
>>> also, optionally, to disconnect nodes which experience high failure
>>> rates?
>>>
>>> Thanks,
>>> David
>>
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

From wilde at mcs.anl.gov Fri Mar 2 10:10:15 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 2 Mar 2012 10:10:15 -0600 (CST)
Subject: [Swift-devel] Question about retry behavior
In-Reply-To:
Message-ID: <1049547380.54126.1330704615353.JavaMail.root@zimbra.anl.gov>

Good points, Ioan - I'd forgotten about that part of the Falkon work.

Seems like per-worker fault analysis is a good thing, but that higher-level analysis and actions are also needed. Maybe per-worker and per-site analysis and down-ability.

- Mike

----- Original Message -----
> From: "Ioan Raicu"
> To: "Michael Wilde"
> Cc: "Ben Clifford" , "Swift Devel"
> Sent: Friday, March 2, 2012 9:59:11 AM
> Subject: Re: [Swift-devel] Question about retry behavior
> And BTW, the logic was part of the worker, and each worker was making a
> separate, independent decision. I think a central place to have this
> logic might also be useful, and perhaps might perform better, as it
> might be possible to differentiate between failures due to a specific
> node and system-wide failures that would cause some job to fail no
> matter where it was submitted.
>
> Ioan
>
> --
> =================================================================
> Ioan Raicu, Ph.D.

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From jon.monette at gmail.com Fri Mar 2 10:27:01 2012
From: jon.monette at gmail.com (Jonathan Monette)
Date: Fri, 2 Mar 2012 10:27:01 -0600
Subject: [Swift-devel] coasters-hosts.pl script
In-Reply-To: <520EEDBE-3C38-44DD-B9C1-97AC8E1BDF19@mcs.anl.gov>
References: <520EEDBE-3C38-44DD-B9C1-97AC8E1BDF19@mcs.anl.gov>
Message-ID: <19941054-371A-48E0-9EF3-5565E4972415@gmail.com>

An update: I noticed that the worker logs were not displaying the "connected" message. This led me to believe that the circular dependency issue described in the previous email was the cause. So I modified worker.pl to connect to the service before trying to run a WORKER_INIT_CMD script. This caused the connected message to appear in the worker logs; however, the lines necessary for the coasters-hosts.pl script to work are still not present in the cps log. This still confuses me. Justin, how were you able to get those DEBUG Cpu lines to appear in the cps log?

On Mar 1, 2012, at 7:16 PM, Jonathan Monette wrote:

> Justin,
> So I have been trying to help Emalayan get the host list file for the worker-init.pl script. It seems the cps log file is not providing the IP addresses for the coasters-hosts.pl script. I thought this was maybe because we did not have the correct log4j setting, but we have the Coaster service Cpu logger set to DEBUG. So for some reason the workers are not connecting to the service. When I comment out the export WORKER_ENVIRONMENT="?" line in the coaster-service.conf file I see the workers connect and the cps log file shows their IP addresses. However, when setting this line it seems they are not connecting.
>
> Emalayan thought there might be some sort of circular dependency going on with the host-list file and the worker. The worker requires the host-list file so that it can run the worker-init.pl script and then connect, but the host-list file cannot be generated because the workers cannot connect. I noticed in your swift-test directory the cps files did have the IP addresses set and coasters-hosts.pl found the IP addresses and reported them. Did you try that test with the WORKER_ENVIRONMENT variable set in the coaster-service.conf file? Any idea what may be happening? The job is running when looking under cqstat.
>
> A side note: On the mosaswift site, your example talks about running coasters-hosts.pl on the cps log, but the example you provide runs it on logs/coasters.log. This may need to be changed. Also, the site should provide the log4j setting that is required to generate the Cpu line with the worker IP address, just to clarify that this line should be set for this script to work.
>
> For reference, this line: log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG

From wozniak at mcs.anl.gov Fri Mar 2 11:01:02 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Fri, 2 Mar 2012 11:01:02 -0600 (CST)
Subject: [Swift-devel] coasters-hosts.pl script
In-Reply-To: <520EEDBE-3C38-44DD-B9C1-97AC8E1BDF19@mcs.anl.gov>
References: <520EEDBE-3C38-44DD-B9C1-97AC8E1BDF19@mcs.anl.gov>
Message-ID:

Yes - I must have tested this with a different log file. I just checked in and installed in ~wozniak/Public a fix for this that launches WORKER_INIT_CMD after the reconnect(). I am a little worried about timeouts but it works so far. I will continue testing...

Justin

--
Justin M Wozniak

From jonmon at mcs.anl.gov Fri Mar 2 11:05:23 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 2 Mar 2012 11:05:23 -0600
Subject: [Swift-devel] coasters-hosts.pl script
In-Reply-To:
References: <520EEDBE-3C38-44DD-B9C1-97AC8E1BDF19@mcs.anl.gov>
Message-ID:

Thanks. In my copy I thought I had moved the reconnect to before the init-cmd and it still wasn't working. I will test with your change. I just verified that it was indeed waiting for the worker-init.pl script to finish. I created an empty file for the script to read and it finished connecting and the IP addresses I needed were added to the cps log. I will also be testing your fix.

From jonmon at mcs.anl.gov Fri Mar 2 11:15:03 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 2 Mar 2012 11:15:03 -0600
Subject: [Swift-devel] coasters-hosts.pl script
In-Reply-To:
References: <520EEDBE-3C38-44DD-B9C1-97AC8E1BDF19@mcs.anl.gov>
Message-ID:

That fix still did not work. I had moved it to the same spot. It is still waiting for the worker-init.pl script to finish before the IP addresses are printed to the cps log. Those IP addresses are what is needed by the coasters-hosts.pl script to finish. If I create an empty file for the coasters-hosts.pl script to read, then the work continues and the IP addresses show up in the cps log.

Why is log4j waiting to add those lines to the cps log until after the worker-init.pl script is finished?

From jonmon at mcs.anl.gov Fri Mar 2 11:26:45 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 2 Mar 2012 11:26:45 -0600
Subject: [Swift-devel] coasters-hosts.pl script
In-Reply-To:
References: <520EEDBE-3C38-44DD-B9C1-97AC8E1BDF19@mcs.anl.gov>
Message-ID: <6DE3E57C-4412-4B63-9196-2D532E71DACB@mcs.anl.gov>

Can we match this line:

2012/03/02 17:16:04.712 INFO - Running on node 172.18.1.83

from the worker log, instead of this line:

2012-03-02 17:21:25,214+0000 DEBUG Cpu worker started: block=2012.0302.171344.704 host=172.18.1.83 id=0

from the cps log? They both provide the same IP addresses, and the worker log always has that IP address before the cps log does.

From wilde at mcs.anl.gov Fri Mar 2 11:27:51 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 2 Mar 2012 11:27:51 -0600 (CST)
Subject: [Swift-devel] coasters-hosts.pl script
In-Reply-To:
Message-ID: <490167337.54555.1330709271010.JavaMail.root@zimbra.anl.gov>

This all seems a bit brittle. I think what we did in Falkon was to use the Zoid init script that runs on the IOP to add the worker IPs:

http://wiki.mcs.anl.gov/zeptoos/index.php/ZOID#User_script

This script can find the subnet of the workers, and the worker IPs on that subnet are fixed.
You still have the issue of waiting for all the IPs to report back. Each could make a file in a directory. But you'd be less at the mercy of worker.pl scripts and log4j to get the IP info you need, perhaps? - Mike ----- Original Message ----- > From: "Jonathan Monette" > To: "Justin M Wozniak" > Cc: "swift-devel at ci.uchicago.edu Devel" , emalayan at ece.ubc.ca > Sent: Friday, March 2, 2012 11:15:03 AM > Subject: Re: [Swift-devel] coasters-hosts.pl script > That fix still did not work. I had moved it to the same spot. It is > still waiting for the worker-init.pl script to finish before the ip > addresses are printed to the cps log. Those ip addresses are what is > needed by the coaster-hosts.pl script to finish. If I create an empty > file for the coaster-host.pl script to read, then the work continues > and the ip addresses show up in the cps log. > > Why is log4j waiting to add those lines to the cps log after the > worker-init.pl script is finished? > > On Mar 2, 2012, at 11:05 AM, Jonathan Monette wrote: > > > Thanks, in my copy I thought I had moved the reconnect to before the > > init-cmd and it still wasn't working. I will test with your change. > > I just verified that it was indeed waiting for the worker-init.pl > > script to finish. I created an empty file for the script to read and > > it finished connecting and the ip addresses I needed were added to > > the cps log. I will also be testing your fix. > > > > On Mar 2, 2012, at 11:01 AM, Justin M Wozniak wrote: > > > >> > >> Yes- I must have tested this with a different log file. I just > >> checked in and installed in ~wozniak/Public a fix for this that > >> launches WORKER_INIT_CMD after the reconnect(). I am a little > >> worried about time outs but it works so far. I will continue > >> testing... > >> Justin > >> > >> On Thu, 1 Mar 2012, Jonathan Monette wrote: > >> > >>> Justin, > >>> So I have been trying to help Emalayan get the host list file for > >>> the worker-init.pl script. 
It seems the cps log file is not > >>> providing the ip addresses for the coasters-hosts.pl script. I > >>> thought this was maybe because we did not have the correct log4j > >>> setting set but we have the Coaster service Cpu set to DEBUG. So > >>> for some reason the workers are not connecting to the service. > >>> When I comment out the export WORKER_ENVIRONEMTN="?" line in the > >>> coaster-service.conf file I see the workers connect and the cps > >>> log file shows there ip addresses. However when setting this line > >>> it seems they are not connecting. > >>> > >>> Emalayan thought there might be some sort of circular dependency > >>> going with the host-list file and the worker. The worker requires > >>> the host-list file so that it can run the worker-init.pl script > >>> and then connect but the host-list file cannot be generated > >>> because the workers cannot connect. I noticed in your swift-test > >>> directory the cps files did have the ip addresses set and > >>> coasters-hosts.pl found the ip addresses and reported them. Did > >>> you try that test with setting the WORKER_ENVIRONMENT variable in > >>> the coaster-service.conf file? Any idea what may be happening? The > >>> job is running when looking under cqstat. > >>> > >>> A side note: At the mosaswift site, your example talks about > >>> running the coasters-hosts.pl on the cps log but the example you > >>> provide runs it on logs/coasters.log. This may need to be changed. > >>> Also, should provide the log4j setting that is required to > >>> generate the Cpu line with the worker ip address just to clarify > >>> that this line should be set for this script to work. 
> >>> > >>> For reference, this line: > >>> log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG > >> > >> -- > >> Justin M Wozniak > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Fri Mar 2 11:47:36 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Fri, 2 Mar 2012 11:47:36 -0600 Subject: [Swift-devel] coasters-hosts.pl script In-Reply-To: <490167337.54555.1330709271010.JavaMail.root@zimbra.anl.gov> References: <490167337.54555.1330709271010.JavaMail.root@zimbra.anl.gov> Message-ID: <5CC8D04D-3F1C-44A8-95A8-C18BFC1813F4@mcs.anl.gov> I think this approach was chosen to get something working for Emalayan quickly so he could start developing. I do not think this was Goin to be the final approach. A more stable approach was to replace it later. On Mar 2, 2012, at 11:27 AM, Michael Wilde wrote: > This all seems a bit brittle. I think what we did in Falkon was to use the Zoid init script that runs on the IOP to add the worker IPs: > > http://wiki.mcs.anl.gov/zeptoos/index.php/ZOID#User_script > > This script can find the subnet of the workers, and the worker IPs on that subnet are fixed. > > You still have the issue of waiting for all the IPs to report back. Each could make a file in a directory. But you'd be less at the mercy of worker.pl scripts and log4j to get the IP info you need, perhaps? 
> > - Mike > > ----- Original Message ----- >> From: "Jonathan Monette" >> To: "Justin M Wozniak" >> Cc: "swift-devel at ci.uchicago.edu Devel" , emalayan at ece.ubc.ca >> Sent: Friday, March 2, 2012 11:15:03 AM >> Subject: Re: [Swift-devel] coasters-hosts.pl script >> That fix still did not work. I had moved it to the same spot. It is >> still waiting for the worker-init.pl script to finish before the ip >> addresses are printed to the cps log. Those ip addresses are what is >> needed by the coaster-hosts.pl script to finish. If I create an empty >> file for the coaster-host.pl script to read, then the work continues >> and the ip addresses show up in the cps log. >> >> Why is log4j waiting to add those lines to the cps log after the >> worker-init.pl script is finished? >> >> On Mar 2, 2012, at 11:05 AM, Jonathan Monette wrote: >> >>> Thanks, in my copy I thought I had moved the reconnect to before the >>> init-cmd and it still wasn't working. I will test with your change. >>> I just verified that it was indeed waiting for the worker-init.pl >>> script to finish. I created an empty file for the script to read and >>> it finished connecting and the ip addresses I needed were added to >>> the cps log. I will also be testing your fix. >>> >>> On Mar 2, 2012, at 11:01 AM, Justin M Wozniak wrote: >>> >>>> >>>> Yes- I must have tested this with a different log file. I just >>>> checked in and installed in ~wozniak/Public a fix for this that >>>> launches WORKER_INIT_CMD after the reconnect(). I am a little >>>> worried about time outs but it works so far. I will continue >>>> testing... >>>> Justin >>>> >>>> On Thu, 1 Mar 2012, Jonathan Monette wrote: >>>> >>>>> Justin, >>>>> So I have been trying to help Emalayan get the host list file for >>>>> the worker-init.pl script. It seems the cps log file is not >>>>> providing the ip addresses for the coasters-hosts.pl script. 
I >>>>> thought this was maybe because we did not have the correct log4j >>>>> setting set but we have the Coaster service Cpu set to DEBUG. So >>>>> for some reason the workers are not connecting to the service. >>>>> When I comment out the export WORKER_ENVIRONEMTN="?" line in the >>>>> coaster-service.conf file I see the workers connect and the cps >>>>> log file shows there ip addresses. However when setting this line >>>>> it seems they are not connecting. >>>>> >>>>> Emalayan thought there might be some sort of circular dependency >>>>> going with the host-list file and the worker. The worker requires >>>>> the host-list file so that it can run the worker-init.pl script >>>>> and then connect but the host-list file cannot be generated >>>>> because the workers cannot connect. I noticed in your swift-test >>>>> directory the cps files did have the ip addresses set and >>>>> coasters-hosts.pl found the ip addresses and reported them. Did >>>>> you try that test with setting the WORKER_ENVIRONMENT variable in >>>>> the coaster-service.conf file? Any idea what may be happening? The >>>>> job is running when looking under cqstat. >>>>> >>>>> A side note: At the mosaswift site, your example talks about >>>>> running the coasters-hosts.pl on the cps log but the example you >>>>> provide runs it on logs/coasters.log. This may need to be changed. >>>>> Also, should provide the log4j setting that is required to >>>>> generate the Cpu line with the worker ip address just to clarify >>>>> that this line should be set for this script to work. 
>>>>> >>>>> For reference, this line: >>>>> log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG >>>> >>>> -- >>>> Justin M Wozniak >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From jonmon at mcs.anl.gov Fri Mar 2 16:21:08 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Fri, 2 Mar 2012 16:21:08 -0600 Subject: [Swift-devel] coasters-hosts.pl script In-Reply-To: <6DE3E57C-4412-4B63-9196-2D532E71DACB@mcs.anl.gov> References: <520EEDBE-3C38-44DD-B9C1-97AC8E1BDF19@mcs.anl.gov> <6DE3E57C-4412-4B63-9196-2D532E71DACB@mcs.anl.gov> Message-ID: <77426B86-9314-4A0F-AE60-01DC5D64AB8E@mcs.anl.gov> Emalayan, We believe we have fixed the issue. You can copy the new coasters-hosts.pl script from ~jonmon/surveyor/worker-init-test/coasters-hosts.pl This script reads the worker logs located in the logs directory. The steps to run are as follows: start-coaster-service ./coasters-hosts.pl logs/worker-*.log > worker-hosts.txt You MUST clean out the old worker logs before you start a new coaster service, to make sure the script searches the right worker log files. This may not be ideal at the moment, but it will help get you started. If you have any other questions, feel free to ask. We will need to update the mosaswift site with the new information; we will do this soon. 
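For readers following the thread, here is a minimal Python sketch of the log matching described above. It is not the actual coasters-hosts.pl (a Perl script whose source does not appear in this thread). The function name worker_hosts and the regular expression are assumptions, based only on the "Running on node" worker-log line quoted elsewhere in this thread.

```python
import re
import sys

# Assumed pattern, modelled on the worker-log line quoted in this thread:
#   2012/03/02 17:16:04.712 INFO  - Running on node 172.18.1.83
NODE_LINE = re.compile(r"Running on node (\d{1,3}(?:\.\d{1,3}){3})")


def worker_hosts(lines):
    """Return the unique node IPs found in `lines`, in first-seen order."""
    seen = []
    for line in lines:
        match = NODE_LINE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def main(paths):
    # Read every worker log named on the command line, then print one IP per line.
    lines = []
    for path in paths:
        with open(path) as log:
            lines.extend(log)
    for host in worker_hosts(lines):
        print(host)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved under a hypothetical name such as worker_hosts.py, it would be run the same way as the command above: python worker_hosts.py logs/worker-*.log > worker-hosts.txt. Stale logs from an earlier coaster service would still contribute old addresses, which is why the logs need to be cleaned out first.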
On Mar 2, 2012, at 11:26 AM, Jonathan Monette wrote: > Can we match this line: 2012/03/02 17:16:04.712 INFO - Running on node 172.18.1.83 from the worker log, > instead of this line: 2012-03-02 17:21:25,214+0000 DEBUG Cpu worker started: block=2012.0302.171344.704 host=172.18.1.83 id=0 from the cps log? > > They both provide the same ip addresses. And the worker log always has that ip address before the cps log does. > > On Mar 2, 2012, at 11:15 AM, Jonathan Monette wrote: > >> That fix still did not work. I had moved it to the same spot. It is still waiting for the worker-init.pl script to finish before the ip addresses are printed to the cps log. Those ip addresses are what is needed by the coaster-hosts.pl script to finish. If I create an empty file for the coaster-host.pl script to read, then the work continues and the ip addresses show up in the cps log. >> >> Why is log4j waiting to add those lines to the cps log after the worker-init.pl script is finished? >> >> On Mar 2, 2012, at 11:05 AM, Jonathan Monette wrote: >> >>> Thanks, in my copy I thought I had moved the reconnect to before the init-cmd and it still wasn't working. I will test with your change. I just verified that it was indeed waiting for the worker-init.pl script to finish. I created an empty file for the script to read and it finished connecting and the ip addresses I needed were added to the cps log. I will also be testing your fix. >>> >>> On Mar 2, 2012, at 11:01 AM, Justin M Wozniak wrote: >>> >>>> >>>> Yes- I must have tested this with a different log file. I just checked in and installed in ~wozniak/Public a fix for this that launches WORKER_INIT_CMD after the reconnect(). I am a little worried about time outs but it works so far. I will continue testing... >>>> Justin >>>> >>>> On Thu, 1 Mar 2012, Jonathan Monette wrote: >>>> >>>>> Justin, >>>>> So I have been trying to help Emalayan get the host list file for the worker-init.pl script. 
It seems the cps log file is not providing the ip addresses for the coasters-hosts.pl script. I thought this was maybe because we did not have the correct log4j setting set but we have the Coaster service Cpu set to DEBUG. So for some reason the workers are not connecting to the service. When I comment out the export WORKER_ENVIRONEMTN="?" line in the coaster-service.conf file I see the workers connect and the cps log file shows there ip addresses. However when setting this line it seems they are not connecting. >>>>> >>>>> Emalayan thought there might be some sort of circular dependency going with the host-list file and the worker. The worker requires the host-list file so that it can run the worker-init.pl script and then connect but the host-list file cannot be generated because the workers cannot connect. I noticed in your swift-test directory the cps files did have the ip addresses set and coasters-hosts.pl found the ip addresses and reported them. Did you try that test with setting the WORKER_ENVIRONMENT variable in the coaster-service.conf file? Any idea what may be happening? The job is running when looking under cqstat. >>>>> >>>>> A side note: At the mosaswift site, your example talks about running the coasters-hosts.pl on the cps log but the example you provide runs it on logs/coasters.log. This may need to be changed. Also, should provide the log4j setting that is required to generate the Cpu line with the worker ip address just to clarify that this line should be set for this script to work. 
>>>>> >>>>> For reference, this line: log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG >>>> >>>> -- >>>> Justin M Wozniak >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From svemalayan at yahoo.com Fri Mar 2 16:31:31 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Fri, 2 Mar 2012 14:31:31 -0800 (PST) Subject: [Swift-devel] coasters-hosts.pl script In-Reply-To: <77426B86-9314-4A0F-AE60-01DC5D64AB8E@mcs.anl.gov> References: <520EEDBE-3C38-44DD-B9C1-97AC8E1BDF19@mcs.anl.gov> <6DE3E57C-4412-4B63-9196-2D532E71DACB@mcs.anl.gov> <77426B86-9314-4A0F-AE60-01DC5D64AB8E@mcs.anl.gov> Message-ID: <1330727491.44630.YahooMailNeo@web39508.mail.mud.yahoo.com> Thank you Jon and Justin. This is great news. I will get back to you if I have questions. Regards Emalayan ________________________________ From: Jonathan Monette To: Justin M Wozniak Cc: "swift-devel at ci.uchicago.edu Devel" ; emalayan at ece.ubc.ca Sent: Friday, 2 March 2012 2:21 PM Subject: Re: [Swift-devel] coasters-hosts.pl script Emalayan, We believe we have fixed the issue. You can copy the new coasters-hosts.pl script from ~jonmon/surveyor/worker-init-test/coasters-hosts.pl This script reads the worker logs located in the logs directory. 
The steps to run are as follows: start-coaster-service ./coasters-hosts.pl logs/worker-*.log > worker-hosts.txt You MUST clean out the worker logs after you before you start a new coaster service to make sure the script searches the right worker log files. This may not be ideal at the moment but this will help get you started. If you have any other questions feel free to ask. We will need to update the mosaswift site with the new information, we will do this soon. On Mar 2, 2012, at 11:26 AM, Jonathan Monette wrote: > Can we match this line: 2012/03/02 17:16:04.712 INFO - Running on node 172.18.1.83 from the worker log, > instead of this line: 2012-03-02 17:21:25,214+0000 DEBUG Cpu worker started: block=2012.0302.171344.704 host=172.18.1.83 id=0 from the cps log? > > They both provide the same ip addresses. And the worker log always has that ip address before the cps log does. > > On Mar 2, 2012, at 11:15 AM, Jonathan Monette wrote: > >> That fix still did not work. I had moved it to the same spot. It is still waiting for the worker-init.pl script to finish before the ip addresses are printed to the cps log. Those ip addresses are what is needed by the coaster-hosts.pl script to finish. If I create an empty file for the coaster-host.pl script to read, then the work continues and the ip addresses show up in the cps log. >> >> Why is log4j waiting to add those lines to the cps log after the worker-init.pl script is finished? >> >> On Mar 2, 2012, at 11:05 AM, Jonathan Monette wrote: >> >>> Thanks, in my copy I thought I had moved the reconnect to before the init-cmd and it still wasn't working. I will test with your change. I just verified that it was indeed waiting for the worker-init.pl script to finish. I created an empty file for the script to read and it finished connecting and the ip addresses I needed were added to the cps log. I will also be testing your fix. 
>>> >>> On Mar 2, 2012, at 11:01 AM, Justin M Wozniak wrote: >>> >>>> >>>> Yes- I must have tested this with a different log file. I just checked in and installed in ~wozniak/Public a fix for this that launches WORKER_INIT_CMD after the reconnect(). I am a little worried about time outs but it works so far. I will continue testing... >>>> Justin >>>> >>>> On Thu, 1 Mar 2012, Jonathan Monette wrote: >>>> >>>>> Justin, >>>>> So I have been trying to help Emalayan get the host list file for the worker-init.pl script. It seems the cps log file is not providing the ip addresses for the coasters-hosts.pl script. I thought this was maybe because we did not have the correct log4j setting set but we have the Coaster service Cpu set to DEBUG. So for some reason the workers are not connecting to the service. When I comment out the export WORKER_ENVIRONEMTN="?" line in the coaster-service.conf file I see the workers connect and the cps log file shows there ip addresses. However when setting this line it seems they are not connecting. >>>>> >>>>> Emalayan thought there might be some sort of circular dependency going with the host-list file and the worker. The worker requires the host-list file so that it can run the worker-init.pl script and then connect but the host-list file cannot be generated because the workers cannot connect. I noticed in your swift-test directory the cps files did have the ip addresses set and coasters-hosts.pl found the ip addresses and reported them. Did you try that test with setting the WORKER_ENVIRONMENT variable in the coaster-service.conf file? Any idea what may be happening? The job is running when looking under cqstat. >>>>> >>>>> A side note: At the mosaswift site, your example talks about running the coasters-hosts.pl on the cps log but the example you provide runs it on logs/coasters.log. This may need to be changed. 
Also, should provide the log4j setting that is required to generate the Cpu line with the worker ip address just to clarify that this line should be set for this script to work. >>>>> >>>>> For reference, this line: log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG >>>> >>>> -- >>>> Justin M Wozniak >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From svemalayan at yahoo.com Fri Mar 2 18:41:02 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Fri, 2 Mar 2012 16:41:02 -0800 (PST) Subject: [Swift-devel] Fw: coasters-hosts.pl script In-Reply-To: <00ed01ccf8d4$ac72c660$05585320$@gmail.com> References: <520EEDBE-3C38-44DD-B9C1-97AC8E1BDF19@mcs.anl.gov> <6DE3E57C-4412-4B63-9196-2D532E71DACB@mcs.anl.gov> <77426B86-9314-4A0F-AE60-01DC5D64AB8E@mcs.anl.gov> <1330727491.44630.YahooMailNeo@web39508.mail.mud.yahoo.com> <00ed01ccf8d4$ac72c660$05585320$@gmail.com> Message-ID: <1330735262.39425.YahooMailNeo@web39504.mail.mud.yahoo.com> Forwarding Matei's mail....... ----- Forwarded Message ----- From: Matei Ripeanu To: mosastore at googlegroups.com; 'Jonathan Monette' ; 'Justin M Wozniak' Cc: swift-devel at ci.uchicago.edu; emalayan at ece.ubc.ca Sent: Friday, 2 March 2012 4:29 PM Subject: RE: [Swift-devel] coasters-hosts.pl script Indeed this is good news! Thank you. 
Our next task, I think, will be to figure out how to configure Swift so that the headnode (where Swift runs) will not require any access to intermediate storage (MosaStore). Only the worker nodes will have access to intermediate storage. This is to go around the one way headnode-worker node connectivity issue. Any guidance on how to get this configuration would be much appreciated. Thank you again, -Matei From: mosastore at googlegroups.com [mailto:mosastore at googlegroups.com] On Behalf Of Emalayan Vairavanathan Sent: March-02-12 2:32 PM To: Jonathan Monette; Justin M Wozniak Cc: swift-devel at ci.uchicago.edu Devel; emalayan at ece.ubc.cais ; MosaStore Subject: Re: [Swift-devel] coasters-hosts.pl script Thank you Jon and Justin. This is a great news. I will get back to you if I have questions. Regards Emalayan ________________________________ From: Jonathan Monette To: Justin M Wozniak Cc: "swift-devel at ci.uchicago.edu Devel" ; emalayan at ece.ubc.ca Sent: Friday, 2 March 2012 2:21 PM Subject: Re: [Swift-devel] coasters-hosts.pl script Emalayan, We believe we have fixed the issue. You can copy the new coasters-hosts.pl script from ~jonmon/surveyor/worker-init-test/coasters-hosts.pl This script reads the worker logs located in the logs directory. The steps to run are as follows: start-coaster-service ./coasters-hosts.pl logs/worker-*.log > worker-hosts.txt You MUST clean out the worker logs after you before you start a new coaster service to make sure the script searches the right worker log files. This may not be ideal at the moment but this will help get you started. If you have any other questions feel free to ask. We will need to update the mosaswift site with the new information, we will do this soon. On Mar 2, 2012, at 11:26 AM, Jonathan Monette wrote: > Can we match this line: 2012/03/02 17:16:04.712 INFO 
- Running on node 172.18.1.83 from the worker log, > instead of this line: 2012-03-02 17:21:25,214+0000 DEBUG Cpu worker started: block=2012.0302.171344.704 host=172.18.1.83 id=0 from the cps log? > > They both provide the same ip addresses. And the worker log always has that ip address before the cps log does. > > On Mar 2, 2012, at 11:15 AM, Jonathan Monette wrote: > >> That fix still did not work. I had moved it to the same spot. It is still waiting for the worker-init.pl script to finish before the ip addresses are printed to the cps log. Those ip addresses are what is needed by the coaster-hosts.pl script to finish. If I create an empty file for the coaster-host.pl script to read, then the work continues and the ip addresses show up in the cps log. >> >> Why is log4j waiting to add those lines to the cps log after the worker-init.pl script is finished? >> >> On Mar 2, 2012, at 11:05 AM, Jonathan Monette wrote: >> >>> Thanks, in my copy I thought I had moved the reconnect to before the init-cmd and it still wasn't working. I will test with your change. I just verified that it was indeed waiting for the worker-init.pl script to finish. I created an empty file for the script to read and it finished connecting and the ip addresses I needed were added to the cps log. I will also be testing your fix. >>> >>> On Mar 2, 2012, at 11:01 AM, Justin M Wozniak wrote: >>> >>>> >>>> Yes- I must have tested this with a different log file. I just checked in and installed in ~wozniak/Public a fix for this that launches WORKER_INIT_CMD after the reconnect(). I am a little worried about time outs but it works so far. I will continue testing... >>>> Justin >>>> >>>> On Thu, 1 Mar 2012, Jonathan Monette wrote: >>>> >>>>> Justin, >>>>> So I have been trying to help Emalayan get the host list file for the worker-init.pl script. It seems the cps log file is not providing the ip addresses for the coasters-hosts.pl script. 
I thought this was maybe because we did not have the correct log4j setting set but we have the Coaster service Cpu set to DEBUG. So for some reason the workers are not connecting to the service. When I comment out the export WORKER_ENVIRONEMTN="?" line in the coaster-service.conf file I see the workers connect and the cps log file shows there ip addresses. However when setting this line it seems they are not connecting. >>>>> >>>>> Emalayan thought there might be some sort of circular dependency going with the host-list file and the worker. The worker requires the host-list file so that it can run the worker-init.pl script and then connect but the host-list file cannot be generated because the workers cannot connect. I noticed in your swift-test directory the cps files did have the ip addresses set and coasters-hosts.pl found the ip addresses and reported them. Did you try that test with setting the WORKER_ENVIRONMENT variable in the coaster-service.conf file? Any idea what may be happening? The job is running when looking under cqstat. >>>>> >>>>> A side note: At the mosaswift site, your example talks about running the coasters-hosts.pl on the cps log but the example you provide runs it on logs/coasters.log. This may need to be changed. Also, should provide the log4j setting that is required to generate the Cpu line with the worker ip address just to clarify that this line should be set for this script to work. 
>>>>> >>>>> For reference, this line: log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG >>>> >>>> -- >>>> Justin M Wozniak >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- You received this message because you are subscribed to the Google Groups "MosaStore" group. To post to this group, send email to mosastore at googlegroups.com. To unsubscribe from this group, send email to mosastore+unsubscribe at googlegroups.com. For more options, visit this group at http://groups.google.com/group/mosastore?hl=en. -------------- next part -------------- An HTML attachment was scrubbed... URL: From matei.ripeanu at gmail.com Fri Mar 2 18:29:17 2012 From: matei.ripeanu at gmail.com (Matei Ripeanu) Date: Fri, 2 Mar 2012 16:29:17 -0800 Subject: [Swift-devel] coasters-hosts.pl script In-Reply-To: <1330727491.44630.YahooMailNeo@web39508.mail.mud.yahoo.com> References: <520EEDBE-3C38-44DD-B9C1-97AC8E1BDF19@mcs.anl.gov> <6DE3E57C-4412-4B63-9196-2D532E71DACB@mcs.anl.gov> <77426B86-9314-4A0F-AE60-01DC5D64AB8E@mcs.anl.gov> <1330727491.44630.YahooMailNeo@web39508.mail.mud.yahoo.com> Message-ID: <00ed01ccf8d4$ac72c660$05585320$@gmail.com> Indeed this is good news! Thank you. Our next task, I think, will be to figure out how to configure Swift so that the headnode (where Swift runs) will not require any access to intermediate storage (MosaStore). Only the worker nodes will have access to intermediate storage. 
This is to go around the one way headnode-worker node connectivity issue. Any guidance on how to get this configuration would be much appreciated. Thank you again, -Matei From: mosastore at googlegroups.com [mailto:mosastore at googlegroups.com] On Behalf Of Emalayan Vairavanathan Sent: March-02-12 2:32 PM To: Jonathan Monette; Justin M Wozniak Cc: swift-devel at ci.uchicago.edu Devel; emalayan at ece.ubc.cais ; MosaStore Subject: Re: [Swift-devel] coasters-hosts.pl script Thank you Jon and Justin. This is a great news. I will get back to you if I have questions. Regards Emalayan _____ From: Jonathan Monette To: Justin M Wozniak Cc: "swift-devel at ci.uchicago.edu Devel " ; emalayan at ece.ubc.ca Sent: Friday, 2 March 2012 2:21 PM Subject: Re: [Swift-devel] coasters-hosts.pl script Emalayan, We believe we have fixed the issue. You can copy the new coasters-hosts.pl script from ~jonmon/surveyor/worker-init-test/coasters-hosts.pl This script reads the worker logs located in the logs directory. The steps to run are as follows: start-coaster-service ./coasters-hosts.pl logs/worker-*.log > worker-hosts.txt You MUST clean out the worker logs after you before you start a new coaster service to make sure the script searches the right worker log files. This may not be ideal at the moment but this will help get you started. If you have any other questions feel free to ask. We will need to update the mosaswift site with the new information, we will do this soon. On Mar 2, 2012, at 11:26 AM, Jonathan Monette wrote: > Can we match this line: 2012/03/02 17:16:04.712 INFO - Running on node 172.18.1.83 from the worker log, > instead of this line: 2012-03-02 17:21:25,214+0000 DEBUG Cpu worker started: block=2012.0302.171344.704 host=172.18.1.83 id=0 from the cps log? > > They both provide the same ip addresses. And the worker log always has that ip address before the cps log does. > > On Mar 2, 2012, at 11:15 AM, Jonathan Monette wrote: > >> That fix still did not work. 
I had moved it to the same spot. It is still waiting for the worker-init.pl script to finish before the ip addresses are printed to the cps log. Those ip addresses are what is needed by the coaster-hosts.pl script to finish. If I create an empty file for the coaster-host.pl script to read, then the work continues and the ip addresses show up in the cps log. >> >> Why is log4j waiting to add those lines to the cps log after the worker-init.pl script is finished? >> >> On Mar 2, 2012, at 11:05 AM, Jonathan Monette wrote: >> >>> Thanks, in my copy I thought I had moved the reconnect to before the init-cmd and it still wasn't working. I will test with your change. I just verified that it was indeed waiting for the worker-init.pl script to finish. I created an empty file for the script to read and it finished connecting and the ip addresses I needed were added to the cps log. I will also be testing your fix. >>> >>> On Mar 2, 2012, at 11:01 AM, Justin M Wozniak wrote: >>> >>>> >>>> Yes- I must have tested this with a different log file. I just checked in and installed in ~wozniak/Public a fix for this that launches WORKER_INIT_CMD after the reconnect(). I am a little worried about time outs but it works so far. I will continue testing... >>>> Justin >>>> >>>> On Thu, 1 Mar 2012, Jonathan Monette wrote: >>>> >>>>> Justin, >>>>> So I have been trying to help Emalayan get the host list file for the worker-init.pl script. It seems the cps log file is not providing the ip addresses for the coasters-hosts.pl script. I thought this was maybe because we did not have the correct log4j setting set but we have the Coaster service Cpu set to DEBUG. So for some reason the workers are not connecting to the service. When I comment out the export WORKER_ENVIRONEMTN="." line in the coaster-service.conf file I see the workers connect and the cps log file shows there ip addresses. However when setting this line it seems they are not connecting. 
>>>>> >>>>> Emalayan thought there might be some sort of circular dependency going with the host-list file and the worker. The worker requires the host-list file so that it can run the worker-init.pl script and then connect but the host-list file cannot be generated because the workers cannot connect. I noticed in your swift-test directory the cps files did have the ip addresses set and coasters-hosts.pl found the ip addresses and reported them. Did you try that test with setting the WORKER_ENVIRONMENT variable in the coaster-service.conf file? Any idea what may be happening? The job is running when looking under cqstat. >>>>> >>>>> A side note: At the mosaswift site, your example talks about running the coasters-hosts.pl on the cps log but the example you provide runs it on logs/coasters.log. This may need to be changed. Also, should provide the log4j setting that is required to generate the Cpu line with the worker ip address just to clarify that this line should be set for this script to work. >>>>> >>>>> For reference, this line: log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG >>>> >>>> -- >>>> Justin M Wozniak >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- You received this message because you are subscribed to the Google Groups "MosaStore" group. To post to this group, send email to mosastore at googlegroups.com. To unsubscribe from this group, send email to mosastore+unsubscribe at googlegroups.com. 
For more options, visit this group at http://groups.google.com/group/mosastore?hl=en. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Fri Mar 2 19:24:38 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 2 Mar 2012 19:24:38 -0600 (CST) Subject: [Swift-devel] coasters-hosts.pl script In-Reply-To: <00ed01ccf8d4$ac72c660$05585320$@gmail.com> Message-ID: <1975728845.57418.1330737878041.JavaMail.root@zimbra.anl.gov> > Our next task, I think, will be to figure out how to configure Swift > so that the headnode (where Swift runs) will not require any access to > intermediate storage (MosaStore). Only the worker nodes will have > access to intermediate storage. This is to go around the one way > headnode-worker node connectivity issue. For this I would try: - in the .swift script, map the intermediate data to a full path like /mosa/... - in an fs.data file set /mosa to DIRECT mode This means that each app that references /mosa will do so from the worker node with no name translation and no intermediate copying or linking. At least that's the intent. We need to make sure that the CDM direct mode achieves this. The first step is to write a simple example (like a cat to cat copy) and test it in local mode, say using /tmp/mosa, checking its logs and work dir to ensure that direct mode is working as expected. Then try the same thing on a vanilla cluster (the UBC cluster would be fine) using first /tmp/mosa and then the real /mosa. After verifying all that, try it on the BG/P. Two other pre-reqs to check on the BG/P would be: - is Mosa starting up OK and accessible on the worker nodes - to check that, verify that ssh-telnet access to the worker nodes is working. It seemed *not* to be on surveyor earlier this week. Mike > > > Any guidance on how to get this configuration would be much > appreciated. 
>
> Thank you again,
>
> -Matei
>
> From: mosastore at googlegroups.com [mailto:mosastore at googlegroups.com]
> On Behalf Of Emalayan Vairavanathan
> Sent: March-02-12 2:32 PM
> To: Jonathan Monette; Justin M Wozniak
> Cc: swift-devel at ci.uchicago.edu Devel; emalayan at ece.ubc.cais ;
> MosaStore
> Subject: Re: [Swift-devel] coasters-hosts.pl script
>
> Thank you Jon and Justin.
>
> This is great news. I will get back to you if I have questions.
>
> Regards
> Emalayan
>
> From: Jonathan Monette < jonmon at mcs.anl.gov >
> To: Justin M Wozniak < wozniak at mcs.anl.gov >
> Cc: " swift-devel at ci.uchicago.edu Devel " <
> swift-devel at ci.uchicago.edu >; emalayan at ece.ubc.ca
> Sent: Friday, 2 March 2012 2:21 PM
> Subject: Re: [Swift-devel] coasters-hosts.pl script
>
> Emalayan,
> We believe we have fixed the issue. You can copy the new
> coasters-hosts.pl script from
> ~jonmon/surveyor/worker-init-test/coasters-hosts.pl
>
> This script reads the worker logs located in the logs directory. The
> steps to run are as follows:
> start-coaster-service
>
> ./coasters-hosts.pl logs/worker-*.log > worker-hosts.txt
>
> You MUST clean out the worker logs before you start a new
> coaster service to make sure the script searches the right worker log
> files. This may not be ideal at the moment but this will help get you
> started. If you have any other questions feel free to ask. We will
> need to update the mosaswift site with the new information; we will do
> this soon.
>
> On Mar 2, 2012, at 11:26 AM, Jonathan Monette wrote:
>
> > Can we match this line: 2012/03/02 17:16:04.712 INFO - Running on
> > node 172.18.1.83 from the worker log,
> > instead of this line: 2012-03-02 17:21:25,214+0000 DEBUG Cpu worker
> > started: block=2012.0302.171344.704 host=172.18.1.83 id=0 from the
> > cps log?
> >
> > They both provide the same ip addresses. And the worker log always
> > has that ip address before the cps log does.
> >
> > On Mar 2, 2012, at 11:15 AM, Jonathan Monette wrote:
> >
> >> That fix still did not work. I had moved it to the same spot. It is
> >> still waiting for the worker-init.pl script to finish before the ip
> >> addresses are printed to the cps log. Those ip addresses are what
> >> is needed by the coasters-hosts.pl script to finish. If I create an
> >> empty file for the coasters-hosts.pl script to read, then the work
> >> continues and the ip addresses show up in the cps log.
> >>
> >> Why is log4j waiting to add those lines to the cps log after the
> >> worker-init.pl script is finished?
> >>
> >> On Mar 2, 2012, at 11:05 AM, Jonathan Monette wrote:
> >>
> >>> Thanks, in my copy I thought I had moved the reconnect to before
> >>> the init-cmd and it still wasn't working. I will test with your
> >>> change. I just verified that it was indeed waiting for the
> >>> worker-init.pl script to finish. I created an empty file for the
> >>> script to read and it finished connecting and the ip addresses I
> >>> needed were added to the cps log. I will also be testing your fix.
> >>>
> >>> On Mar 2, 2012, at 11:01 AM, Justin M Wozniak wrote:
> >>>
> >>>>
> >>>> Yes - I must have tested this with a different log file. I just
> >>>> checked in and installed in ~wozniak/Public a fix for this that
> >>>> launches WORKER_INIT_CMD after the reconnect(). I am a little
> >>>> worried about timeouts but it works so far. I will continue
> >>>> testing...
> >>>> Justin
> >>>>
> >>>> On Thu, 1 Mar 2012, Jonathan Monette wrote:
> >>>>
> >>>>> Justin,
> >>>>> So I have been trying to help Emalayan get the host list file
> >>>>> for the worker-init.pl script. It seems the cps log file is not
> >>>>> providing the ip addresses for the coasters-hosts.pl script. I
> >>>>> thought this was maybe because we did not have the correct log4j
> >>>>> setting set but we have the Coaster service Cpu set to DEBUG. So
> >>>>> for some reason the workers are not connecting to the service.
> >>>>> When I comment out the export WORKER_ENVIRONMENT="?" line in the
> >>>>> coaster-service.conf file I see the workers connect and the cps
> >>>>> log file shows their ip addresses. However when setting this
> >>>>> line it seems they are not connecting.
> >>>>>
> >>>>> Emalayan thought there might be some sort of circular dependency
> >>>>> going with the host-list file and the worker. The worker
> >>>>> requires the host-list file so that it can run the
> >>>>> worker-init.pl script and then connect but the host-list file
> >>>>> cannot be generated because the workers cannot connect. I
> >>>>> noticed in your swift-test directory the cps files did have the
> >>>>> ip addresses set and coasters-hosts.pl found the ip addresses
> >>>>> and reported them. Did you try that test with setting the
> >>>>> WORKER_ENVIRONMENT variable in the coaster-service.conf file?
> >>>>> Any idea what may be happening? The job is running when looking
> >>>>> under cqstat.
> >>>>>
> >>>>> A side note: At the mosaswift site, your example talks about
> >>>>> running the coasters-hosts.pl on the cps log but the example you
> >>>>> provide runs it on logs/coasters.log. This may need to be
> >>>>> changed. Also, it should provide the log4j setting that is required
> >>>>> to generate the Cpu line with the worker ip address just to
> >>>>> clarify that this line should be set for this script to work.
> >>>>>
> >>>>> For reference, this line:
> >>>>> log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG
> >>>>
> >>>> --
> >>>> Justin M Wozniak
> >>>
> >>> _______________________________________________
> >>> Swift-devel mailing list
> >>> Swift-devel at ci.uchicago.edu
> >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
> --
> You received this message because you are subscribed to the Google
> Groups "MosaStore" group.
> To post to this group, send email to mosastore at googlegroups.com .
> To unsubscribe from this group, send email to
> mosastore+unsubscribe at googlegroups.com .
> For more options, visit this group at
> http://groups.google.com/group/mosastore?hl=en .

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Fri Mar 2 19:27:01 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 2 Mar 2012 19:27:01 -0600 (CST)
Subject: [Swift-devel] coasters-hosts.pl script
In-Reply-To: <1975728845.57418.1330737878041.JavaMail.root@zimbra.anl.gov>
Message-ID: <2035995545.57424.1330738021666.JavaMail.root@zimbra.anl.gov>

> The first step is to write a simple example (like a cat to cat copy)
> and test it in local mode, say using /tmp/mosa, checking its logs and
> work dir to ensure that direct mode was working as expected.
> Then try the same thing on a vanilla cluster (the UBC cluster would be fine)
> using first /tmp/mosa and then the real /mosa.

I forgot to clarify: do the above test with coaster and provider staging mode, so that there *is* no shared filesystem defined (i.e. the workdirectory tag in the sites file should be /tmp or /dev/shm).

From svemalayan at yahoo.com Fri Mar 2 21:13:30 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Fri, 2 Mar 2012 19:13:30 -0800 (PST)
Subject: [Swift-devel] coasters-hosts.pl script
In-Reply-To: <77426B86-9314-4A0F-AE60-01DC5D64AB8E@mcs.anl.gov>
References: <520EEDBE-3C38-44DD-B9C1-97AC8E1BDF19@mcs.anl.gov> <6DE3E57C-4412-4B63-9196-2D532E71DACB@mcs.anl.gov> <77426B86-9314-4A0F-AE60-01DC5D64AB8E@mcs.anl.gov>
Message-ID: <1330744410.4245.YahooMailNeo@web39503.mail.mud.yahoo.com>

Hi Jon,

Thank you again for your time and the very quick fix. I tested the fix with the zeptoos and zepto-vn-eval/mosatest kernel profiles. The fix only worked with zeptoos. It did not work with zepto-vn-eval/mosatest (this profile contains some MosaStore-related bug fixes, so we need it to run MosaStore).

The reason is: I can generate worker-hosts.txt only with zeptoos, not with zepto-vn-eval/mosatest. This is because coasters-hosts.pl extracts the worker IP addresses from the worker log files, but with zepto-vn-eval/mosatest the worker log files didn't contain the IP address. Please see the log messages below.

Do you have any idea? It took me a few hours to narrow down the problem and find out that the issue is with the kernel profile. I hope this information will help you.

Thank you
Emalayan

With zeptoos:

2012/03/03 02:41:01.492 INFO - 2012.0303.023831.32459 Logging started: Sat Mar 3 02:41:01 2012
2012/03/03 02:41:01.492 INFO - 2012.0303.023831.32459 Logging started: Sat Mar 3 02:41:01 2012
2012/03/03 02:41:01.494 INFO - Running on node 172.18.1.19
2012/03/03 02:41:01.494 DEBUG - uri=http://172.17.3.12:22346
2012/03/03 02:41:01.494 DEBUG - scheme=http
2012/03/03 02:41:01.495 DEBUG - host=172.17.3.12
2012/03/03 02:41:01.495 DEBUG - port=22346
2012/03/03 02:41:01.495 DEBUG - blockid=2012.0303.023831.32459

With zepto-vn-eval/mosatest:

2012/03/03 02:50:40.667 INFO - 2012.0303.024814.15474 Logging started: Sat Mar 3 02:50:40 2012
2012/03/03 02:50:40.683 INFO - Running on node (none)
2012/03/03 02:50:40.684 DEBUG - uri=http://172.17.3.12:22346
2012/03/03 02:50:40.684 DEBUG - scheme=http
2012/03/03 02:50:40.685 DEBUG - host=172.17.3.12
2012/03/03 02:50:40.685 DEBUG - port=22346
2012/03/03 02:50:40.686 DEBUG - blockid=2012.0303.024814.15474

From wilde at mcs.anl.gov Fri Mar 2 21:31:18 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 2 Mar 2012 21:31:18 -0600 (CST)
Subject: [Swift-devel] coasters-hosts.pl script
In-Reply-To: <1330744410.4245.YahooMailNeo@web39503.mail.mud.yahoo.com>
Message-ID: <170800869.57544.1330745478394.JavaMail.root@zimbra.anl.gov>

Emalayan,

The problem may be due to the hostname command returning something unexpected (perhaps null) on the worker nodes when booted under that kernel profile.

These lines are in worker.pl:

  my $myhost=`hostname`;
  $myhost =~ s/\s+$//;
  ...
  wlog(INFO, "Running on node $myhost\n");

To debug this, it seems useful to be able to log in to the compute nodes via the ssh-telnet procedure. That seemed not to work the last time we tried - perhaps that should be debugged.

Also, you could run simple test jobs with cqsub to print the output (and location) of the hostname command.

Perhaps hostname is not in the PATH for worker nodes booted under that kernel???
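
A minimal sketch of the check implied by the worker.pl lines above, written in Python rather than the Perl of worker.pl and coasters-hosts.pl, and not taken from either script. It scans worker log lines for the "Running on node" message and separates dotted IPv4 addresses from anything else hostname may have returned, such as "(none)". The function name split_hosts and both regular expressions are assumptions made for illustration only.

```python
import re

# Matches the message that worker.pl logs, e.g.
# "2012/03/03 02:41:01.494 INFO - Running on node 172.18.1.19"
NODE_LINE = re.compile(r"Running on node\s+(\S+)")

# Loose dotted-quad check; it does not validate the 0-255 range.
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def split_hosts(log_lines):
    """Return (ips, other): usable IPv4 addresses and any other hostname output."""
    ips, other = [], []
    for line in log_lines:
        match = NODE_LINE.search(line)
        if match is None:
            continue  # not a "Running on node" line
        host = match.group(1)
        if IPV4.match(host):
            ips.append(host)
        else:
            other.append(host)  # e.g. "(none)" from a misconfigured hostname
    return ips, other
```

Fed the two worker log excerpts from this thread, a check like this would put 172.18.1.19 in the first list and "(none)" in the second, which makes the failing kernel profile easy to spot before worker-hosts.txt is written.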
I recall in the distant past we used to need to do various PATH and LD_LIBRARY_PATH initialization steps to get the right /bin and /usr/bin for the ZeptoOS nodes. Maybe something is broken in that regard for the alternate kernel profile?

- Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From jonmon at mcs.anl.gov Fri Mar 2 21:46:07 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 2 Mar 2012 21:46:07 -0600
Subject: [Swift-devel] coasters-hosts.pl script
In-Reply-To: <170800869.57544.1330745478394.JavaMail.root@zimbra.anl.gov>
References: <170800869.57544.1330745478394.JavaMail.root@zimbra.anl.gov>
Message-ID: <4DD588E3-1BBC-476C-817E-0DDC6F26A30B@mcs.anl.gov>

I was just about to send this exact information. :)

I do not think that hostname not being in the PATH would cause "none" to appear. hostname has to be configured to return what you want it to.
If it wasn't in the PATH, I think there would be more problems with the worker, since it would return an error for the binary not being found, but I could be wrong there.

But I echo Mike's debugging suggestions to diagnose the problems. Try sshing/telnetting to the compute node to check out the environment that the worker sees while it is running.

From wilde at mcs.anl.gov Sat Mar 3 09:40:15 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 3 Mar 2012 09:40:15 -0600 (CST)
Subject: [Swift-devel] coasters-hosts.pl script
In-Reply-To: <4DD588E3-1BBC-476C-817E-0DDC6F26A30B@mcs.anl.gov>
Message-ID: <2087527384.58120.1330789215663.JavaMail.root@zimbra.anl.gov>

Jon, thanks - I missed your note when I signed off last night.

We did a few more tests and verified that on the default zeptoos kernel, hostname was returning a numeric dotted IP address like 172.nnn.nnn.nnn, while on the special kernel fixed for Mosa it was returning "(none)".

So it seems like some config issue in that kernel profile. Emalayan was going to try to patch the call to hostname with a script that pulls the IP address from ifconfig or some other source.

He was also going to try the compute node login procedure (ssh-telnet), and report if it still doesn't work, as that would be handy for debugging in this case. For the moment we were debugging this with cqsub jobs.

I'm gonna drop out and leave this to you and Emalayan. (I just happened to be online when he reported this problem.)

Thanks,

- Mike

----- Original Message -----
> From: "Jonathan Monette"
> To: "Michael Wilde"
> Cc: "Emalayan Vairavanathan", "swift-devel at ci.uchicago.edu Devel",
> emalayan at ece.ubc.ca, "MosaStore", "Justin M Wozniak"
> Sent: Friday, March 2, 2012 9:46:07 PM
> Subject: Re: [Swift-devel] coasters-hosts.pl script
> I was just about to send this exact information. :)
>
> I do not think that hostname not being in the PATH would cause "none"
> to appear. hostname has to be configured to return what you want it
> to. If it wasn't in the PATH, I think there would be more problems with
> the worker, since it would return an error for the binary not being
> found, but I could be wrong there.
>
> But I echo Mike's debugging suggestions to diagnose the problems. Try
> sshing/telnetting to the compute node to check out the environment
> that the worker sees while it is running.
>
> On Mar 2, 2012, at 9:31 PM, Michael Wilde wrote:
>
> > Emalayan,
> >
> > The problem may be due to the hostname command returning something
> > unexpected (perhaps null) on the worker nodes when booted under that
> > kernel profile.
> >
> > These lines are in worker.pl:
> >
> > my $myhost=`hostname`;
> > $myhost =~ s/\s+$//;
> > ...
> > wlog(INFO, "Running on node $myhost\n");
> >
> > To debug this, it seems useful to be able to login to the compute
> > nodes via the ssh-telnet procedure. That seems not to work last time
> > we tried - perhaps that should be debugged.
> >
> > Also, you could run simple test jobs with cqsub to print the output
> > (and location) of the hostname command.
> >
> > Perhaps hostname is not in the PATH for worker nodes booted under
> > that kernel???
> >
> > I recall in the distant past we used to need to do various PATH and
> > LD_LIBRARY_PATH initialization steps to get the right /bin and
> > /usr/bin for the ZeptoOS nodes. Maybe something is broken in that
> > regard for the alternate kernel profile?
> >
> > - Mike
> >
> > ----- Original Message -----
> >> From: "Emalayan Vairavanathan"
> >> To: "Jonathan Monette", "Justin M Wozniak"
> >> Cc: "swift-devel at ci.uchicago.edu Devel", emalayan at ece.ubc.ca, "MosaStore"
> >> Sent: Friday, March 2, 2012 9:13:30 PM
> >> Subject: Re: [Swift-devel] coasters-hosts.pl script
> >> Hi Jon,
> >>
> >> Thank you again for your time and very quick fix. I tested the fix
> >> with zeptoos and zepto-vn-eval/mosatest kernel profiles. The fix only
> >> worked with zeptoos. It did not work with zepto-vn-eval/mosatest (This
> >> profile contains some Mosastore related bug fixes so to run MosaStore
> >> we need this profile ).
> >>
> >> The reason is:
> >>
> >> I can generate worker-hosts.txt only with zeptoos and it did not work
> >> with zepto-vn-eval/mosatest. This is because coasters-hosts.pl extract
> >> worker IP address from workers log files. But with
> >> zepto-vn-eval/mosatest worker log files didnt contain the IP address.
> >> Please see the log messages attached below.
> >>
> >> Do you have any idea ? It took few hours for me to narrow down the
> >> problem and find out that the issue is with kernel-profile . I hope
> >> this information will help you.
> >>
> >> Thank you
> >> Emalayan
> >>
> >> With zeptoos :
> >>
> >> 2012/03/03 02:41:01.492 INFO - 2012.0303.023831.32459 Logging started: Sat Mar 3 02:41:01 2012
> >> 2012/03/03 02:41:01.492 INFO - 2012.0303.023831.32459 Logging started: Sat Mar 3 02:41:01 2012
> >> 2012/03/03 02:41:01.494 INFO - Running on node 172.18.1.19
> >> 2012/03/03 02:41:01.494 DEBUG - uri=http://172.17.3.12:22346
> >> 2012/03/03 02:41:01.494 DEBUG - scheme=http
> >> 2012/03/03 02:41:01.495 DEBUG - host=172.17.3.12
> >> 2012/03/03 02:41:01.495 DEBUG - port=22346
> >> 2012/03/03 02:41:01.495 DEBUG - blockid=2012.0303.023831.32459
> >>
> >> With zepto-vn-eval/mosatest:
> >>
> >> 2012/03/03 02:50:40.667 INFO - 2012.0303.024814.15474 Logging started: Sat Mar 3 02:50:40 2012
> >> 2012/03/03 02:50:40.683 INFO - Running on node (none)
> >> 2012/03/03 02:50:40.684 DEBUG - uri=http://172.17.3.12:22346
> >> 2012/03/03 02:50:40.684 DEBUG - scheme=http
> >> 2012/03/03 02:50:40.685 DEBUG - host=172.17.3.12
> >> 2012/03/03 02:50:40.685 DEBUG - port=22346
> >> 2012/03/03 02:50:40.686 DEBUG - blockid=2012.0303.024814.15474
> >>
> >> From: Jonathan Monette
> >> To: Justin M Wozniak
> >> Cc: "swift-devel at ci.uchicago.edu Devel"; emalayan at ece.ubc.ca
> >> Sent: Friday, 2 March 2012 2:21 PM
> >> Subject: Re: [Swift-devel] coasters-hosts.pl script
> >>
> >> Emalayan,
> >> We believe we have fixed the issue. You can copy the new
> >> coasters-hosts.pl script from
> >> ~jonmon/surveyor/worker-init-test/coasters-hosts.pl
> >>
> >> This script reads the worker logs located in the logs directory. The
> >> steps to run are as follows:
> >> start-coaster-service
> >>
> >> ./coasters-hosts.pl logs/worker-*.log > worker-hosts.txt
> >>
> >> You MUST clean out the worker logs after you before you start a new
> >> coaster service to make sure the script searches the right worker log
> >> files. This may not be ideal at the moment but this will help get you
> >> started. If you have any other questions feel free to ask. We will
> >> need to update the mosaswift site with the new information, we will do
> >> this soon.
> >>
> >> On Mar 2, 2012, at 11:26 AM, Jonathan Monette wrote:
> >>
> >>> Can we match this line: 2012/03/02 17:16:04.712 INFO - Running on
> >>> node 172.18.1.83 from the worker log,
> >>> instead of this line: 2012-03-02 17:21:25,214+0000 DEBUG Cpu worker
> >>> started: block=2012.0302.171344.704 host=172.18.1.83 id=0 from the
> >>> cps log?
> >>>
> >>> They both provide the same ip addresses. And the worker log always
> >>> has that ip address before the cps log does.
> >>>
> >>> On Mar 2, 2012, at 11:15 AM, Jonathan Monette wrote:
> >>>
> >>>> That fix still did not work. I had moved it to the same spot. It is
> >>>> still waiting for the worker-init.pl script to finish before the ip
> >>>> addresses are printed to the cps log. Those ip addresses are what
> >>>> is needed by the coaster-hosts.pl script to finish. If I create an
> >>>> empty file for the coaster-host.pl script to read, then the work
> >>>> continues and the ip addresses show up in the cps log.
> >>>>
> >>>> Why is log4j waiting to add those lines to the cps log after the
> >>>> worker-init.pl script is finished?
> >>>>
> >>>> On Mar 2, 2012, at 11:05 AM, Jonathan Monette wrote:
> >>>>
> >>>>> Thanks, in my copy I thought I had moved the reconnect to before
> >>>>> the init-cmd and it still wasn't working. I will test with your
> >>>>> change. I just verified that it was indeed waiting for the
> >>>>> worker-init.pl script to finish. I created an empty file for the
> >>>>> script to read and it finished connecting and the ip addresses I
> >>>>> needed were added to the cps log. I will also be testing your
> >>>>> fix.
> >>>>>
> >>>>> On Mar 2, 2012, at 11:01 AM, Justin M Wozniak wrote:
> >>>>>
> >>>>>>
> >>>>>> Yes- I must have tested this with a different log file. I just
> >>>>>> checked in and installed in ~wozniak/Public a fix for this that
> >>>>>> launches WORKER_INIT_CMD after the reconnect(). I am a little
> >>>>>> worried about time outs but it works so far. I will continue
> >>>>>> testing...
> >>>>>> Justin
> >>>>>>
> >>>>>> On Thu, 1 Mar 2012, Jonathan Monette wrote:
> >>>>>>
> >>>>>>> Justin,
> >>>>>>> So I have been trying to help Emalayan get the host list file
> >>>>>>> for the worker-init.pl script. It seems the cps log file is not
> >>>>>>> providing the ip addresses for the coasters-hosts.pl script. I
> >>>>>>> thought this was maybe because we did not have the correct log4j
> >>>>>>> setting set but we have the Coaster service Cpu set to DEBUG. So
> >>>>>>> for some reason the workers are not connecting to the service.
> >>>>>>> When I comment out the export WORKER_ENVIRONEMTN="?" line in the
> >>>>>>> coaster-service.conf file I see the workers connect and the cps
> >>>>>>> log file shows there ip addresses. However when setting this
> >>>>>>> line it seems they are not connecting.
> >>>>>>>
> >>>>>>> Emalayan thought there might be some sort of circular dependency
> >>>>>>> going with the host-list file and the worker. The worker
> >>>>>>> requires the host-list file so that it can run the
> >>>>>>> worker-init.pl script and then connect but the host-list file
> >>>>>>> cannot be generated because the workers cannot connect. I
> >>>>>>> noticed in your swift-test directory the cps files did have the
> >>>>>>> ip addresses set and coasters-hosts.pl found the ip addresses
> >>>>>>> and reported them. Did you try that test with setting the
> >>>>>>> WORKER_ENVIRONMENT variable in the coaster-service.conf file?
> >>>>>>> Any idea what may be happening? The job is running when looking
> >>>>>>> under cqstat.
> >>>>>>>
> >>>>>>> A side note: At the mosaswift site, your example talks about
> >>>>>>> running the coasters-hosts.pl on the cps log but the example you
> >>>>>>> provide runs it on logs/coasters.log. This may need to be
> >>>>>>> changed. Also, should provide the log4j setting that is required
> >>>>>>> to generate the Cpu line with the worker ip address just to
> >>>>>>> clarify that this line should be set for this script to work.
> >>>>>>>
> >>>>>>> For reference, this line:
> >>>>>>> log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG
> >>>>>>
> >>>>>> --
> >>>>>> Justin M Wozniak
> >>>>>
> >>>>> _______________________________________________
> >>>>> Swift-devel mailing list
> >>>>> Swift-devel at ci.uchicago.edu
> >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >>>>
> >>>> _______________________________________________
> >>>> Swift-devel mailing list
> >>>> Swift-devel at ci.uchicago.edu
> >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >>>
> >>> _______________________________________________
> >>> Swift-devel mailing list
> >>> Swift-devel at ci.uchicago.edu
> >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >>
> >> _______________________________________________
> >> Swift-devel mailing list
> >> Swift-devel at ci.uchicago.edu
> >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >>
> >>
> >>
> >> _______________________________________________
> >> Swift-devel mailing list
> >> Swift-devel at ci.uchicago.edu
> >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From jonmon at mcs.anl.gov Sat Mar 3 17:21:47 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Sat, 3 Mar 2012 17:21:47 -0600
Subject: [Swift-devel] coasters-hosts.pl script
In-Reply-To: <2087527384.58120.1330789215663.JavaMail.root@zimbra.anl.gov>
References: <2087527384.58120.1330789215663.JavaMail.root@zimbra.anl.gov>
Message-ID: <0B7A1E46-9407-4DB1-B4DD-7B10F5B761B5@mcs.anl.gov>

Sure. I can help him debug and get the ip addresses he needs.
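
For anyone following along, the worker-log scan discussed in this thread can be sketched as below. This is not the actual coasters-hosts.pl; it is a Python illustration under the assumption that each worker log has a "Running on node <address>" line as in the excerpts quoted above, and worker_hosts is a hypothetical name. Entries such as "(none)" are counted and set aside rather than written to the host list.

```python
import re

# The worker log line quoted in this thread: "... INFO - Running on node <addr>"
NODE_LINE = re.compile(r"Running on node (\S+)")
# A dotted-quad IPv4 address (range of each octet is not checked).
IPV4 = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")


def worker_hosts(log_lines):
    """Return (unique worker addresses in first-seen order, skipped count).

    A line is skipped when the reported node name is not a dotted IPv4
    address, for example "(none)".
    """
    hosts = []
    skipped = 0
    for line in log_lines:
        match = NODE_LINE.search(line)
        if not match:
            continue
        addr = match.group(1)
        if IPV4.match(addr):
            if addr not in hosts:
                hosts.append(addr)
        else:
            skipped += 1
    return hosts, skipped
```

Returning the skipped count alongside the host list makes it obvious when a kernel profile is reporting "(none)" instead of an address, rather than silently producing a short or empty worker-hosts.txt.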

On Mar 3, 2012, at 9:40 AM, Michael Wilde wrote:

> Jon, thanks - I missed your note when I signed off last night. We did a few more tests and verified that on the default zeptoos kernel, hostname was returning a numeric dotted IP address like 172.nnn.nnn.nnn, while on the special kernel fixed for Mosa it was returning "(none)". So it seems like some config issue in that kernel profile.
>
> Emalyan was going to try to patch the call to hostname with a script that pulls the IP address from ifconfig or some other source.
>
> He was also going to try the compute node login procedure (ssh-telnet), and report if it still doesnt work, as that would be handy for debugging in this case.
>
> For the moment we were debugging this with cqsub jobs. Im gonna drop out and leave this to you and Emalayan. (I just happened to be online when he reported this problem).
>
> Thanks,
>
> - Mike
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>

From svemalayan at yahoo.com Sat Mar 3 20:25:27 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Sat, 3 Mar 2012 18:25:27 -0800 (PST)
Subject: [Swift-devel] hostname returns none in Surveyor
Message-ID: <1330827927.13278.YahooMailNeo@web39502.mail.mud.yahoo.com>

Hi All,

I am trying to run some experiments in Surveyor. The software I am using gets the IP-address of compute-nodes using hostname command. With zepto-vn-eval/mosatest profile hostname command returns none.
But with zeptoos profile hostname returns the correct IP address. Is this due to some configuration issue in the zepto-vn-eval/mosatest profile? As a workaround I tried to use ifconfig with both profiles, but it seems ifconfig is not returning the correct IP address. Is there any command or file which I can use to retrieve the hostname on compute nodes?

I have pasted the console output with both profiles below. Please let me know if you need more details.

Thank you
Emalayan

=======================With zeptoos profile ===============================

/ # hostname
172.18.3.19
/ #
/ # cat /proc/sys/kernel/hostname
172.18.3.19
/ #
/ #
/ # ifconfig -a
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.1.64  P-t-P:192.168.1.254  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:65535  Metric:1
          RX packets:2662 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1772 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:140206 (136.9 KiB)  TX bytes:125412 (122.4 KiB)

=======================With zepto-vn-eval/mosatest profile ===============================

/etc # hostname
(none)
/etc #
/etc # cat /proc/sys/kernel/hostname
(none)
/etc #
/etc # ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:80:46:00:00:00
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

eth1      Link encap:Ethernet  HWaddr 00:80:47:00:00:00
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

sit0      Link encap:IPv6-in-IPv4
          NOARP  MTU:1480  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.1.64  P-t-P:192.168.1.254  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:65535  Metric:1
          RX packets:965 errors:0 dropped:0 overruns:0 frame:0
          TX packets:627 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:50984 (49.7 KiB)  TX bytes:50530 (49.3 KiB)

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From svemalayan at yahoo.com Sat Mar 3 20:30:00 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Sat, 3 Mar 2012 18:30:00 -0800 (PST)
Subject: [Swift-devel] coasters-hosts.pl script
In-Reply-To: <0B7A1E46-9407-4DB1-B4DD-7B10F5B761B5@mcs.anl.gov>
References: <2087527384.58120.1330789215663.JavaMail.root@zimbra.anl.gov> <0B7A1E46-9407-4DB1-B4DD-7B10F5B761B5@mcs.anl.gov>
Message-ID: <1330828200.92896.YahooMailNeo@web39506.mail.mud.yahoo.com>

Hi Mike and Jon,

Thank you for all the guidance. It seems even ifconfig is not returning the IP address with either profile. So I just sent a mail to the zeptoos mailing list to get further help (I cc'd swift-devel too).

In the meantime I am going to proceed with setting up Mosa as an intermediate store.

Thank you
Emalayan

________________________________
From: Jonathan Monette
To: Michael Wilde
Cc: Emalayan Vairavanathan; "swift-devel at ci.uchicago.edu Devel"; "emalayan at ece.ubc.ca"; MosaStore; Justin M Wozniak
Sent: Saturday, 3 March 2012 3:21 PM
Subject: Re: [Swift-devel] coasters-hosts.pl script

Sure. I can help him debug and get the ip addresses he needs.

On Mar 3, 2012, at 9:40 AM, Michael Wilde wrote:

> Jon, thanks - I missed your note when I signed off last night. We did a few more tests and verified that on the default zeptoos kernel, hostname was returning a numeric dotted IP address like 172.nnn.nnn.nnn, while on the special kernel fixed for Mosa it was returning "(none)". So it seems like some config issue in that kernel profile.
>
> Emalyan was going to try to patch the call to hostname with a script that pulls the IP address from ifconfig or some other source.
>
> He was also going to try the compute node login procedure (ssh-telnet), and report if it still doesnt work, as that would be handy for debugging in this case.
>
> For the moment we were debugging this with cqsub jobs. Im gonna drop out and leave this to you and Emalayan. (I just happened to be online when he reported this problem).
> > Thanks, > > - Mike > > > ----- Original Message ----- >> From: "Jonathan Monette" >> To: "Michael Wilde" >> Cc: "Emalayan Vairavanathan" , "swift-devel at ci.uchicago.edu Devel" >> , emalayan at ece.ubc.ca, "MosaStore" , "Justin M Wozniak" >> >> Sent: Friday, March 2, 2012 9:46:07 PM >> Subject: Re: [Swift-devel] coasters-hosts.pl script >> I was just about to send with this exact information. : ) >> >> I do not think that hostname not being in the PATH would cause "none" >> to appear. hostname has to be configured to return what you want it >> to. If it wasn't in the path I think there would be more problems with >> the worker since it would return an error for the binary not being >> found, but I could be wrong there. >> >> But I echo Mike's debugging suggestions to diagnose the problems. Try >> sshing/telnetting to the compute node to check out the environment >> that the worker sees while it is running. >> >> On Mar 2, 2012, at 9:31 PM, Michael Wilde wrote: >> >>> Emalayan, >>> >>> The problem may be due to the hostname command returning something >>> unexpected (perhaps null) on the worker nodes when booted under that >>> kernel profile. >>> >>> These lines are in worker.pl: >>> >>> my $myhost=`hostname`; >>> $myhost =~ s/\s+$//; >>> ... >>> wlog(INFO, "Running on node $myhost\n"); >>> >>> To debug this, it seems useful to be able to login to the compute >>> nodes via the ssh-telnet procedure. That seems not to work last time >>> we tried - perhaps that should be debugged. >>> >>> Also, you could run simple test jobs with cqsub to print the output >>> (and location) of the hostname command. >>> >>> Perhaps hostname is not in the PATH for worker nodes booted under >>> that kernel??? >>> >>> I recall in the distant past we used to need to do various PATH and >>> LD_LIBRARY_PATH initialization steps to get the right /bin and >>> /usr/bin for the ZeptoOS nodes. Maybe something is broken in that >>> regard for the alternate kernel profile? 
>>> >>> - Mike >>> >>> >>> >>> ----- Original Message ----- >>>> From: "Emalayan Vairavanathan" >>>> To: "Jonathan Monette" , "Justin M Wozniak" >>>> >>>> Cc: "swift-devel at ci.uchicago.edu Devel" >>>> , emalayan at ece.ubc.ca, "MosaStore" >>>> >>>> Sent: Friday, March 2, 2012 9:13:30 PM >>>> Subject: Re: [Swift-devel] coasters-hosts.pl script >>>> Hi Jon, >>>> >>>> >>>> Thank you again for your time and very quick fix. I tested the fix >>>> with zeptoos and zepto-vn-eval/mosatest kernel profiles. The fix >>>> only >>>> worked with zeptoos. It did not work with zepto-vn-eval/mosatest >>>> (This >>>> profile contains some Mosastore related bug fixes so to run >>>> MosaStore >>>> we need this profile ). >>>> >>>> >>>> >>>> The reason is: >>>> >>>> >>>> I can generate worker-hosts.txt only with zeptoos and it did not >>>> work >>>> with zepto-vn-eval/mosatest. This is because coasters-hosts.pl >>>> extract >>>> worker IP address from workers log files. But with >>>> zepto-vn-eval/mosatest worker log files didnt contain the IP >>>> address. >>>> Please see the log messages attached below. >>>> >>>> >>>> >>>> >>>> >>>> Do you have any idea ? It took few hours for me to narrow down the >>>> problem and find out that the issue is with kernel-profile . I hope >>>> this information will help you. 
>>>> >>>> >>>> >>>> >>>> >>>> Thank you >>>> Emalayan >>>> >>>> >>>> >>>> >>>> With zeptoos : >>>> >>>> >>>> 2012/03/03 02:41:01.492 INFO - 2012.0303.023831.32459 Logging >>>> started: >>>> Sat Mar 3 02:41:01 2012 >>>> 2012/03/03 02:41:01.492 INFO - 2012.0303.023831.32459 Logging >>>> started: >>>> Sat Mar 3 02:41:01 2012 >>>> 2012/03/03 02:41:01.494 INFO - Running on node 172.18.1.19 >>>> 2012/03/03 02:41:01.494 DEBUG - uri=http://172.17.3.12:22346 >>>> 2012/03/03 02:41:01.494 DEBUG - scheme=http >>>> 2012/03/03 02:41:01.495 DEBUG - host=172.17.3.12 >>>> 2012/03/03 02:41:01.495 DEBUG - port=22346 >>>> 2012/03/03 02:41:01.495 DEBUG - blockid=2012.0303.023831.32459 >>>> >>>> >>>> With zepto-vn-eval/mosatest: >>>> >>>> >>>> 2012/03/03 02:50:40.667 INFO - 2012.0303.024814.15474 Logging >>>> started: >>>> Sat Mar 3 02:50:40 2012 >>>> 2012/03/03 02:50:40.683 INFO - Running on node (none) >>>> 2012/03/03 02:50:40.684 DEBUG - uri=http://172.17.3.12:22346 >>>> 2012/03/03 02:50:40.684 DEBUG - scheme=http >>>> 2012/03/03 02:50:40.685 DEBUG - host=172.17.3.12 >>>> 2012/03/03 02:50:40.685 DEBUG - port=22346 >>>> 2012/03/03 02:50:40.686 DEBUG - blockid=2012.0303.024814.15474 >>>> >>>> >>>> >>>> >>>> >>>> >>>> From: Jonathan Monette >>>> To: Justin M Wozniak >>>> Cc: "swift-devel at ci.uchicago.edu Devel" >>>> ; >>>> emalayan at ece.ubc.ca >>>> Sent: Friday, 2 March 2012 2:21 PM >>>> Subject: Re: [Swift-devel] coasters-hosts.pl script >>>> >>>> Emalayan, >>>> We believe we have fixed the issue. You can copy the new >>>> coasters-hosts.pl script from >>>> ~jonmon/surveyor/worker-init-test/coasters-hosts.pl >>>> >>>> This script reads the worker logs located in the logs directory. 
>>>> The >>>> steps to run are as follows: >>>> start-coaster-service >>>> >>>> ./coasters-hosts.pl logs/worker-*.log > worker-hosts.txt >>>> >>>> You MUST clean out the worker logs after you before you start a new >>>> coaster service to make sure the script searches the right worker >>>> log >>>> files. This may not be ideal at the moment but this will help get >>>> you >>>> started. If you have any other questions feel free to ask. We will >>>> need to update the mosaswift site with the new information, we will >>>> do >>>> this soon. >>>> >>>> On Mar 2, 2012, at 11:26 AM, Jonathan Monette wrote: >>>> >>>>> Can we match this line: 2012/03/02 17:16:04.712 INFO - Running on >>>>> node 172.18.1.83 from the worker log, >>>>> instead of this line: 2012-03-02 17:21:25,214+0000 DEBUG Cpu >>>>> worker >>>>> started: block=2012.0302.171344.704 host=172.18.1.83 id=0 from the >>>>> cps log? >>>>> >>>>> They both provide the same ip addresses. And the worker log always >>>>> has that ip address before the cps log does. >>>>> >>>>> On Mar 2, 2012, at 11:15 AM, Jonathan Monette wrote: >>>>> >>>>>> That fix still did not work. I had moved it to the same spot. It >>>>>> is >>>>>> still waiting for the worker-init.pl script to finish before the >>>>>> ip >>>>>> addresses are printed to the cps log. Those ip addresses are what >>>>>> is needed by the coaster-hosts.pl script to finish. If I create >>>>>> an >>>>>> empty file for the coaster-host.pl script to read, then the work >>>>>> continues and the ip addresses show up in the cps log. >>>>>> >>>>>> Why is log4j waiting to add those lines to the cps log after the >>>>>> worker-init.pl script is finished? >>>>>> >>>>>> On Mar 2, 2012, at 11:05 AM, Jonathan Monette wrote: >>>>>> >>>>>>> Thanks, in my copy I thought I had moved the reconnect to before >>>>>>> the init-cmd and it still wasn't working. I will test with your >>>>>>> change. I just verified that it was indeed waiting for the >>>>>>> worker-init.pl script to finish. 
I created an empty file for the >>>>>>> script to read and it finished connecting and the ip addresses I >>>>>>> needed were added to the cps log. I will also be testing your >>>>>>> fix. >>>>>>> >>>>>>> On Mar 2, 2012, at 11:01 AM, Justin M Wozniak wrote: >>>>>>> >>>>>>>> >>>>>>>> Yes- I must have tested this with a different log file. I just >>>>>>>> checked in and installed in ~wozniak/Public a fix for this that >>>>>>>> launches WORKER_INIT_CMD after the reconnect(). I am a little >>>>>>>> worried about time outs but it works so far. I will continue >>>>>>>> testing... >>>>>>>> Justin >>>>>>>> >>>>>>>> On Thu, 1 Mar 2012, Jonathan Monette wrote: >>>>>>>> >>>>>>>>> Justin, >>>>>>>>> So I have been trying to help Emalayan get the host list file >>>>>>>>> for the worker-init.pl script. It seems the cps log file is >>>>>>>>> not >>>>>>>>> providing the ip addresses for the coasters-hosts.pl script. I >>>>>>>>> thought this was maybe because we did not have the correct >>>>>>>>> log4j >>>>>>>>> setting set but we have the Coaster service Cpu set to DEBUG. >>>>>>>>> So >>>>>>>>> for some reason the workers are not connecting to the service. >>>>>>>>> When I comment out the export WORKER_ENVIRONEMTN="?" line in >>>>>>>>> the >>>>>>>>> coaster-service.conf file I see the workers connect and the >>>>>>>>> cps >>>>>>>>> log file shows there ip addresses. However when setting this >>>>>>>>> line it seems they are not connecting. >>>>>>>>> >>>>>>>>> Emalayan thought there might be some sort of circular >>>>>>>>> dependency >>>>>>>>> going with the host-list file and the worker. The worker >>>>>>>>> requires the host-list file so that it can run the >>>>>>>>> worker-init.pl script and then connect but the host-list file >>>>>>>>> cannot be generated because the workers cannot connect. I >>>>>>>>> noticed in your swift-test directory the cps files did have >>>>>>>>> the >>>>>>>>> ip addresses set and coasters-hosts.pl found the ip addresses >>>>>>>>> and reported them. 
Did you try that test with setting the >>>>>>>>> WORKER_ENVIRONMENT variable in the coaster-service.conf file? >>>>>>>>> Any idea what may be happening? The job is running when >>>>>>>>> looking >>>>>>>>> under cqstat. >>>>>>>>> >>>>>>>>> A side note: At the mosaswift site, your example talks about >>>>>>>>> running the coasters-hosts.pl on the cps log but the example >>>>>>>>> you >>>>>>>>> provide runs it on logs/coasters.log. This may need to be >>>>>>>>> changed. Also, should provide the log4j setting that is >>>>>>>>> required >>>>>>>>> to generate the Cpu line with the worker ip address just to >>>>>>>>> clarify that this line should be set for this script to work. >>>>>>>>> >>>>>>>>> For reference, this line: >>>>>>>>> log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG >>>>>>>> >>>>>>>> -- >>>>>>>> Justin M Wozniak >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Swift-devel mailing list >>>>>>> Swift-devel at ci.uchicago.edu >>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>> >>> -- >>> Michael Wilde >>> Computation Institute, University of Chicago >>> Mathematics and Computer Science
Division >>> Argonne National Laboratory >>> > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory >

From svemalayan at yahoo.com Sat Mar 3 21:39:31 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Sat, 3 Mar 2012 19:39:31 -0800 (PST)
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <4F52D8D0.9020602@mcs.anl.gov>
References: <1330827927.13278.YahooMailNeo@web39502.mail.mud.yahoo.com> <4F52D8D0.9020602@mcs.anl.gov>
Message-ID: <1330832371.53342.YahooMailNeo@web39505.mail.mud.yahoo.com>

Hi Kaz,

Thank you for your reply. BG_IP returns an IP address with the zepto-vn-eval/mosatest profile.

Anyway, a quick question: it seems the IP returned on a compute node with the zeptoos profile is different from that of the associated I/O node (please see below). Maybe I didn't understand something correctly, and I would be very happy if you could provide some explanation for my observation.

In IO-node

/gpfs/home/emalayan $ hostname
ion-5
/gpfs/home/emalayan $ hostname -i
172.16.3.5

In compute node

/ # hostname
172.18.1.19
/ # cat /proc/personality.sh | grep 'BG_IP'
BG_IP=172.18.1.19

Thank you
Emalayan

________________________________
From: Kazutomo Yoshii
To: zeptoos at lists.mcs.anl.gov
Sent: Saturday, 3 March 2012 6:52 PM
Subject: Re: [ZeptoOS] hostname returns none in Surveyor

Hi Emalayan,

The zeptoos profile returns the IP address of associated I/O node,
which is kind of wrong in my opinion (influence of IBM CNK).
ifconfig on compute nodes returns CN's IP address, which is correct.
e.g. tun0 192.168.1.64

If you want to find associated ION's IP address from CNs,
do something like this.

$ grep BG_IP= /proc/personality.sh

- kaz

On 03/03/2012 08:25 PM, Emalayan Vairavanathan wrote:
> Hi All,
>
> I am trying to run some experiments in Surveyor. The software I am using
> gets the IP-address of compute-nodes using hostname command.
>
> With zepto-vn-eval/mosatest profile hostname command returns none.
> But with zeptoos profile hostname returns the correct IP address.
>
> Is this due to some configuration issues in zepto-vn-eval/mosatest
> profile? As a workaround I tried to use ifconfig with both profiles, but
> it seems ifconfig is not returning the correct IP address.
>
> Is there any command / files which I can use to retrieve the hostname
> on compute nodes? I have pasted the console output with both profiles
> below. Please let me know if you need more details.
>
> Thank you
> Emalayan
>
>
> =======================With zeptoos profile ===============================
>
> / # hostname
> 172.18.3.19
> / #
> / # cat /proc/sys/kernel/hostname
> 172.18.3.19
> / #
> / #
> / # ifconfig -a
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1 Mask:255.0.0.0
>           UP LOOPBACK RUNNING MTU:16436 Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> tun0      Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>           inet addr:192.168.1.64 P-t-P:192.168.1.254 Mask:255.255.255.255
>           UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1
>           RX packets:2662 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1772 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:500
>           RX bytes:140206 (136.9 KiB) TX bytes:125412 (122.4 KiB)
>
>
>
> =======================With zepto-vn-eval/mosatest profile ===============================
>
> /etc # hostname
> (none)
> /etc #
> /etc # cat /proc/sys/kernel/hostname
> (none)
> /etc #
> /etc # ifconfig -a
> eth0      Link encap:Ethernet HWaddr 00:80:46:00:00:00
>           BROADCAST MULTICAST MTU:1500 Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> eth1      Link encap:Ethernet HWaddr 00:80:47:00:00:00
>           BROADCAST MULTICAST MTU:1500 Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1 Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING MTU:16436 Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> sit0      Link encap:IPv6-in-IPv4
>           NOARP MTU:1480 Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> tun0      Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>           inet addr:192.168.1.64 P-t-P:192.168.1.254 Mask:255.255.255.255
>           UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1
>           RX packets:965 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:627 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:500
>           RX bytes:50984 (49.7 KiB) TX bytes:50530 (49.3 KiB)
>

From wilde at mcs.anl.gov Sat Mar 3 21:39:39 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 3 Mar 2012 21:39:39 -0600 (CST)
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <4F52D8D0.9020602@mcs.anl.gov>
Message-ID: <2051445639.58964.1330832379134.JavaMail.root@zimbra.anl.gov>

Emalayan,

I wasn't paying much attention to the actual IP address returned by hostname in the zeptoos profile.

Since these are the addresses that Mosa will communicate over, I think you *do* want them to be the 192.168.1.* addresses of the nodes on the torus network (in other words tun0).
So, since both profiles return 192.168.1.64 for the tun0 IP, I think that's what you should use. So try replacing `hostname` in worker.pl with something like:

  `ifconfig | grep 192.168 | sed -e 's/^ *inet addr://' -e 's/ .*//'`

You may have to adapt this a bit to meet your needs. I'm assuming that the only code that will use these IPs is MosaStore.

- Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From svemalayan at yahoo.com Sun Mar 4 01:40:53 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Sat, 3 Mar 2012 23:40:53 -0800 (PST)
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <2051445639.58964.1330832379134.JavaMail.root@zimbra.anl.gov>
References: <4F52D8D0.9020602@mcs.anl.gov> <2051445639.58964.1330832379134.JavaMail.root@zimbra.anl.gov>
Message-ID: <1330846853.74740.YahooMailNeo@web39502.mail.mud.yahoo.com>

Thank you very much Mike. I agree with your suggestion. I can do that in worker.pl.

Thank you
Emalayan

________________________________
From: Michael Wilde
To: emalayan at ece.ubc.ca
Cc: swift-devel
Sent: Saturday, 3 March 2012 7:39 PM
Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor

Emalayan, I wasn't paying much attention to the actual IP address returned by hostname in the zeptoos profile.
Since these are the addresses that Mosa will communicate over, I think you *do* want them to be the 192.168.1.* addresses of the nodes on the torus network (in other words tun0).

So, since both profiles return 192.168.1.64 for the tun0 IP, I think that's what you should use. So try replacing `hostname` in worker.pl with something like:

  `ifconfig | grep 192.168 | sed -e 's/^ *inet addr://' -e 's/ .*//'`

You may have to adapt this a bit to meet your needs. I'm assuming that the only code that will use these IPs is MosaStore.

- Mike

From wilde at mcs.anl.gov Sun Mar 4 10:24:38 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 4 Mar 2012 10:24:38 -0600 (CST)
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <1330846853.74740.YahooMailNeo@web39502.mail.mud.yahoo.com>
Message-ID: <1342057611.59431.1330878278821.JavaMail.root@zimbra.anl.gov>

Zhao,

Can you tell us if the nodes on the torus network are accessed over the 192.168 network? I just realized they can't all be on the 192.168.1 subnet, so I hope I suggested the right network here.
Thanks,

- Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From benc at hawaga.org.uk Sun Mar 4 11:26:10 2012
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 4 Mar 2012 18:26:10 +0100
Subject: [Swift-devel] Question about retry behavior
In-Reply-To: <1049547380.54126.1330704615353.JavaMail.root@zimbra.anl.gov>
References: <1049547380.54126.1330704615353.JavaMail.root@zimbra.anl.gov>
Message-ID:

On Mar 2, 2012, at 5:10 PM, Michael Wilde wrote:

> Good points, Ioan - I'd forgotten about that part of the Falkon work. Seems like per-worker fault analysis is a good thing, but that higher level analysis and actions are also needed. Maybe per-worker and per-site analysis and down-ability.

I've wondered what this might look like if you did "proper" stats on what was happening - there are all these variables, like choice of worker, the job itself, properties of that job (such as which app it's trying to run). I've wondered if you could usefully extract and use information like "this app always fails on these workers", or "this job always fails, no matter where it is run".

--

From zhaozhang at uchicago.edu Sun Mar 4 12:18:28 2012
From: zhaozhang at uchicago.edu (ZHAO ZHANG)
Date: Sun, 04 Mar 2012 12:18:28 -0600
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <1342057611.59431.1330878278821.JavaMail.root@zimbra.anl.gov>
References: <1342057611.59431.1330878278821.JavaMail.root@zimbra.anl.gov>
Message-ID: <4F53B1F4.7040300@uchicago.edu>

Hi, Mike

With 192.168.1.*, we could only access the tree network. In order to use the torus network, we need to use the 12.x.y.z+1 ip address.
(x, y, z here are the coordinates of the compute node.) The code below could bring the torus ip address up.

IP=""

# Configure eth1 with the torus address 12.x.y.(z+1) and record it in IP.
set_torus_ip() {
    x=$1
    y=$2
    z=`expr $3 + 1`
    ifconfig eth1 12.$x.$y.$z netmask 255.0.0.0 mtu 8996 -arp
    IP=12.$x.$y.$z
}

# Take the value between the first pair of double quotes on the BG_PSETORG line.
BG_PSETORG=`cat /proc/personality.sh | grep BG_PSETORG | cut -d '"' -f 2`
echo ${BG_PSETORG} >> /dev/shm/localip
# Left unquoted on purpose so the value splits into the x, y, z arguments.
set_torus_ip $BG_PSETORG

best
zhao

On 3/4/2012 10:24 AM, Michael Wilde wrote: > Zhao, > > Can you tell us if the nodes on the torus network are accessed over the 192.168 network? I just realized they can't all be on the 192.168.1 subnet, so I hope I suggested the right network here. > > Thanks, > > - Mike > > ----- Original Message ----- >> From: "Emalayan Vairavanathan" >> To: swift-devel at ci.uchicago.edu >> Sent: Sunday, March 4, 2012 1:40:53 AM >> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor >> Thank you very much Mike. I agree with your suggestion. I can do that >> in worker.pl. >> >> >> Thank you >> Emalayan >> >> >> >> >> >> >> From: Michael Wilde >> To: emalayan at ece.ubc.ca >> Cc: swift-devel >> Sent: Saturday, 3 March 2012 7:39 PM >> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor >> >> Emalayan, >> >> I wasn't paying much attention to the actual IP address returned by >> hostname in the zeptoos profile. >> >> Since these are the addresses that Mosa will communicate over, I think >> you *do* want them to be the 192.168.1.* addresses of the nodes on the >> torus network (in other words tun0). >> >> So, since both profiles return 192.168.1.64 for the tun0 IP, I think >> that's what you should use. So try replacing `hostname` in worker.pl >> with something like: >> >> `ifconfig | grep 192.168 | sed -e 's/^ *inet addr://' -e 's/ .*//'` >> >> You may have to adapt this a bit to meet your needs. I'm assuming that >> the only code that will use these IPs is MosaStore.
>> >> - Mike >> >> >> ----- Original Message ----- >>> From: "Kazutomo Yoshii"< kazutomo at mcs.anl.gov> >>> To: zeptoos at lists.mcs.anl.gov >>> Sent: Saturday, March 3, 2012 8:52:00 PM >>> Subject: Re: [ZeptoOS] hostname returns none in Surveyor >>> Hi Emalayan, >>> >>> The zeptoos profile returns the IP address of associated I/O node, >>> which is kind of wrong in my opinion (influence of IBM CNK). >>> ifconfig on compute nodes returns CN's IP address, which is correct. >>> e.g. tun0 192.168.1.64 >>> >>> If you want to find associated ION's IP address from CNs, >>> do something like this. >>> >>> $ grep BG_IP= /proc/personality.sh >>> >>> - kaz >>> >>> On 03/03/2012 08:25 PM, Emalayan Vairavanathan wrote: >>>> Hi All, >>>> >>>> I am trying to run some experiments in Surveyor. The software I am >>>> using >>>> gets the IP-address of compute-nodes using hostname command. >>>> >>>> With zepto-vn-eval/mosatest profile hostname command returns none. >>>> But with zeptoos profile hostname returns the correct IP address. >>>> >>>> Is this due to some configuration issues in zepto-vn-eval/mosatest >>>> profile?As a workaround I tired to use ifconfig with both >>>> profiles, >>>> but >>>> it seems ifconfig is not returning the correct IP address. >>>> >>>> Is there any command / files which I can used to retrieve the >>>> hostname >>>> on compute nodes? I have pasted the console output with both >>>> profiles >>>> below. Please let me know if you need more details. 
>>>> >>>> Thank you >>>> Emalayan >>>> >>>> >>>> =======================With zeptoos profile >>>> =============================== >>>> >>>> / # hostname >>>> 172.18.3.19 >>>> / # >>>> / # cat /proc/sys/kernel/hostname >>>> 172.18.3.19 >>>> / # >>>> / # >>>> / # ifconfig -a >>>> lo Link encap:Local Loopback >>>> inet addr:127.0.0.1 Mask:255.0.0.0 >>>> UP LOOPBACK RUNNING MTU:16436 Metric:1 >>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>>> collisions:0 txqueuelen:0 >>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>>> >>>> tun0 Link encap:UNSPEC HWaddr >>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 >>>> inet addr:192.168.1.64 P-t-P:192.168.1.254 Mask:255.255.255.255 >>>> UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1 >>>> RX packets:2662 errors:0 dropped:0 overruns:0 frame:0 >>>> TX packets:1772 errors:0 dropped:0 overruns:0 carrier:0 >>>> collisions:0 txqueuelen:500 >>>> RX bytes:140206 (136.9 KiB) TX bytes:125412 (122.4 KiB) >>>> >>>> >>>> >>>> =======================With zepto-vn-eval/mosatest profile >>>> =============================== >>>> >>>> /etc # hostname >>>> (none) >>>> /etc # >>>> /etc # cat /proc/sys/kernel/hostname >>>> (none) >>>> /etc # >>>> /etc # ifconfig -a >>>> eth0 Link encap:Ethernet HWaddr 00:80:46:00:00:00 >>>> BROADCAST MULTICAST MTU:1500 Metric:1 >>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>>> collisions:0 txqueuelen:1000 >>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>>> >>>> eth1 Link encap:Ethernet HWaddr 00:80:47:00:00:00 >>>> BROADCAST MULTICAST MTU:1500 Metric:1 >>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>>> collisions:0 txqueuelen:1000 >>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>>> >>>> lo Link encap:Local Loopback >>>> inet addr:127.0.0.1 Mask:255.0.0.0 >>>> inet6 addr: ::1/128 Scope:Host >>>> UP LOOPBACK RUNNING 
MTU:16436 Metric:1 >>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>>> collisions:0 txqueuelen:0 >>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>>> >>>> sit0 Link encap:IPv6-in-IPv4 >>>> NOARP MTU:1480 Metric:1 >>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>>> collisions:0 txqueuelen:0 >>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>>> >>>> tun0 Link encap:UNSPEC HWaddr >>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 >>>> inet addr:192.168.1.64 P-t-P:192.168.1.254 Mask:255.255.255.255 >>>> UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1 >>>> RX packets:965 errors:0 dropped:0 overruns:0 frame:0 >>>> TX packets:627 errors:0 dropped:0 overruns:0 carrier:0 >>>> collisions:0 txqueuelen:500 >>>> RX bytes:50984 (49.7 KiB) TX bytes:50530 (49.3 KiB) >>>> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Sun Mar 4 12:53:03 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 4 Mar 2012 12:53:03 -0600 (CST) Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor In-Reply-To: <4F53B1F4.7040300@uchicago.edu> Message-ID: <1555491770.59511.1330887183548.JavaMail.root@zimbra.anl.gov> Thanks, Zhao. Does this need to run on each node at startup? If so should this logic be integrated into the worker startup script, Jon, Justin, Emalayan? 
I've not looked at the current scripts much; I would think that all the BG/P specific logic of enabling the torus network and finding each node's IP address on the torus should be done in the init script rather than in the worker.

- Mike

----- Original Message -----
> From: "ZHAO ZHANG"
> To: "Michael Wilde"
> Cc: "Emalayan Vairavanathan" , swift-devel at ci.uchicago.edu
> Sent: Sunday, March 4, 2012 12:18:28 PM
> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
> Hi, Mike
>
> With 192.168.1.*, we could only access the tree network. In order to use
> the torus network, we need to use the 12.x.y.z+1 ip address. (x, y, z
> here is the coordinates of the compute nodes).
> The code below could bring the torus ip address up.
>
> IP=""
> set_torus_ip()
> {
> x=$1
> y=$2
> z=$3
> z=`expr $3 + 1`
> ifconfig eth1 12.$x.$y.$z netmask 255.0.0.0 mtu 8996 -arp
> IP=12.$x.$y.$z
> }
> BG_PSETORG=`cat /proc/personality.sh | grep BG_PSETORG | cut -d '"' -f 2`
> echo ${BG_PSETORG} >> /dev/shm/localip
> set_torus_ip $BG_PSETORG
>
> best
> zhao

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From zhaozhang at uchicago.edu Sun Mar 4 13:17:18 2012
From: zhaozhang at uchicago.edu (ZHAO ZHANG)
Date: Sun, 04 Mar 2012 13:17:18 -0600
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <1555491770.59511.1330887183548.JavaMail.root@zimbra.anl.gov>
References: <1555491770.59511.1330887183548.JavaMail.root@zimbra.anl.gov>
Message-ID: <4F53BFBE.3020205@uchicago.edu>

Yes, each compute node needs to run this script to bring up the network interface.

zhao

On 3/4/2012 12:53 PM, Michael Wilde wrote:
> Thanks, Zhao. Does this need to run on each node at startup?
>
> If so should this logic be integrated into the worker startup script, Jon, Justin, Emalayan?
>
> I've not looked at the current scripts much; I would think that all the BG/P specific logic of enabling the torus network and finding each node's IP address on the torus should be done in the init script rather than in the worker.
>
> - Mike

From jonmon at mcs.anl.gov Sun Mar 4 13:30:53 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Sun, 4 Mar 2012 13:30:53 -0600
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <4F53BFBE.3020205@uchicago.edu>
References: <1555491770.59511.1330887183548.JavaMail.root@zimbra.anl.gov> <4F53BFBE.3020205@uchicago.edu>
Message-ID:

This
logic could be added to the worker-init.pl script. It shouldn't be too difficult.

But one thought that has been nagging me: can the workers connect back to the service and actually do work in the other kernel profile? The coaster service needs to know which worker to send jobs to, doesn't it? How does it send the work to the worker if the worker doesn't know its IP to provide to the service? So the worker logic may still need to be changed a bit to work with this kernel profile.

On Mar 4, 2012, at 1:17 PM, ZHAO ZHANG wrote:

> Yes, each compute node needs to run this script to bring up the network
> interface.
>
> zhao

From wilde at mcs.anl.gov Sun Mar 4 13:50:39 2012
From: wilde at mcs.anl.gov (Michael
Wilde)
Date: Sun, 4 Mar 2012 13:50:39 -0600 (CST)
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To:
Message-ID: <245383691.59555.1330890639825.JavaMail.root@zimbra.anl.gov>

John, the workers can connect to the coaster service IP, which they are passed as an argument. They should be able to reach the coaster service via NAT.

Once the workers connect, the service and workers communicate over the bidirectional socket. The service doesn't need to know the workers' IPs.

- Mike

----- Original Message -----
> From: "Jonathan Monette"
> To: "ZHAO ZHANG"
> Cc: "Michael Wilde" , swift-devel at ci.uchicago.edu
> Sent: Sunday, March 4, 2012 1:30:53 PM
> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
> This logic could be added to the worker-init.pl script. It shouldn't
> be too difficult.
>
> But one thought that has been nagging me is can the workers connect
> back to the service and actually do work in the other kernel profile?
> The coaster service needs to know what worker to send jobs to doesn't
> it? How does it send the work to the worker if the worker doesn't know
> its IP to provide to the service. So the worker logic still may need
> to be changed a bit to work with this kernel profile.
>
> On Mar 4, 2012, at 1:17 PM, ZHAO ZHANG wrote:
>
> > Yes, each compute node needs to run this script to bring up the
> > network
> > interface.
> >
> > zhao
> >
> > On 3/4/2012 12:53 PM, Michael Wilde wrote:
> >> Thanks, Zhao. Does this need to run on each node at startup?
> >>
> >> If so should this logic be integrated into the worker startup
> >> script, Jon, Justin, Emalayan?
> >>
> >> I've not looked at the current scripts much; I would think that all
> >> the BG/P specific logic of enabling the torus network and finding
> >> each node's IP address on the torus should be done in the init
> >> script rather than in the worker.
_______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From jonmon at mcs.anl.gov Sun Mar 4 13:52:31 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Sun, 4 Mar 2012 13:52:31 -0600
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <245383691.59555.1330890639825.JavaMail.root@zimbra.anl.gov>
References: <245383691.59555.1330890639825.JavaMail.root@zimbra.anl.gov>
Message-ID:

Ah, I thought the workers provided the service with the IP to communicate with. That makes sense.

On Mar 4, 2012, at 1:50 PM, Michael Wilde wrote:

> Jon, the workers can connect to the coaster service IP, which they are passed as an argument. They should be able to reach the coaster service via NAT. Once the workers connect, the service and workers communicate over the bidirectional socket. The service doesn't need to know the workers' IPs.
>
> - Mike
>
> ----- Original Message -----
>> From: "Jonathan Monette"
>> To: "ZHAO ZHANG"
>> Cc: "Michael Wilde", swift-devel at ci.uchicago.edu
>> Sent: Sunday, March 4, 2012 1:30:53 PM
>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
>> This logic could be added to the worker-init.pl script. It shouldn't
>> be too difficult.
>>
>> But one thought that has been nagging me is: can the workers connect
>> back to the service and actually do work in the other kernel profile?
>> The coaster service needs to know what worker to send jobs to, doesn't
>> it? How does it send the work to the worker if the worker doesn't know
>> its IP to provide to the service? So the worker logic still may need
>> to be changed a bit to work with this kernel profile.
_______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>

From wilde at mcs.anl.gov Sun Mar 4 16:33:52 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 4 Mar 2012 16:33:52 -0600 (CST)
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <4F53BFBE.3020205@uchicago.edu>
Message-ID: <888718346.59715.1330900432048.JavaMail.root@zimbra.anl.gov>

Zhao, with this procedure do you get consecutive host IP addresses starting from 0.0 through 640*64 in the two low order octets?

In other words, does your script just do what this page describes under "IP over Torus":

http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages

Is the "ipfwd.sh" script mentioned there still needed, or does that now happen automatically?

If so, perhaps we can greatly simplify the Mosa startup: we need only pass the max rank of the running job, and Mosa will know that it can use 12.128.0.0, for example. Then we don't need any manual intervention, nor complicated/brittle file-waiting logic.

Zhao, I don't understand why your example is using the 12.0.0.0 network vs. the example on the page above, which uses 10.128.0.0. Can you help me understand what is going on here? Is the "IP over Torus" info on the ZeptoOS wiki outdated? Or does it describe a different technique?

Justin, have you also mastered similar techniques for JETS? Do we need help from the ZeptoOS team on this?
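The address scheme under discussion (12.x.y.z+1, built from a compute node's torus coordinates) can be sketched as below. This is only an illustration under stated assumptions: torus_ip is a hypothetical helper name, the coordinates are made up rather than read from /proc/personality.sh, and the function only prints the address instead of calling ifconfig.

```shell
#!/bin/sh
# Sketch only: derive the torus address from node coordinates x, y, z.
# torus_ip is a hypothetical helper; it prints the address and does not
# configure any interface, so it is safe to try on any machine.
torus_ip() {
    x=$1
    y=$2
    z=$(($3 + 1))        # last octet is the z coordinate plus one
    echo "12.$x.$y.$z"
}

# Made-up coordinates standing in for the BG_PSETORG value.
torus_ip 3 1 7           # prints 12.3.1.8
```

On a real compute node the three coordinates would come from BG_PSETORG, as in the set_torus_ip script quoted in this thread.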
Thanks, - Mike ----- Original Message ----- > From: "ZHAO ZHANG" > To: "Michael Wilde" > Cc: "Emalayan Vairavanathan" , swift-devel at ci.uchicago.edu > Sent: Sunday, March 4, 2012 1:17:18 PM > Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor > Yes, each compute node needs to run this script to bring up the > network > interface. > > zhao > > On 3/4/2012 12:53 PM, Michael Wilde wrote: > > Thanks, Zhao. Does this need to run on each node at startup? > > > > If so should this logic be integrated into the worker startup > > script, Jon, Justin, Emalayan? > > > > Ive not looked at the current scripts much; I would think that all > > the BG/P specific logic of enabling the torus network and finding > > each node's IP address on the torus should be done in the init > > script rather than in the worker. > > > > - Mike > > > > ----- Original Message ----- > >> From: "ZHAO ZHANG" > >> To: "Michael Wilde" > >> Cc: "Emalayan Vairavanathan", > >> swift-devel at ci.uchicago.edu > >> Sent: Sunday, March 4, 2012 12:18:28 PM > >> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in > >> Surveyor > >> Hi, Mike > >> > >> With 192.168.1.*, we could only access the tree network. In order > >> to > >> use > >> the torus network, we need to use the 12.x.y.z+1 ip address. (x, y, > >> z > >> here is the coordinates of the compute nodes). > >> The code below could bring the torus ip address up. > >> > >> IP="" > >> set_torus_ip() > >> { > >> x=$1 > >> y=$2 > >> z=$3 > >> z=`expr $3 + 1` > >> ifconfig eth1 12.$x.$y.$z netmask 255.0.0.0 mtu 8996 -arp > >> IP=12.$x.$y.$z > >> } > >> BG_PSETORG=`cat /proc/personality.sh | grep BG_PSETORG | cut -d '"' > >> -f > >> 2` > >> echo ${BG_PSETORG}>> /dev/shm/localip > >> set_torus_ip $BG_PSETORG > >> > >> best > >> zhao > >> > >> On 3/4/2012 10:24 AM, Michael Wilde wrote: > >>> Zhao, > >>> > >>> Can you tell us if the nodes on the torus network are accessed > >>> over > >>> the 192.168 network? 
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From zhaozhang at uchicago.edu Sun
Mar 4 17:57:59 2012
From: zhaozhang at uchicago.edu (ZHAO ZHANG)
Date: Sun, 04 Mar 2012 17:57:59 -0600
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <888718346.59715.1330900432048.JavaMail.root@zimbra.anl.gov>
References: <888718346.59715.1330900432048.JavaMail.root@zimbra.anl.gov>
Message-ID: <4F540187.4010402@uchicago.edu>

Hi, Mike

The 12.x.y.z+1 interface uses the IBM Ethernet-over-torus driver. The 10.128 interface documented at http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages uses Kamil's MPI-based IP communication. The difference is that on the 10.128 network Kamil's fix takes up one core out of four, while the 12.x.y.z+1 interface does not.

best
zhao

On 3/4/2012 4:33 PM, Michael Wilde wrote:
> Zhao, with this procedure do you get consecutive host IP addresses starting from 0.0 through 640*64 in the two low order octets?
>
> In other words, does your script just do what this page describes under "IP over Torus":
>
> http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages
>
> Is the "ipfwd.sh" script mentioned there still needed, or does that now happen automatically?
>
> If so, perhaps we can greatly simplify the Mosa startup: we need only pass the max rank of the running job, and Mosa will know that it can use 12.128.0.0, for example. Then we don't need any manual intervention, nor complicated/brittle file-waiting logic.
>
> Zhao, I don't understand why your example is using the 12.0.0.0 network vs. the example on the page above, which uses 10.128.0.0. Can you help me understand what is going on here? Is the "IP over Torus" info on the ZeptoOS wiki outdated? Or does it describe a different technique?
>
> Justin, have you also mastered similar techniques for JETS? Do we need help from the ZeptoOS team on this?
> > Thanks, > > - Mike > > > > ----- Original Message ----- >> From: "ZHAO ZHANG" >> To: "Michael Wilde" >> Cc: "Emalayan Vairavanathan", swift-devel at ci.uchicago.edu >> Sent: Sunday, March 4, 2012 1:17:18 PM >> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor >> Yes, each compute node needs to run this script to bring up the >> network >> interface. >> >> zhao >> >> On 3/4/2012 12:53 PM, Michael Wilde wrote: >>> Thanks, Zhao. Does this need to run on each node at startup? >>> >>> If so should this logic be integrated into the worker startup >>> script, Jon, Justin, Emalayan? >>> >>> Ive not looked at the current scripts much; I would think that all >>> the BG/P specific logic of enabling the torus network and finding >>> each node's IP address on the torus should be done in the init >>> script rather than in the worker. >>> >>> - Mike >>> >>> ----- Original Message ----- >>>> From: "ZHAO ZHANG" >>>> To: "Michael Wilde" >>>> Cc: "Emalayan Vairavanathan", >>>> swift-devel at ci.uchicago.edu >>>> Sent: Sunday, March 4, 2012 12:18:28 PM >>>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in >>>> Surveyor >>>> Hi, Mike >>>> >>>> With 192.168.1.*, we could only access the tree network. In order >>>> to >>>> use >>>> the torus network, we need to use the 12.x.y.z+1 ip address. (x, y, >>>> z >>>> here is the coordinates of the compute nodes). >>>> The code below could bring the torus ip address up. >>>> >>>> IP="" >>>> set_torus_ip() >>>> { >>>> x=$1 >>>> y=$2 >>>> z=$3 >>>> z=`expr $3 + 1` >>>> ifconfig eth1 12.$x.$y.$z netmask 255.0.0.0 mtu 8996 -arp >>>> IP=12.$x.$y.$z >>>> } >>>> BG_PSETORG=`cat /proc/personality.sh | grep BG_PSETORG | cut -d '"' >>>> -f >>>> 2` >>>> echo ${BG_PSETORG}>> /dev/shm/localip >>>> set_torus_ip $BG_PSETORG >>>> >>>> best >>>> zhao >>>> >>>> On 3/4/2012 10:24 AM, Michael Wilde wrote: >>>>> Zhao, >>>>> >>>>> Can you tell us if the nodes on the torus network are accessed >>>>> over >>>>> the 192.168 network? 
I just realized they cant all be on the >>>>> 192.168.1 subnet, so I hope I suggested the right network here. >>>>> >>>>> Thanks, >>>>> >>>>> - Mike >>>>> >>>>> ----- Original Message ----- >>>>>> From: "Emalayan Vairavanathan" >>>>>> To: swift-devel at ci.uchicago.edu >>>>>> Sent: Sunday, March 4, 2012 1:40:53 AM >>>>>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in >>>>>> Surveyor >>>>>> Thank you very much Mike. I agree with you suggestion. I can do >>>>>> that >>>>>> in worker.pl. >>>>>> >>>>>> >>>>>> Thank you >>>>>> Emalayan >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> From: Michael Wilde >>>>>> To: emalayan at ece.ubc.ca >>>>>> Cc: swift-devel >>>>>> Sent: Saturday, 3 March 2012 7:39 PM >>>>>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in >>>>>> Surveyor >>>>>> >>>>>> Emalayan, >>>>>> >>>>>> I wasnt paying much attention to the actual IP address returned >>>>>> by >>>>>> hostname in the zeptoos profile. >>>>>> >>>>>> Since these are the addresses that Mosa will communicate over, I >>>>>> think >>>>>> you *do* want them to be the 192.168.1.* addresses of the nodes >>>>>> on >>>>>> the >>>>>> torus network (in other words tun0). >>>>>> >>>>>> So, since both profiles return 192.168.1.64 for the tun0 IP, I >>>>>> think >>>>>> thats what you should use. So try replacing `hostname` in >>>>>> worker.pl >>>>>> with something like: >>>>>> >>>>>> `ifconfig | grep 192.168 | sed -e 's/^inet addr://' -e 's/ .*//'` >>>>>> >>>>>> You may have to adapt this a bit to meet your needs. Im assuming >>>>>> that >>>>>> the only code that will uses these IPs is MosaStore. 
>>>>>> >>>>>> - Mike >>>>>> >>>>>> >>>>>> ----- Original Message ----- >>>>>>> From: "Kazutomo Yoshii"< kazutomo at mcs.anl.gov> >>>>>>> To: zeptoos at lists.mcs.anl.gov >>>>>>> Sent: Saturday, March 3, 2012 8:52:00 PM >>>>>>> Subject: Re: [ZeptoOS] hostname returns none in Surveyor >>>>>>> Hi Emalayan, >>>>>>> >>>>>>> The zeptoos profile returns the IP address of associated I/O >>>>>>> node, >>>>>>> which is kind of wrong in my opinion (influence of IBM CNK). >>>>>>> ifconfig on compute nodes returns CN's IP address, which is >>>>>>> correct. >>>>>>> e.g. tun0 192.168.1.64 >>>>>>> >>>>>>> If you want to find associated ION's IP address from CNs, >>>>>>> do something like this. >>>>>>> >>>>>>> $ grep BG_IP= /proc/personality.sh >>>>>>> >>>>>>> - kaz >>>>>>> >>>>>>> On 03/03/2012 08:25 PM, Emalayan Vairavanathan wrote: >>>>>>>> Hi All, >>>>>>>> >>>>>>>> I am trying to run some experiments in Surveyor. The software I >>>>>>>> am >>>>>>>> using >>>>>>>> gets the IP-address of compute-nodes using hostname command. >>>>>>>> >>>>>>>> With zepto-vn-eval/mosatest profile hostname command returns >>>>>>>> none. >>>>>>>> But with zeptoos profile hostname returns the correct IP >>>>>>>> address. >>>>>>>> >>>>>>>> Is this due to some configuration issues in >>>>>>>> zepto-vn-eval/mosatest >>>>>>>> profile?As a workaround I tired to use ifconfig with both >>>>>>>> profiles, >>>>>>>> but >>>>>>>> it seems ifconfig is not returning the correct IP address. >>>>>>>> >>>>>>>> Is there any command / files which I can used to retrieve the >>>>>>>> hostname >>>>>>>> on compute nodes? I have pasted the console output with both >>>>>>>> profiles >>>>>>>> below. Please let me know if you need more details. 
>>>>>>>> >>>>>>>> Thank you >>>>>>>> Emalayan >>>>>>>> >>>>>>>> >>>>>>>> =======================With zeptoos profile >>>>>>>> =============================== >>>>>>>> >>>>>>>> / # hostname >>>>>>>> 172.18.3.19 >>>>>>>> / # >>>>>>>> / # cat /proc/sys/kernel/hostname >>>>>>>> 172.18.3.19 >>>>>>>> / # >>>>>>>> / # >>>>>>>> / # ifconfig -a >>>>>>>> lo Link encap:Local Loopback >>>>>>>> inet addr:127.0.0.1 Mask:255.0.0.0 >>>>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1 >>>>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>>>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>>>>>>> collisions:0 txqueuelen:0 >>>>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>>>>>>> >>>>>>>> tun0 Link encap:UNSPEC HWaddr >>>>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 >>>>>>>> inet addr:192.168.1.64 P-t-P:192.168.1.254 Mask:255.255.255.255 >>>>>>>> UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1 >>>>>>>> RX packets:2662 errors:0 dropped:0 overruns:0 frame:0 >>>>>>>> TX packets:1772 errors:0 dropped:0 overruns:0 carrier:0 >>>>>>>> collisions:0 txqueuelen:500 >>>>>>>> RX bytes:140206 (136.9 KiB) TX bytes:125412 (122.4 KiB) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> =======================With zepto-vn-eval/mosatest profile >>>>>>>> =============================== >>>>>>>> >>>>>>>> /etc # hostname >>>>>>>> (none) >>>>>>>> /etc # >>>>>>>> /etc # cat /proc/sys/kernel/hostname >>>>>>>> (none) >>>>>>>> /etc # >>>>>>>> /etc # ifconfig -a >>>>>>>> eth0 Link encap:Ethernet HWaddr 00:80:46:00:00:00 >>>>>>>> BROADCAST MULTICAST MTU:1500 Metric:1 >>>>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>>>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>>>>>>> collisions:0 txqueuelen:1000 >>>>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>>>>>>> >>>>>>>> eth1 Link encap:Ethernet HWaddr 00:80:47:00:00:00 >>>>>>>> BROADCAST MULTICAST MTU:1500 Metric:1 >>>>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>>>>>>> TX packets:0 errors:0 dropped:0 overruns:0 
carrier:0 >>>>>>>> collisions:0 txqueuelen:1000 >>>>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>>>>>>> >>>>>>>> lo Link encap:Local Loopback >>>>>>>> inet addr:127.0.0.1 Mask:255.0.0.0 >>>>>>>> inet6 addr: ::1/128 Scope:Host >>>>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1 >>>>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>>>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>>>>>>> collisions:0 txqueuelen:0 >>>>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>>>>>>> >>>>>>>> sit0 Link encap:IPv6-in-IPv4 >>>>>>>> NOARP MTU:1480 Metric:1 >>>>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>>>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>>>>>>> collisions:0 txqueuelen:0 >>>>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>>>>>>> >>>>>>>> tun0 Link encap:UNSPEC HWaddr >>>>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 >>>>>>>> inet addr:192.168.1.64 P-t-P:192.168.1.254 Mask:255.255.255.255 >>>>>>>> UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1 >>>>>>>> RX packets:965 errors:0 dropped:0 overruns:0 frame:0 >>>>>>>> TX packets:627 errors:0 dropped:0 overruns:0 carrier:0 >>>>>>>> collisions:0 txqueuelen:500 >>>>>>>> RX bytes:50984 (49.7 KiB) TX bytes:50530 (49.3 KiB) >>>>>>>> >>>>>> -- >>>>>> Michael Wilde >>>>>> Computation Institute, University of Chicago >>>>>> Mathematics and Computer Science Division >>>>>> Argonne National Laboratory >>>>>> >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From svemalayan at yahoo.com Sun Mar 4 18:00:48 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Sun, 4 Mar 2012 16:00:48 -0800 (PST) Subject: 
[Swift-devel] [ZeptoOS] hostname returns none in Surveyor In-Reply-To: <888718346.59715.1330900432048.JavaMail.root@zimbra.anl.gov> References: <4F53BFBE.3020205@uchicago.edu> <888718346.59715.1330900432048.JavaMail.root@zimbra.anl.gov> Message-ID: <1330905648.31137.YahooMailNeo@web39508.mail.mud.yahoo.com> Mike, that sounds like a good idea. Zhao, in addition to Mike's questions I have two more questions. 1) Is it possible to get/calculate the MAX_RANK / number of nodes in an allocation from personality.h or some other data structure? 2) Which interface should be configured for the torus? (Does this matter at all?) In your scripts you are configuring eth1. But in http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages tun1 is configured. Thank you Emalayan ________________________________ From: Michael Wilde To: ZHAO ZHANG ; Justin M Wozniak Cc: Emalayan Vairavanathan ; swift-devel at ci.uchicago.edu Sent: Sunday, 4 March 2012 2:33 PM Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor Zhao, with this procedure do you get consecutive host IP addresses starting from 0.0 through 640*64 in the two low order octets? In other words, does your just do what this page describes under "IP over Torus": http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages Is the "ipfwd.sh" script mentioned there still needed, or does that now happen automatically? If so, perhaps we can greatly simplify the Mosa startup: we need only pass the max rank of the running job, and Mosa will know that it can use 12.128.0.0 for example. Then we don't need any manual intervention, nor complicated/brittle file-waiting logic. Zhao, I don't understand why your example is using the 12.0.0.0 network vs the example on the page above which uses 10.128.0.0. Can you help me understand what is going on here? Is the "IP Over Torus" info on the ZeptoOS wiki outdated? Or does it describe a different technique? Justin, have you also mastered similar techniques for JETS?
Do we need help from the ZeptoOS team on this? Thanks, - Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National
Laboratory From zhaozhang at uchicago.edu Sun Mar 4 18:06:40 2012 From: zhaozhang at uchicago.edu (ZHAO ZHANG) Date: Sun, 04 Mar 2012 18:06:40 -0600 Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor In-Reply-To: <1330905648.31137.YahooMailNeo@web39508.mail.mud.yahoo.com> References: <4F53BFBE.3020205@uchicago.edu> <888718346.59715.1330900432048.JavaMail.root@zimbra.anl.gov> <1330905648.31137.YahooMailNeo@web39508.mail.mud.yahoo.com> Message-ID: <4F540390.8010505@uchicago.edu> Hi, Emalayan On 3/4/2012 6:00 PM, Emalayan Vairavanathan wrote: > Mike, that sounds like a good idea. > > Zhao, in addition to Mike's questions I have two more questions. > > 1) Is it possible to get/calculate the MAX_RANK / number of nodes in > an allocation from personality.h or some other data structure? Yes, you could calculate the MAX_RANK from personality.sh. > > 2) Which interface should be configured for the torus? (Does this > matter at all?) > In your scripts you are configuring eth1. But in > http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages tun1 is > configured. To use the torus network, there are two ways. One is to use the 12.x.y.z+1 interface, which we have to configure ourselves. The other way is to use "ipfwd.sh", aka the 10.128 interface. The drawback of the second interface is that it takes up one core for polling, and there are some scalability issues beyond 2K compute nodes as far as I remember. Mosa could use either of them.
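A minimal sketch of the first option (the self-configured 12.x.y.z+1 interface), adapted from the set_torus_ip snippet quoted earlier in this thread. It assumes /proc/personality.sh holds a line of the form BG_PSETORG="x y z", which is inferred from that snippet rather than confirmed; the helper names torus_ip and psetorg_coords are hypothetical. The ifconfig step is left as a comment because it needs root on a compute node.

```shell
#!/bin/sh
# Print the torus address 12.X.Y.(Z+1) for torus coordinates X Y Z.
torus_ip() {
    printf '12.%s.%s.%s\n' "$1" "$2" "$(( $3 + 1 ))"
}

# Print the quoted BG_PSETORG value found in the given personality file.
# Assumption: the file has a line like BG_PSETORG="x y z".
psetorg_coords() {
    grep BG_PSETORG "$1" | cut -d '"' -f 2
}

# Hypothetical use on a compute node (not executed here):
#   ip=$(torus_ip $(psetorg_coords /proc/personality.sh))
#   ifconfig eth1 "$ip" netmask 255.0.0.0 mtu 8996 -arp
```

Keeping the address computation separate from the ifconfig call means the same function could feed whatever host list Mosa needs, without touching the interface again.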
best zhao > > Thank you > Emalayan > > ------------------------------------------------------------------------ > *From:* Michael Wilde > *To:* ZHAO ZHANG ; Justin M Wozniak > > *Cc:* Emalayan Vairavanathan ; > swift-devel at ci.uchicago.edu > *Sent:* Sunday, 4 March 2012 2:33 PM > *Subject:* Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor > > Zhao, with this procedure do you get consecutive host IP addresses > starting from 0.0 through 640*64 in the two low order octets? > > In other words, does your just do what this page describes under "IP > over Torus": > > http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages > > Is the "ipfwd.sh" script mentioned there still needed, or does that > now happen automatically? > > If so, perhaps we can greatly simplify the Mosa startup: we need only > pass the max rank of the running job, and Mosa will know that it can > use 12.128.0.0 for example. Then we don't need any manual > intervention, nor complicated/brittle file-waiting logic. > > Zhao, I don't understand why your example is using the 12.0.0.0 network > vs the example on the page above which uses 10.128.0.0. Can you help > me understand what is going on here? Is the "IP Over Torus" info on > the ZeptoOS wiki outdated? Or does it describe a different technique? > > Justin, have you also mastered similar techniques for JETS? Do we > need help from the ZeptoOS team on this? > > Thanks, > > - Mike > > > > >>>> _______________________________________________ > > >>>> Swift-devel mailing list > > >>>> Swift-devel at ci.uchicago.edu > > >>>>
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>

From svemalayan at yahoo.com Sun Mar 4 18:23:08 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Sun, 4 Mar 2012 16:23:08 -0800 (PST)
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <4F540390.8010505@uchicago.edu>
References: <4F53BFBE.3020205@uchicago.edu> <888718346.59715.1330900432048.JavaMail.root@zimbra.anl.gov> <1330905648.31137.YahooMailNeo@web39508.mail.mud.yahoo.com> <4F540390.8010505@uchicago.edu>
Message-ID: <1330906988.77015.YahooMailNeo@web39502.mail.mud.yahoo.com>

Zhao, Thank you very much for the answers.

One more question: :)

Do you know how I can calculate this? Does this method work regardless of the number of nodes allocated (even with any fraction of a pset)?

Thank you
Emalayan

________________________________
From: ZHAO ZHANG
To: Emalayan Vairavanathan
Cc: Michael Wilde ; Justin M Wozniak ; "swift-devel at ci.uchicago.edu"
Sent: Sunday, 4 March 2012 4:06 PM
Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor

Hi, Emalayan

On 3/4/2012 6:00 PM, Emalayan Vairavanathan wrote:
>Mike, that sounds like a good idea.
>
>Zhao, In addition to Mike's questions I have two more questions.
>
>1) Is it possible to get/calculate the MAX_RANK / number of nodes in an allocation from personality.h or some other data structure?
Yes, you could calculate the MAX_RANK from personality.sh.
>
>2) Which interface should be configured to have Torus? (Does this matter at all?)
>   In your scripts you are configuring eth1. But in http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages tun1 is configured.
To use the torus network, there are two ways. One is to use the 12.x.y.z+1 interface, which we have to configure ourselves. The other way is to use "ipfwd.sh", aka the 10.128 interface. The drawback of the second interface is that it takes up one core for polling, and there is some scalability issue beyond 2K compute nodes, as far as I remember. Mosa could use either of them.

best
zhao
>
>Thank you
>Emalayan
>
>________________________________
> From: Michael Wilde
>To: ZHAO ZHANG ; Justin M Wozniak
>Cc: Emalayan Vairavanathan ; swift-devel at ci.uchicago.edu
>Sent: Sunday, 4 March 2012 2:33 PM
>Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
>
>Zhao, with this procedure do you get consecutive host IP addresses starting from 0.0 through 640*64 in the two low order octets?
>
>In other words, does your script just do what this page describes under "IP over Torus":
>
>  http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages
>
>Is the "ipfwd.sh" script mentioned there still needed, or does that now happen automatically?
>
>If so, perhaps we can greatly simplify the Mosa startup: we need only pass the max rank of the running job, and Mosa will know that it can use 12.128.0.0 for example. Then we don't need any manual intervention, nor complicated/brittle file-waiting logic.
>
>Zhao, I don't understand why your example is using the 12.0.0.0 network vs the example on the page above which uses 10.128.0.0. Can you help me understand what is going on here? Is the "IP Over Torus" info on the ZeptoOS wiki outdated? Or does it describe a different technique?
>
>Justin, have you also mastered similar techniques for JETS? Do we need help from the ZeptoOS team on this?
> >Thanks, > >- Mike > > > >----- Original Message ----- >> From: "ZHAO ZHANG" >> To: "Michael Wilde" >> Cc: "Emalayan Vairavanathan" , swift-devel at ci.uchicago.edu >> Sent: Sunday, March 4, 2012 1:17:18 PM >> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor >> Yes, each compute node needs to run this script to bring up the >> network >> interface. >> >> zhao >> >> On 3/4/2012 12:53 PM, Michael Wilde wrote: >> > Thanks, Zhao. Does this need to run on each node at startup? >> > >> > If so should this logic be integrated into the worker startup >> > script, Jon, Justin, Emalayan? >> > >> > Ive not looked at the current scripts much; I would think that all >> > the BG/P specific logic of enabling the torus network and finding >> > each node's IP address on the torus should be done in the init >> > script rather than in the worker. >> > >> > - Mike >> > >> > ----- Original Message ----- >> >> From: "ZHAO ZHANG" >> >> To: "Michael Wilde" >> >> Cc: "Emalayan Vairavanathan", >> >> swift-devel at ci.uchicago.edu >> >> Sent: Sunday, March 4, 2012 12:18:28 PM >> >> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in >> >> Surveyor >> >> Hi, Mike >> >> >> >> With 192.168.1.*, we could only access the tree network. In order >> >> to >> >> use >> >> the torus network, we need to use the 12.x.y.z+1 ip address. (x, y, >> >> z >> >> here is the coordinates of the compute nodes). >> >> The code below could bring the torus ip address up. 
>> >> >> >> IP="" >> >> set_torus_ip() >> >> { >> >> x=$1 >> >> y=$2 >> >> z=$3 >> >> z=`expr $3 + 1` >> >> ifconfig eth1 12.$x.$y.$z netmask 255.0.0.0 mtu 8996 -arp >> >> IP=12.$x.$y.$z >> >> } >> >> BG_PSETORG=`cat /proc/personality.sh | grep BG_PSETORG | cut -d '"' >> >> -f >> >> 2` >> >> echo ${BG_PSETORG}>> /dev/shm/localip >> >> set_torus_ip $BG_PSETORG >> >> >> >> best >> >> zhao >> >> >> >> On 3/4/2012 10:24 AM, Michael Wilde wrote: >> >>> Zhao, >> >>> >> >>> Can you tell us if the nodes on the torus network are accessed >> >>> over >> >>> the 192.168 network? I just realized they cant all be on the >> >>> 192.168.1 subnet, so I hope I suggested the right network here. >> >>> >> >>> Thanks, >> >>> >> >>> - Mike >> >>> >> >>> ----- Original Message ----- >> >>>> From: "Emalayan Vairavanathan" >> >>>> To: swift-devel at ci.uchicago.edu >> >>>> Sent: Sunday, March 4, 2012 1:40:53 AM >> >>>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in >> >>>> Surveyor >> >>>> Thank you very much Mike. I agree with you suggestion. I can do >> >>>> that >> >>>> in worker.pl. >> >>>> >> >>>> >> >>>> Thank you >> >>>> Emalayan >> >>>> >> >>>> >> >>>> >> >>>> >> >>>> >> >>>> >> >>>> From: Michael Wilde >> >>>> To: emalayan at ece.ubc.ca >> >>>> Cc: swift-devel >> >>>> Sent: Saturday, 3 March 2012 7:39 PM >> >>>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in >> >>>> Surveyor >> >>>> >> >>>> Emalayan, >> >>>> >> >>>> I wasnt paying much attention to the actual IP address returned >> >>>> by >> >>>> hostname in the zeptoos profile. >> >>>> >> >>>> Since these are the addresses that Mosa will communicate over, I >> >>>> think >> >>>> you *do* want them to be the 192.168.1.* addresses of the nodes >> >>>> on >> >>>> the >> >>>> torus network (in other words tun0). >> >>>> >> >>>> So, since both profiles return 192.168.1.64 for the tun0 IP, I >> >>>> think >> >>>> thats what you should use. 
So try replacing `hostname` in >> >>>> worker.pl >> >>>> with something like: >> >>>> >> >>>> `ifconfig | grep 192.168 | sed -e 's/^inet addr://' -e 's/ .*//'` >> >>>> >> >>>> You may have to adapt this a bit to meet your needs. Im assuming >> >>>> that >> >>>> the only code that will uses these IPs is MosaStore. >> >>>> >> >>>> - Mike >> >>>> >> >>>> >> >>>> ----- Original Message ----- >> >>>>> From: "Kazutomo Yoshii"< kazutomo at mcs.anl.gov> >> >>>>> To: zeptoos at lists.mcs.anl.gov >> >>>>> Sent: Saturday, March 3, 2012 8:52:00 PM >> >>>>> Subject: Re: [ZeptoOS] hostname returns none in Surveyor >> >>>>> Hi Emalayan, >> >>>>> >> >>>>> The zeptoos profile returns the IP address of associated I/O >> >>>>> node, >> >>>>> which is kind of wrong in my opinion (influence of IBM CNK). >> >>>>> ifconfig on compute nodes returns CN's IP address, which is >> >>>>> correct. >> >>>>> e.g. tun0 192.168.1.64 >> >>>>> >> >>>>> If you want to find associated ION's IP address from CNs, >> >>>>> do something like this. >> >>>>> >> >>>>> $ grep BG_IP= /proc/personality.sh >> >>>>> >> >>>>> - kaz >> >>>>> >> >>>>> On 03/03/2012 08:25 PM, Emalayan Vairavanathan wrote: >> >>>>>> Hi All, >> >>>>>> >> >>>>>> I am trying to run some experiments in Surveyor. The software I >> >>>>>> am >> >>>>>> using >> >>>>>> gets the IP-address of compute-nodes using hostname command. >> >>>>>> >> >>>>>> With zepto-vn-eval/mosatest profile hostname command returns >> >>>>>> none. >> >>>>>> But with zeptoos profile hostname returns the correct IP >> >>>>>> address. >> >>>>>> >> >>>>>> Is this due to some configuration issues in >> >>>>>> zepto-vn-eval/mosatest >> >>>>>> profile?As a workaround I tired to use ifconfig with both >> >>>>>> profiles, >> >>>>>> but >> >>>>>> it seems ifconfig is not returning the correct IP address. >> >>>>>> >> >>>>>> Is there any command / files which I can used to retrieve the >> >>>>>> hostname >> >>>>>> on compute nodes? 
I have pasted the console output with both >> >>>>>> profiles >> >>>>>> below. Please let me know if you need more details. >> >>>>>> >> >>>>>> Thank you >> >>>>>> Emalayan >> >>>>>> >> >>>>>> >> >>>>>> =======================With zeptoos profile >> >>>>>> =============================== >> >>>>>> >> >>>>>> / # hostname >> >>>>>> 172.18.3.19 >> >>>>>> / # >> >>>>>> / # cat /proc/sys/kernel/hostname >> >>>>>> 172.18.3.19 >> >>>>>> / # >> >>>>>> / # >> >>>>>> / # ifconfig -a >> >>>>>> lo Link encap:Local Loopback >> >>>>>> inet addr:127.0.0.1 Mask:255.0.0.0 >> >>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1 >> >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >> >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >> >>>>>> collisions:0 txqueuelen:0 >> >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >> >>>>>> >> >>>>>> tun0 Link encap:UNSPEC HWaddr >> >>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 >> >>>>>> inet addr:192.168.1.64 P-t-P:192.168.1.254 Mask:255.255.255.255 >> >>>>>> UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1 >> >>>>>> RX packets:2662 errors:0 dropped:0 overruns:0 frame:0 >> >>>>>> TX packets:1772 errors:0 dropped:0 overruns:0 carrier:0 >> >>>>>> collisions:0 txqueuelen:500 >> >>>>>> RX bytes:140206 (136.9 KiB) TX bytes:125412 (122.4 KiB) >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> =======================With zepto-vn-eval/mosatest profile >> >>>>>> =============================== >> >>>>>> >> >>>>>> /etc # hostname >> >>>>>> (none) >> >>>>>> /etc # >> >>>>>> /etc # cat /proc/sys/kernel/hostname >> >>>>>> (none) >> >>>>>> /etc # >> >>>>>> /etc # ifconfig -a >> >>>>>> eth0 Link encap:Ethernet HWaddr 00:80:46:00:00:00 >> >>>>>> BROADCAST MULTICAST MTU:1500 Metric:1 >> >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >> >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >> >>>>>> collisions:0 txqueuelen:1000 >> >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >> >>>>>> >> >>>>>> eth1 Link encap:Ethernet HWaddr 
00:80:47:00:00:00 >> >>>>>> BROADCAST MULTICAST MTU:1500 Metric:1 >> >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >> >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >> >>>>>> collisions:0 txqueuelen:1000 >> >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >> >>>>>> >> >>>>>> lo Link encap:Local Loopback >> >>>>>> inet addr:127.0.0.1 Mask:255.0.0.0 >> >>>>>> inet6 addr: ::1/128 Scope:Host >> >>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1 >> >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >> >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >> >>>>>> collisions:0 txqueuelen:0 >> >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >> >>>>>> >> >>>>>> sit0 Link encap:IPv6-in-IPv4 >> >>>>>> NOARP MTU:1480 Metric:1 >> >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >> >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >> >>>>>> collisions:0 txqueuelen:0 >> >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >> >>>>>> >> >>>>>> tun0 Link encap:UNSPEC HWaddr >> >>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 >> >>>>>> inet addr:192.168.1.64 P-t-P:192.168.1.254 Mask:255.255.255.255 >> >>>>>> UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1 >> >>>>>> RX packets:965 errors:0 dropped:0 overruns:0 frame:0 >> >>>>>> TX packets:627 errors:0 dropped:0 overruns:0 carrier:0 >> >>>>>> collisions:0 txqueuelen:500 >> >>>>>> RX bytes:50984 (49.7 KiB) TX bytes:50530 (49.3 KiB) >> >>>>>> >> >>>> -- >> >>>> Michael Wilde >> >>>> Computation Institute, University of Chicago >> >>>> Mathematics and Computer Science Division >> >>>> Argonne National Laboratory >> >>>> >> >>>> _______________________________________________ >> >>>> Swift-devel mailing list >> >>>> Swift-devel at ci.uchicago.edu >> >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >>>> >> >>>> >> >>>> >> >>>> _______________________________________________ >> >>>> Swift-devel mailing list >> >>>> Swift-devel at ci.uchicago.edu >> >>>> 
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
>--
>Michael Wilde
>Computation Institute, University of Chicago
>Mathematics and Computer Science Division
>Argonne National Laboratory
>

From zhaozhang at uchicago.edu Sun Mar 4 18:28:08 2012
From: zhaozhang at uchicago.edu (ZHAO ZHANG)
Date: Sun, 04 Mar 2012 18:28:08 -0600
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <1330906988.77015.YahooMailNeo@web39502.mail.mud.yahoo.com>
References: <4F53BFBE.3020205@uchicago.edu> <888718346.59715.1330900432048.JavaMail.root@zimbra.anl.gov> <1330905648.31137.YahooMailNeo@web39508.mail.mud.yahoo.com> <4F540390.8010505@uchicago.edu> <1330906988.77015.YahooMailNeo@web39502.mail.mud.yahoo.com>
Message-ID: <4F540898.5060402@uchicago.edu>

Hi,

I am attaching one personality.sh file from a 4096 CN allocation. As you can see from it, there are a couple of ways to calculate the allocation size:
1) BG_BLOCKID="ANL-R10-R13-4096"
2) Multiplication of BG_XSIZE, BG_YSIZE, BG_ZSIZE

I am not quite sure how this will work on an allocation that is smaller than 64 nodes; you could simply check it.

best
zhao

BG_UCI=68801700
BG_LOCATION=R10-M0-N00-J23
BG_MAC=00:00:00:00:00:00
BG_IP=0.0.0.0
BG_NETMASK=255.255.255.112
BG_BROADCAST=0.0.0.0
BG_GATEWAY=0.0.0.0
BG_MTU=1536
BG_FS=0.0.0.0
BG_EXPORTDIR=""
BG_SIMULATION=0
BG_PSETNUM=0
BG_NUMPSETS=64
BG_NODESINPSET=64
BG_XSIZE=8
BG_YSIZE=16
BG_ZSIZE=32
BG_VERBOSE=0
BG_PSETSIZE="4 4 4"
BG_PSETORG="0 0 0"
BG_CLOCKHZ=850
BG_GLINTS=1
BG_ISTORUS=""
BG_BLOCKID="ANL-R10-R13-4096"
BG_SN=0.0.0.0
BG_IS_IO_NODE=0
BG_RANK_IN_PSET=64
BG_RANK=0
BG_IP_OVER_COL=0
BG_IP_OVER_TOR=0
BG_IP_OVER_COL_VC=0
BG_CIO_MODE=FULL
BG_BGSYS_FS_TYPE=NFSv3
BG_HTC_MODE=0

On 3/4/2012 6:23 PM, Emalayan Vairavanathan wrote:
> Zhao, Thank you very much for the answers.
>
> One more question: :)
>
> Do you know how I can calculate this?
Does this method works > regardless of the number of nodes allocated (even with any fraction of > pset ) ? > > Thank you > Emalayan > > ------------------------------------------------------------------------ > *From:* ZHAO ZHANG > *To:* Emalayan Vairavanathan > *Cc:* Michael Wilde ; Justin M Wozniak > ; "swift-devel at ci.uchicago.edu" > > *Sent:* Sunday, 4 March 2012 4:06 PM > *Subject:* Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor > > Hi, Emalayan > > On 3/4/2012 6:00 PM, Emalayan Vairavanathan wrote: >> Mike, that sounds like a good idea. >> >> Zhao, In addition to Mike's questions I have two more questions. >> >> 1) Is it possible to get/ calculate the MAX_RANK / number of nodes in >> an allocation from persoanlity.h or some other data structure ? > Yes, you could calculate the MAX_RANK from personality.sh. >> >> 2) Which interface should be configured to have Tours ? (Does this >> matter at all ?) >> In your scripts you are configuring eth1. But in >> http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages tun1 is >> configured. > To use the torus network, there are two ways. One is to use the > 12.x.y.z+1 interface, which we have to configure ourselves. > The other way is to use the "ipfwd.sh", aka the 10.128 interface. The > drawback of the second interface is it takes up one core > for polling, and there is some scalability issue beyond 2K compute > nodes as far as I remember. Mosa could use either of them. > > best > zhao >> >> Thank you >> Emalayan >> >> ------------------------------------------------------------------------ >> *From:* Michael Wilde >> *To:* ZHAO ZHANG >> ; Justin M Wozniak >> >> *Cc:* Emalayan Vairavanathan >> ; swift-devel at ci.uchicago.edu >> >> *Sent:* Sunday, 4 March 2012 2:33 PM >> *Subject:* Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor >> >> Zhao, with this procedure do you get consecutive host IP addresses >> starting from 0.0 through 640*64 in the two low order octets? 
>> >> In other words, does your just do what this page describes under "IP >> over Torus": >> >> http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages >> >> Is the "ipfwd.sh" script mentioned there still needed, or does that >> now happen automatically? >> >> If so, perhaps we can greatly simplify the Mosa startup: we need only >> pass the max rank of the running job, and Mosa will know that it can >> use 12.128.0.0 for example. Then we dont need any manual >> intervention, nor complicated/brittle file-waiting logic. >> >> Zhao, I dont understand why your example is using the 12.0.0.0 >> network vs the example on the page above which uses 10.128.0.0. Can >> you help me understand what is going on here? Is the "IP Over Torus" >> info on the ZeptoOS wiki outdated? Or does it describe a different >> technique? >> >> Justin, have you also mastered similar techniques for JETS? Do we >> need help form the ZeptoOS team on this? >> >> Thanks, >> >> - Mike >> >> >> >> ----- Original Message ----- >> > From: "ZHAO ZHANG" > > >> > To: "Michael Wilde" > >> > Cc: "Emalayan Vairavanathan" > >, swift-devel at ci.uchicago.edu >> >> > Sent: Sunday, March 4, 2012 1:17:18 PM >> > Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor >> > Yes, each compute node needs to run this script to bring up the >> > network >> > interface. >> > >> > zhao >> > >> > On 3/4/2012 12:53 PM, Michael Wilde wrote: >> > > Thanks, Zhao. Does this need to run on each node at startup? >> > > >> > > If so should this logic be integrated into the worker startup >> > > script, Jon, Justin, Emalayan? >> > > >> > > Ive not looked at the current scripts much; I would think that all >> > > the BG/P specific logic of enabling the torus network and finding >> > > each node's IP address on the torus should be done in the init >> > > script rather than in the worker. 
>> > > >> > > - Mike >> > > >> > > ----- Original Message ----- >> > >> From: "ZHAO ZHANG"> > >> > >> To: "Michael Wilde"> >> > >> Cc: "Emalayan Vairavanathan"> >, >> > >> swift-devel at ci.uchicago.edu >> > >> Sent: Sunday, March 4, 2012 12:18:28 PM >> > >> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in >> > >> Surveyor >> > >> Hi, Mike >> > >> >> > >> With 192.168.1.*, we could only access the tree network. In order >> > >> to >> > >> use >> > >> the torus network, we need to use the 12.x.y.z+1 ip address. (x, y, >> > >> z >> > >> here is the coordinates of the compute nodes). >> > >> The code below could bring the torus ip address up. >> > >> >> > >> IP="" >> > >> set_torus_ip() >> > >> { >> > >> x=$1 >> > >> y=$2 >> > >> z=$3 >> > >> z=`expr $3 + 1` >> > >> ifconfig eth1 12.$x.$y.$z netmask 255.0.0.0 mtu 8996 -arp >> > >> IP=12.$x.$y.$z >> > >> } >> > >> BG_PSETORG=`cat /proc/personality.sh | grep BG_PSETORG | cut -d '"' >> > >> -f >> > >> 2` >> > >> echo ${BG_PSETORG}>> /dev/shm/localip >> > >> set_torus_ip $BG_PSETORG >> > >> >> > >> best >> > >> zhao >> > >> >> > >> On 3/4/2012 10:24 AM, Michael Wilde wrote: >> > >>> Zhao, >> > >>> >> > >>> Can you tell us if the nodes on the torus network are accessed >> > >>> over >> > >>> the 192.168 network? I just realized they cant all be on the >> > >>> 192.168.1 subnet, so I hope I suggested the right network here. >> > >>> >> > >>> Thanks, >> > >>> >> > >>> - Mike >> > >>> >> > >>> ----- Original Message ----- >> > >>>> From: "Emalayan Vairavanathan"> > >> > >>>> To: swift-devel at ci.uchicago.edu >> >> > >>>> Sent: Sunday, March 4, 2012 1:40:53 AM >> > >>>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in >> > >>>> Surveyor >> > >>>> Thank you very much Mike. I agree with you suggestion. I can do >> > >>>> that >> > >>>> in worker.pl. 
>> > >>>> >> > >>>> >> > >>>> Thank you >> > >>>> Emalayan >> > >>>> >> > >>>> >> > >>>> >> > >>>> >> > >>>> >> > >>>> >> > >>>> From: Michael Wilde> >> > >>>> To: emalayan at ece.ubc.ca >> > >>>> Cc: swift-devel> > >> > >>>> Sent: Saturday, 3 March 2012 7:39 PM >> > >>>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in >> > >>>> Surveyor >> > >>>> >> > >>>> Emalayan, >> > >>>> >> > >>>> I wasnt paying much attention to the actual IP address returned >> > >>>> by >> > >>>> hostname in the zeptoos profile. >> > >>>> >> > >>>> Since these are the addresses that Mosa will communicate over, I >> > >>>> think >> > >>>> you *do* want them to be the 192.168.1.* addresses of the nodes >> > >>>> on >> > >>>> the >> > >>>> torus network (in other words tun0). >> > >>>> >> > >>>> So, since both profiles return 192.168.1.64 for the tun0 IP, I >> > >>>> think >> > >>>> thats what you should use. So try replacing `hostname` in >> > >>>> worker.pl >> > >>>> with something like: >> > >>>> >> > >>>> `ifconfig | grep 192.168 | sed -e 's/^inet addr://' -e 's/ .*//'` >> > >>>> >> > >>>> You may have to adapt this a bit to meet your needs. Im assuming >> > >>>> that >> > >>>> the only code that will uses these IPs is MosaStore. >> > >>>> >> > >>>> - Mike >> > >>>> >> > >>>> >> > >>>> ----- Original Message ----- >> > >>>>> From: "Kazutomo Yoshii"< kazutomo at mcs.anl.gov >> > >> > >>>>> To: zeptoos at lists.mcs.anl.gov >> > >>>>> Sent: Saturday, March 3, 2012 8:52:00 PM >> > >>>>> Subject: Re: [ZeptoOS] hostname returns none in Surveyor >> > >>>>> Hi Emalayan, >> > >>>>> >> > >>>>> The zeptoos profile returns the IP address of associated I/O >> > >>>>> node, >> > >>>>> which is kind of wrong in my opinion (influence of IBM CNK). >> > >>>>> ifconfig on compute nodes returns CN's IP address, which is >> > >>>>> correct. >> > >>>>> e.g. tun0 192.168.1.64 >> > >>>>> >> > >>>>> If you want to find associated ION's IP address from CNs, >> > >>>>> do something like this. 
>> > >>>>> >> > >>>>> $ grep BG_IP= /proc/personality.sh >> > >>>>> >> > >>>>> - kaz >> > >>>>> >> > >>>>> On 03/03/2012 08:25 PM, Emalayan Vairavanathan wrote: >> > >>>>>> Hi All, >> > >>>>>> >> > >>>>>> I am trying to run some experiments in Surveyor. The software I >> > >>>>>> am >> > >>>>>> using >> > >>>>>> gets the IP-address of compute-nodes using hostname command. >> > >>>>>> >> > >>>>>> With zepto-vn-eval/mosatest profile hostname command returns >> > >>>>>> none. >> > >>>>>> But with zeptoos profile hostname returns the correct IP >> > >>>>>> address. >> > >>>>>> >> > >>>>>> Is this due to some configuration issues in >> > >>>>>> zepto-vn-eval/mosatest >> > >>>>>> profile?As a workaround I tired to use ifconfig with both >> > >>>>>> profiles, >> > >>>>>> but >> > >>>>>> it seems ifconfig is not returning the correct IP address. >> > >>>>>> >> > >>>>>> Is there any command / files which I can used to retrieve the >> > >>>>>> hostname >> > >>>>>> on compute nodes? I have pasted the console output with both >> > >>>>>> profiles >> > >>>>>> below. Please let me know if you need more details. 
>> > >>>>>> >> > >>>>>> Thank you >> > >>>>>> Emalayan >> > >>>>>> >> > >>>>>> >> > >>>>>> =======================With zeptoos profile >> > >>>>>> =============================== >> > >>>>>> >> > >>>>>> / # hostname >> > >>>>>> 172.18.3.19 >> > >>>>>> / # >> > >>>>>> / # cat /proc/sys/kernel/hostname >> > >>>>>> 172.18.3.19 >> > >>>>>> / # >> > >>>>>> / # >> > >>>>>> / # ifconfig -a >> > >>>>>> lo Link encap:Local Loopback >> > >>>>>> inet addr:127.0.0.1 Mask:255.0.0.0 >> > >>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1 >> > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >> > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >> > >>>>>> collisions:0 txqueuelen:0 >> > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >> > >>>>>> >> > >>>>>> tun0 Link encap:UNSPEC HWaddr >> > >>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 >> > >>>>>> inet addr:192.168.1.64 P-t-P:192.168.1.254 Mask:255.255.255.255 >> > >>>>>> UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1 >> > >>>>>> RX packets:2662 errors:0 dropped:0 overruns:0 frame:0 >> > >>>>>> TX packets:1772 errors:0 dropped:0 overruns:0 carrier:0 >> > >>>>>> collisions:0 txqueuelen:500 >> > >>>>>> RX bytes:140206 (136.9 KiB) TX bytes:125412 (122.4 KiB) >> > >>>>>> >> > >>>>>> >> > >>>>>> >> > >>>>>> =======================With zepto-vn-eval/mosatest profile >> > >>>>>> =============================== >> > >>>>>> >> > >>>>>> /etc # hostname >> > >>>>>> (none) >> > >>>>>> /etc # >> > >>>>>> /etc # cat /proc/sys/kernel/hostname >> > >>>>>> (none) >> > >>>>>> /etc # >> > >>>>>> /etc # ifconfig -a >> > >>>>>> eth0 Link encap:Ethernet HWaddr 00:80:46:00:00:00 >> > >>>>>> BROADCAST MULTICAST MTU:1500 Metric:1 >> > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >> > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >> > >>>>>> collisions:0 txqueuelen:1000 >> > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >> > >>>>>> >> > >>>>>> eth1 Link encap:Ethernet HWaddr 00:80:47:00:00:00 >> > 
>>>>>> BROADCAST MULTICAST MTU:1500 Metric:1 >> > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >> > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >> > >>>>>> collisions:0 txqueuelen:1000 >> > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >> > >>>>>> >> > >>>>>> lo Link encap:Local Loopback >> > >>>>>> inet addr:127.0.0.1 Mask:255.0.0.0 >> > >>>>>> inet6 addr: ::1/128 Scope:Host >> > >>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1 >> > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >> > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >> > >>>>>> collisions:0 txqueuelen:0 >> > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >> > >>>>>> >> > >>>>>> sit0 Link encap:IPv6-in-IPv4 >> > >>>>>> NOARP MTU:1480 Metric:1 >> > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >> > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >> > >>>>>> collisions:0 txqueuelen:0 >> > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >> > >>>>>> >> > >>>>>> tun0 Link encap:UNSPEC HWaddr >> > >>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 >> > >>>>>> inet addr:192.168.1.64 P-t-P:192.168.1.254 Mask:255.255.255.255 >> > >>>>>> UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1 >> > >>>>>> RX packets:965 errors:0 dropped:0 overruns:0 frame:0 >> > >>>>>> TX packets:627 errors:0 dropped:0 overruns:0 carrier:0 >> > >>>>>> collisions:0 txqueuelen:500 >> > >>>>>> RX bytes:50984 (49.7 KiB) TX bytes:50530 (49.3 KiB) >> > >>>>>> >> > >>>> -- >> > >>>> Michael Wilde >> > >>>> Computation Institute, University of Chicago >> > >>>> Mathematics and Computer Science Division >> > >>>> Argonne National Laboratory >> > >>>> >> > >>>> _______________________________________________ >> > >>>> Swift-devel mailing list >> > >>>> Swift-devel at ci.uchicago.edu >> > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> > >>>> >> > >>>> >> > >>>> >> > >>>> _______________________________________________ >> > >>>> Swift-devel 
mailing list
>> > >>>> Swift-devel at ci.uchicago.edu
>> > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>>
>> --
>> Michael Wilde
>> Computation Institute, University of Chicago
>> Mathematics and Computer Science Division
>> Argonne National Laboratory
>>

From svemalayan at yahoo.com Sun Mar 4 18:54:25 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Sun, 4 Mar 2012 16:54:25 -0800 (PST)
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <4F540898.5060402@uchicago.edu>
References: <4F53BFBE.3020205@uchicago.edu> <888718346.59715.1330900432048.JavaMail.root@zimbra.anl.gov> <1330905648.31137.YahooMailNeo@web39508.mail.mud.yahoo.com> <4F540390.8010505@uchicago.edu> <1330906988.77015.YahooMailNeo@web39502.mail.mud.yahoo.com> <4F540898.5060402@uchicago.edu>
Message-ID: <1330908865.58128.YahooMailNeo@web39508.mail.mud.yahoo.com>

Zhao,

Thank you. I think this is not going to work for allocations that use only a fraction of a pset. Please see below (I have pasted the personality.sh from a 2-node allocation).

Maybe we need to get it from other sources. Or maybe Swift can pass this information to the workers once the connection has been established.
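
For reference, the two size calculations Zhao suggested can be sketched in shell roughly as follows. This is only an illustration, not code from Swift or ZeptoOS: the helper names (personality_get, size_from_dims, size_from_blockid) and the optional file argument are hypothetical, and it assumes personality.sh holds plain KEY=VALUE lines like the dumps in this thread.

```shell
#!/bin/sh
# Hypothetical helpers; names and the optional file argument are illustrative only.

# Print the value of a key from a KEY=VALUE file, with double quotes removed.
# $1 = key, $2 = file (assumed default: /proc/personality.sh)
personality_get() {
    grep "^$1=" "${2:-/proc/personality.sh}" | head -n 1 | cut -d '=' -f 2- | tr -d '"'
}

# Method 2 in Zhao's mail: product of BG_XSIZE, BG_YSIZE and BG_ZSIZE.
size_from_dims() {
    x=$(personality_get BG_XSIZE "$1")
    y=$(personality_get BG_YSIZE "$1")
    z=$(personality_get BG_ZSIZE "$1")
    echo $((x * y * z))
}

# Method 1 in Zhao's mail: the number after the last '-' in BG_BLOCKID.
size_from_blockid() {
    personality_get BG_BLOCKID "$1" | sed 's/.*-//'
}
```

With the values in the personality.sh pasted below (BG_XSIZE=4, BG_YSIZE=4, BG_ZSIZE=4 and BG_BLOCKID="ANL-R00-M0-N08-64"), both helpers would print 64, so neither reflects an allocation that uses only part of a pset.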
Regards
Emalayan

BG_UCI=68021700
BG_LOCATION=R00-M0-N08-J23
BG_MAC=12:01:13:80:00:00
BG_IP=172.18.1.19
BG_NETMASK=255.240.0.0
BG_BROADCAST=172.32.255.255
BG_GATEWAY=172.16.3.5
BG_MTU=9000
BG_FS=172.17.3.1
BG_EXPORTDIR="/bgsys"
BG_SIMULATION=0
BG_PSETNUM=0
BG_NUMPSETS=1
BG_NODESINPSET=64
BG_XSIZE=4
BG_YSIZE=4
BG_ZSIZE=4
BG_VERBOSE=0
BG_PSETSIZE="4 4 4"
BG_PSETORG="0 0 0"
BG_CLOCKHZ=850
BG_GLINTS=1
BG_ISTORUS="XYZ"
BG_BLOCKID="ANL-R00-M0-N08-64"
BG_SN=172.17.3.1
BG_IS_IO_NODE=0
BG_RANK_IN_PSET=64
BG_RANK=0
BG_IP_OVER_COL=0
BG_IP_OVER_TOR=0
BG_IP_OVER_COL_VC=0
BG_CIO_MODE=FULL
BG_BGSYS_FS_TYPE=NFSv3
BG_HTC_MODE=0

________________________________
From: ZHAO ZHANG
To: Emalayan Vairavanathan
Cc: Michael Wilde ; Justin M Wozniak ; "swift-devel at ci.uchicago.edu"
Sent: Sunday, 4 March 2012 4:28 PM
Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor

Hi,

I am attaching one personality.sh file from a 4096 CN allocation. As you can see from it, there are a couple of ways to calculate the allocation size:
1) BG_BLOCKID="ANL-R10-R13-4096"
2) Multiplication of BG_XSIZE, BG_YSIZE, BG_ZSIZE

I am not quite sure how this will work on an allocation that is smaller than 64 nodes; you could simply check it.

best
zhao

BG_UCI=68801700
BG_LOCATION=R10-M0-N00-J23
BG_MAC=00:00:00:00:00:00
BG_IP=0.0.0.0
BG_NETMASK=255.255.255.112
BG_BROADCAST=0.0.0.0
BG_GATEWAY=0.0.0.0
BG_MTU=1536
BG_FS=0.0.0.0
BG_EXPORTDIR=""
BG_SIMULATION=0
BG_PSETNUM=0
BG_NUMPSETS=64
BG_NODESINPSET=64
BG_XSIZE=8
BG_YSIZE=16
BG_ZSIZE=32
BG_VERBOSE=0
BG_PSETSIZE="4 4 4"
BG_PSETORG="0 0 0"
BG_CLOCKHZ=850
BG_GLINTS=1
BG_ISTORUS=""
BG_BLOCKID="ANL-R10-R13-4096"
BG_SN=0.0.0.0
BG_IS_IO_NODE=0
BG_RANK_IN_PSET=64
BG_RANK=0
BG_IP_OVER_COL=0
BG_IP_OVER_TOR=0
BG_IP_OVER_COL_VC=0
BG_CIO_MODE=FULL
BG_BGSYS_FS_TYPE=NFSv3
BG_HTC_MODE=0

On 3/4/2012 6:23 PM, Emalayan Vairavanathan wrote:
>Zhao, Thank you very much for the answers.
>
>One more question: :)
>
>Do you know how I can calculate this?
Does this method works regardless of the number of nodes allocated (even with any fraction of pset ) ? > > >Thank you >Emalayan > > > >________________________________ > From: ZHAO ZHANG >To: Emalayan Vairavanathan >Cc: Michael Wilde ; Justin M Wozniak ; "swift-devel at ci.uchicago.edu" >Sent: Sunday, 4 March 2012 4:06 PM >Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor > > >Hi, Emalayan > >On 3/4/2012 6:00 PM, Emalayan Vairavanathan wrote: >Mike, that sounds like a good idea. >> >> >>Zhao, In addition to Mike's questions I have two more questions. >> >> >>1) Is it possible to get/ calculate the MAX_RANK / number of nodes in an allocation from persoanlity.h or some other data structure? ? Yes, you could calculate the MAX_RANK from personality.sh. > > >> >>2) Which interface should be configured to have Tours ? (Does this matter at all ?) >>??? In your scripts you are configuring eth1. But in http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages tun1 is configured. To use the torus network, there are two ways. One is to use the 12.x.y.z+1 interface, which we have to configure ourselves. >The other way is to use the "ipfwd.sh", aka the 10.128 interface. The drawback of the second interface is it takes up one core >for polling, and there is some scalability issue beyond 2K compute nodes as far as I remember. Mosa could use either of them. > >best >zhao > > >> >>Thank you >>Emalayan >> >> >> >> >>________________________________ >> From: Michael Wilde >>To: ZHAO ZHANG ; Justin M Wozniak >>Cc: Emalayan Vairavanathan ; swift-devel at ci.uchicago.edu >>Sent: Sunday, 4 March 2012 2:33 PM >>Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor >> >>Zhao, with this procedure do you get consecutive host IP addresses starting from 0.0 through 640*64 in the two low order octets? >> >>In other words, does your? just do what this page describes under "IP over Torus": >> >>? 
>>http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages
>>
>>Is the "ipfwd.sh" script mentioned there still needed, or does that now happen automatically?
>>
>>If so, perhaps we can greatly simplify the Mosa startup: we need only pass the max rank of the running job, and Mosa will know that it can use 12.128.0.0 for example. Then we don't need any manual intervention, nor complicated/brittle file-waiting logic.
>>
>>Zhao, I don't understand why your example is using the 12.0.0.0 network vs the example on the page above which uses 10.128.0.0. Can you help me understand what is going on here? Is the "IP Over Torus" info on the ZeptoOS wiki outdated? Or does it describe a different technique?
>>
>>Justin, have you also mastered similar techniques for JETS? Do we need help from the ZeptoOS team on this?
>>
>>Thanks,
>>
>>- Mike
>>
>>
>>
>>----- Original Message -----
>>> From: "ZHAO ZHANG"
>>> To: "Michael Wilde"
>>> Cc: "Emalayan Vairavanathan" , swift-devel at ci.uchicago.edu
>>> Sent: Sunday, March 4, 2012 1:17:18 PM
>>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
>>> Yes, each compute node needs to run this script to bring up the
>>> network
>>> interface.
>>>
>>> zhao
>>>
>>> On 3/4/2012 12:53 PM, Michael Wilde wrote:
>>> > Thanks, Zhao. Does this need to run on each node at startup?
>>> >
>>> > If so should this logic be integrated into the worker startup
>>> > script, Jon, Justin, Emalayan?
>>> >
>>> > I've not looked at the current scripts much; I would think that all
>>> > the BG/P specific logic of enabling the torus network and finding
>>> > each node's IP address on the torus should be done in the init
>>> > script rather than in the worker.
>>> > >>> > - Mike >>> > >>> > ----- Original Message ----- >>> >> From: "ZHAO ZHANG" >>> >> To: "Michael Wilde" >>> >> Cc: "Emalayan Vairavanathan", >>> >> swift-devel at ci.uchicago.edu >>> >> Sent: Sunday, March 4, 2012 12:18:28 PM >>> >> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in >>> >> Surveyor >>> >> Hi, Mike >>> >> >>> >> With 192.168.1.*, we could only access the tree network. In order >>> >> to >>> >> use >>> >> the torus network, we need to use the 12.x.y.z+1 ip address. (x, y, >>> >> z >>> >> here is the coordinates of the compute nodes). >>> >> The code below could bring the torus ip address up. >>> >> >>> >> IP="" >>> >> set_torus_ip() >>> >> { >>> >> x=$1 >>> >> y=$2 >>> >> z=$3 >>> >> z=`expr $3 + 1` >>> >> ifconfig eth1 12.$x.$y.$z netmask 255.0.0.0 mtu 8996 -arp >>> >> IP=12.$x.$y.$z >>> >> } >>> >> BG_PSETORG=`cat /proc/personality.sh | grep BG_PSETORG | cut -d '"' >>> >> -f >>> >> 2` >>> >> echo ${BG_PSETORG}>> /dev/shm/localip >>> >> set_torus_ip $BG_PSETORG >>> >> >>> >> best >>> >> zhao >>> >> >>> >> On 3/4/2012 10:24 AM, Michael Wilde wrote: >>> >>> Zhao, >>> >>> >>> >>> Can you tell us if the nodes on the torus network are accessed >>> >>> over >>> >>> the 192.168 network? I just realized they cant all be on the >>> >>> 192.168.1 subnet, so I hope I suggested the right network here. >>> >>> >>> >>> Thanks, >>> >>> >>> >>> - Mike >>> >>> >>> >>> ----- Original Message ----- >>> >>>> From: "Emalayan Vairavanathan" >>> >>>> To: swift-devel at ci.uchicago.edu >>> >>>> Sent: Sunday, March 4, 2012 1:40:53 AM >>> >>>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in >>> >>>> Surveyor >>> >>>> Thank you very much Mike. I agree with you suggestion. I can do >>> >>>> that >>> >>>> in worker.pl. 
>>> >>>> >>> >>>> >>> >>>> Thank you >>> >>>> Emalayan >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> From: Michael Wilde >>> >>>> To: emalayan at ece.ubc.ca >>> >>>> Cc: swift-devel >>> >>>> Sent: Saturday, 3 March 2012 7:39 PM >>> >>>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in >>> >>>> Surveyor >>> >>>> >>> >>>> Emalayan, >>> >>>> >>> >>>> I wasnt paying much attention to the actual IP address returned >>> >>>> by >>> >>>> hostname in the zeptoos profile. >>> >>>> >>> >>>> Since these are the addresses that Mosa will communicate over, I >>> >>>> think >>> >>>> you *do* want them to be the 192.168.1.* addresses of the nodes >>> >>>> on >>> >>>> the >>> >>>> torus network (in other words tun0). >>> >>>> >>> >>>> So, since both profiles return 192.168.1.64 for the tun0 IP, I >>> >>>> think >>> >>>> thats what you should use. So try replacing `hostname` in >>> >>>> worker.pl >>> >>>> with something like: >>> >>>> >>> >>>> `ifconfig | grep 192.168 | sed -e 's/^inet addr://' -e 's/ .*//'` >>> >>>> >>> >>>> You may have to adapt this a bit to meet your needs. Im assuming >>> >>>> that >>> >>>> the only code that will uses these IPs is MosaStore. >>> >>>> >>> >>>> - Mike >>> >>>> >>> >>>> >>> >>>> ----- Original Message ----- >>> >>>>> From: "Kazutomo Yoshii"< kazutomo at mcs.anl.gov> >>> >>>>> To: zeptoos at lists.mcs.anl.gov >>> >>>>> Sent: Saturday, March 3, 2012 8:52:00 PM >>> >>>>> Subject: Re: [ZeptoOS] hostname returns none in Surveyor >>> >>>>> Hi Emalayan, >>> >>>>> >>> >>>>> The zeptoos profile returns the IP address of associated I/O >>> >>>>> node, >>> >>>>> which is kind of wrong in my opinion (influence of IBM CNK). >>> >>>>> ifconfig on compute nodes returns CN's IP address, which is >>> >>>>> correct. >>> >>>>> e.g. tun0 192.168.1.64 >>> >>>>> >>> >>>>> If you want to find associated ION's IP address from CNs, >>> >>>>> do something like this. 
>>> >>>>> >>> >>>>> $ grep BG_IP= /proc/personality.sh >>> >>>>> >>> >>>>> - kaz >>> >>>>> >>> >>>>> On 03/03/2012 08:25 PM, Emalayan Vairavanathan wrote: >>> >>>>>> Hi All, >>> >>>>>> >>> >>>>>> I am trying to run some experiments in Surveyor. The software I >>> >>>>>> am >>> >>>>>> using >>> >>>>>> gets the IP-address of compute-nodes using hostname command. >>> >>>>>> >>> >>>>>> With zepto-vn-eval/mosatest profile hostname command returns >>> >>>>>> none. >>> >>>>>> But with zeptoos profile hostname returns the correct IP >>> >>>>>> address. >>> >>>>>> >>> >>>>>> Is this due to some configuration issues in >>> >>>>>> zepto-vn-eval/mosatest >>> >>>>>> profile?As a workaround I tired to use ifconfig with both >>> >>>>>> profiles, >>> >>>>>> but >>> >>>>>> it seems ifconfig is not returning the correct IP address. >>> >>>>>> >>> >>>>>> Is there any command / files which I can used to retrieve the >>> >>>>>> hostname >>> >>>>>> on compute nodes? I have pasted the console output with both >>> >>>>>> profiles >>> >>>>>> below. Please let me know if you need more details. 
>>> >>>>>> >>> >>>>>> Thank you >>> >>>>>> Emalayan >>> >>>>>> >>> >>>>>> >>> >>>>>> =======================With zeptoos profile >>> >>>>>> =============================== >>> >>>>>> >>> >>>>>> / # hostname >>> >>>>>> 172.18.3.19 >>> >>>>>> / # >>> >>>>>> / # cat /proc/sys/kernel/hostname >>> >>>>>> 172.18.3.19 >>> >>>>>> / # >>> >>>>>> / # >>> >>>>>> / # ifconfig -a >>> >>>>>> lo Link encap:Local Loopback >>> >>>>>> inet addr:127.0.0.1 Mask:255.0.0.0 >>> >>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1 >>> >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>> >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>> >>>>>> collisions:0 txqueuelen:0 >>> >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>> >>>>>> >>> >>>>>> tun0 Link encap:UNSPEC HWaddr >>> >>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 >>> >>>>>> inet addr:192.168.1.64 P-t-P:192.168.1.254 Mask:255.255.255.255 >>> >>>>>> UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1 >>> >>>>>> RX packets:2662 errors:0 dropped:0 overruns:0 frame:0 >>> >>>>>> TX packets:1772 errors:0 dropped:0 overruns:0 carrier:0 >>> >>>>>> collisions:0 txqueuelen:500 >>> >>>>>> RX bytes:140206 (136.9 KiB) TX bytes:125412 (122.4 KiB) >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> =======================With zepto-vn-eval/mosatest profile >>> >>>>>> =============================== >>> >>>>>> >>> >>>>>> /etc # hostname >>> >>>>>> (none) >>> >>>>>> /etc # >>> >>>>>> /etc # cat /proc/sys/kernel/hostname >>> >>>>>> (none) >>> >>>>>> /etc # >>> >>>>>> /etc # ifconfig -a >>> >>>>>> eth0 Link encap:Ethernet HWaddr 00:80:46:00:00:00 >>> >>>>>> BROADCAST MULTICAST MTU:1500 Metric:1 >>> >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>> >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>> >>>>>> collisions:0 txqueuelen:1000 >>> >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>> >>>>>> >>> >>>>>> eth1 Link encap:Ethernet HWaddr 00:80:47:00:00:00 >>> >>>>>> BROADCAST MULTICAST MTU:1500 Metric:1 >>> 
>>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>> >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>> >>>>>> collisions:0 txqueuelen:1000 >>> >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>> >>>>>> >>> >>>>>> lo Link encap:Local Loopback >>> >>>>>> inet addr:127.0.0.1 Mask:255.0.0.0 >>> >>>>>> inet6 addr: ::1/128 Scope:Host >>> >>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1 >>> >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>> >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>> >>>>>> collisions:0 txqueuelen:0 >>> >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>> >>>>>> >>> >>>>>> sit0 Link encap:IPv6-in-IPv4 >>> >>>>>> NOARP MTU:1480 Metric:1 >>> >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >>> >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >>> >>>>>> collisions:0 txqueuelen:0 >>> >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) >>> >>>>>> >>> >>>>>> tun0 Link encap:UNSPEC HWaddr >>> >>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 >>> >>>>>> inet addr:192.168.1.64 P-t-P:192.168.1.254 Mask:255.255.255.255 >>> >>>>>> UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1 >>> >>>>>> RX packets:965 errors:0 dropped:0 overruns:0 frame:0 >>> >>>>>> TX packets:627 errors:0 dropped:0 overruns:0 carrier:0 >>> >>>>>> collisions:0 txqueuelen:500 >>> >>>>>> RX bytes:50984 (49.7 KiB) TX bytes:50530 (49.3 KiB) >>> >>>>>> >>> >>>> -- >>> >>>> Michael Wilde >>> >>>> Computation Institute, University of Chicago >>> >>>> Mathematics and Computer Science Division >>> >>>> Argonne National Laboratory >>> >>>> >>> >>>> _______________________________________________ >>> >>>> Swift-devel mailing list >>> >>>> Swift-devel at ci.uchicago.edu >>> >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>> >>>> >>> >>>> >>> >>>> >>> >>>> _______________________________________________ >>> >>>> Swift-devel mailing list >>> >>>> Swift-devel at ci.uchicago.edu >>> >>>> 
>>> >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>>
>>--
>>Michael Wilde
>>Computation Institute, University of Chicago
>>Mathematics and Computer Science Division
>>Argonne National Laboratory
>>
>>
>>
>>
>
>

From wilde at mcs.anl.gov Sun Mar 4 19:04:08 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 4 Mar 2012 19:04:08 -0600 (CST)
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <4F540898.5060402@uchicago.edu>
Message-ID: <797845159.59843.1330909448048.JavaMail.root@zimbra.anl.gov>

Thanks, Zhao. I'm guessing that BG_PSETORG is a triplet in the space XSIZE, YSIZE, ZSIZE?

If that's true, then unlike the 10.128 driver, the low three octets do not form a consecutive integer when using Zhao's script, and you need to form the set of IP addresses from:

12.{0..XSIZE-1}.{0..YSIZE-1}.{1..ZSIZE}

And then you could designate one (e.g. 12.0.0.1) to be the master.

I *suspect* that the selection of these IPs is arbitrary, so you *may* be able to use rank as the low order octets.

I think a simple first test is to dump out all the BG_PSETORG values for a few sample job sizes submitted by cqsub. Also do tests to verify that you can ifconfig each interface and ping the others.

Also in answer to your prior question, I also *suspect* that you can name the interface anything (such as tor0) except for the interface name that's already assigned.

- Mike

----- Original Message -----
> From: "ZHAO ZHANG"
> To: "Emalayan Vairavanathan"
> Cc: "Michael Wilde" , "Justin M Wozniak" , swift-devel at ci.uchicago.edu
> Sent: Sunday, March 4, 2012 6:28:08 PM
> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
> Hi,
>
> I am attaching one personality.sh file of a 4096 CN allocation.
From > that you can see, there are couple of ways to calculate the > allocation size: > 1) BG_BLOCKID="ANL-R10-R13-4096" > 2) Multiplication of BG_XSIZE, BG_YSIZE, BG_ZSIZE > > I am not quite sure how it will work on allocation that is smaller > than 64, you could simply give it a check. > > best > zhao > > BG_UCI=68801700 > BG_LOCATION=R10-M0-N00-J23 > BG_MAC=00:00:00:00:00:00 > BG_IP=0.0.0.0 > BG_NETMASK=255.255.255.112 > BG_BROADCAST=0.0.0.0 > BG_GATEWAY=0.0.0.0 > BG_MTU=1536 > BG_FS=0.0.0.0 > BG_EXPORTDIR="" > BG_SIMULATION=0 > BG_PSETNUM=0 > BG_NUMPSETS=64 > BG_NODESINPSET=64 > BG_XSIZE=8 > BG_YSIZE=16 > BG_ZSIZE=32 > BG_VERBOSE=0 > BG_PSETSIZE="4 4 4" > BG_PSETORG="0 0 0" > BG_CLOCKHZ=850 > BG_GLINTS=1 > BG_ISTORUS="" > BG_BLOCKID="ANL-R10-R13-4096" > BG_SN=0.0.0.0 > BG_IS_IO_NODE=0 > BG_RANK_IN_PSET=64 > BG_RANK=0 > BG_IP_OVER_COL=0 > BG_IP_OVER_TOR=0 > BG_IP_OVER_COL_VC=0 > BG_CIO_MODE=FULL > BG_BGSYS_FS_TYPE=NFSv3 > BG_HTC_MODE=0 > > > On 3/4/2012 6:23 PM, Emalayan Vairavanathan wrote: > > > > Zhao, Thank you very much for the answers. > > > One more question: :) > > > Do you know how I can calculate this ? Does this method works > regardless of the number of nodes allocated (even with any fraction of > pset ) ? > > > Thank you > Emalayan > > > > > > From: ZHAO ZHANG > To: Emalayan Vairavanathan > Cc: Michael Wilde ; Justin M Wozniak > ; "swift-devel at ci.uchicago.edu" > > Sent: Sunday, 4 March 2012 4:06 PM > Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor > > > > Hi, Emalayan > > On 3/4/2012 6:00 PM, Emalayan Vairavanathan wrote: > > > > Mike, that sounds like a good idea. > > > Zhao , In addition to Mike's questions I have two more questions. > > > 1) Is it possible to get/ calculate the MAX_RANK / number of nodes in > an allocation from persoanlity.h or some other data structure ? Yes, > you could calculate the MAX_RANK from personality.sh. > > > > > > > 2) Which interface should be configured to have Tours ? 
(Does this > matter at all ?) > In your scripts you are configuring eth1. But in > http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages tun1 is > configured. To use the torus network, there are two ways. One is to > use the 12.x.y.z+1 interface, which we have to configure ourselves. > The other way is to use the "ipfwd.sh", aka the 10.128 interface. The > drawback of the second interface is it takes up one core > for polling, and there is some scalability issue beyond 2K compute > nodes as far as I remember. Mosa could use either of them. > > best > zhao > > > > > > > Thank you > Emalayan > > > > > > > From: Michael Wilde > To: ZHAO ZHANG ; Justin M Wozniak > > Cc: Emalayan Vairavanathan ; > swift-devel at ci.uchicago.edu > Sent: Sunday, 4 March 2012 2:33 PM > Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor > > Zhao, with this procedure do you get consecutive host IP addresses > starting from 0.0 through 640*64 in the two low order octets? > > In other words, does your just do what this page describes under "IP > over Torus": > > http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages > > Is the "ipfwd.sh" script mentioned there still needed, or does that > now happen automatically? > > If so, perhaps we can greatly simplify the Mosa startup: we need only > pass the max rank of the running job, and Mosa will know that it can > use 12.128.0.0 for example. Then we dont need any manual intervention, > nor complicated/brittle file-waiting logic. > > Zhao, I dont understand why your example is using the 12.0.0.0 network > vs the example on the page above which uses 10.128.0.0. Can you help > me understand what is going on here? Is the "IP Over Torus" info on > the ZeptoOS wiki outdated? Or does it describe a different technique? > > Justin, have you also mastered similar techniques for JETS? Do we need > help form the ZeptoOS team on this? 
> > Thanks, > > - Mike > > > > ----- Original Message ----- > > From: "ZHAO ZHANG" < zhaozhang at uchicago.edu > > > To: "Michael Wilde" < wilde at mcs.anl.gov > > > Cc: "Emalayan Vairavanathan" < svemalayan at yahoo.com >, > > swift-devel at ci.uchicago.edu > > Sent: Sunday, March 4, 2012 1:17:18 PM > > Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in > > Surveyor > > Yes, each compute node needs to run this script to bring up the > > network > > interface. > > > > zhao > > > > On 3/4/2012 12:53 PM, Michael Wilde wrote: > > > Thanks, Zhao. Does this need to run on each node at startup? > > > > > > If so should this logic be integrated into the worker startup > > > script, Jon, Justin, Emalayan? > > > > > > Ive not looked at the current scripts much; I would think that all > > > the BG/P specific logic of enabling the torus network and finding > > > each node's IP address on the torus should be done in the init > > > script rather than in the worker. > > > > > > - Mike > > > > > > ----- Original Message ----- > > >> From: "ZHAO ZHANG"< zhaozhang at uchicago.edu > > > >> To: "Michael Wilde"< wilde at mcs.anl.gov > > > >> Cc: "Emalayan Vairavanathan"< svemalayan at yahoo.com >, > > >> swift-devel at ci.uchicago.edu > > >> Sent: Sunday, March 4, 2012 12:18:28 PM > > >> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in > > >> Surveyor > > >> Hi, Mike > > >> > > >> With 192.168.1.*, we could only access the tree network. In order > > >> to > > >> use > > >> the torus network, we need to use the 12.x.y.z+1 ip address. (x, > > >> y, > > >> z > > >> here is the coordinates of the compute nodes). > > >> The code below could bring the torus ip address up. 
> > >> > > >> IP="" > > >> set_torus_ip() > > >> { > > >> x=$1 > > >> y=$2 > > >> z=$3 > > >> z=`expr $3 + 1` > > >> ifconfig eth1 12.$x.$y.$z netmask 255.0.0.0 mtu 8996 -arp > > >> IP=12.$x.$y.$z > > >> } > > >> BG_PSETORG=`cat /proc/personality.sh | grep BG_PSETORG | cut -d > > >> '"' > > >> -f > > >> 2` > > >> echo ${BG_PSETORG}>> /dev/shm/localip > > >> set_torus_ip $BG_PSETORG > > >> > > >> best > > >> zhao > > >> > > >> On 3/4/2012 10:24 AM, Michael Wilde wrote: > > >>> Zhao, > > >>> > > >>> Can you tell us if the nodes on the torus network are accessed > > >>> over > > >>> the 192.168 network? I just realized they cant all be on the > > >>> 192.168.1 subnet, so I hope I suggested the right network here. > > >>> > > >>> Thanks, > > >>> > > >>> - Mike > > >>> > > >>> ----- Original Message ----- > > >>>> From: "Emalayan Vairavanathan"< svemalayan at yahoo.com > > > >>>> To: swift-devel at ci.uchicago.edu > > >>>> Sent: Sunday, March 4, 2012 1:40:53 AM > > >>>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in > > >>>> Surveyor > > >>>> Thank you very much Mike. I agree with you suggestion. I can do > > >>>> that > > >>>> in worker.pl. > > >>>> > > >>>> > > >>>> Thank you > > >>>> Emalayan > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> From: Michael Wilde< wilde at mcs.anl.gov > > > >>>> To: emalayan at ece.ubc.ca > > >>>> Cc: swift-devel< swift-devel at ci.uchicago.edu > > > >>>> Sent: Saturday, 3 March 2012 7:39 PM > > >>>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in > > >>>> Surveyor > > >>>> > > >>>> Emalayan, > > >>>> > > >>>> I wasnt paying much attention to the actual IP address returned > > >>>> by > > >>>> hostname in the zeptoos profile. > > >>>> > > >>>> Since these are the addresses that Mosa will communicate over, > > >>>> I > > >>>> think > > >>>> you *do* want them to be the 192.168.1.* addresses of the nodes > > >>>> on > > >>>> the > > >>>> torus network (in other words tun0). 
> > >>>> > > >>>> So, since both profiles return 192.168.1.64 for the tun0 IP, I > > >>>> think > > >>>> thats what you should use. So try replacing `hostname` in > > >>>> worker.pl > > >>>> with something like: > > >>>> > > >>>> `ifconfig | grep 192.168 | sed -e 's/^inet addr://' -e 's/ > > >>>> .*//'` > > >>>> > > >>>> You may have to adapt this a bit to meet your needs. Im > > >>>> assuming > > >>>> that > > >>>> the only code that will uses these IPs is MosaStore. > > >>>> > > >>>> - Mike > > >>>> > > >>>> > > >>>> ----- Original Message ----- > > >>>>> From: "Kazutomo Yoshii"< kazutomo at mcs.anl.gov > > > >>>>> To: zeptoos at lists.mcs.anl.gov > > >>>>> Sent: Saturday, March 3, 2012 8:52:00 PM > > >>>>> Subject: Re: [ZeptoOS] hostname returns none in Surveyor > > >>>>> Hi Emalayan, > > >>>>> > > >>>>> The zeptoos profile returns the IP address of associated I/O > > >>>>> node, > > >>>>> which is kind of wrong in my opinion (influence of IBM CNK). > > >>>>> ifconfig on compute nodes returns CN's IP address, which is > > >>>>> correct. > > >>>>> e.g. tun0 192.168.1.64 > > >>>>> > > >>>>> If you want to find associated ION's IP address from CNs, > > >>>>> do something like this. > > >>>>> > > >>>>> $ grep BG_IP= /proc/personality.sh > > >>>>> > > >>>>> - kaz > > >>>>> > > >>>>> On 03/03/2012 08:25 PM, Emalayan Vairavanathan wrote: > > >>>>>> Hi All, > > >>>>>> > > >>>>>> I am trying to run some experiments in Surveyor. The software > > >>>>>> I > > >>>>>> am > > >>>>>> using > > >>>>>> gets the IP-address of compute-nodes using hostname command. > > >>>>>> > > >>>>>> With zepto-vn-eval/mosatest profile hostname command returns > > >>>>>> none. > > >>>>>> But with zeptoos profile hostname returns the correct IP > > >>>>>> address. 
> > >>>>>> > > >>>>>> Is this due to some configuration issues in > > >>>>>> zepto-vn-eval/mosatest > > >>>>>> profile?As a workaround I tired to use ifconfig with both > > >>>>>> profiles, > > >>>>>> but > > >>>>>> it seems ifconfig is not returning the correct IP address. > > >>>>>> > > >>>>>> Is there any command / files which I can used to retrieve the > > >>>>>> hostname > > >>>>>> on compute nodes? I have pasted the console output with both > > >>>>>> profiles > > >>>>>> below. Please let me know if you need more details. > > >>>>>> > > >>>>>> Thank you > > >>>>>> Emalayan > > >>>>>> > > >>>>>> > > >>>>>> =======================With zeptoos profile > > >>>>>> =============================== > > >>>>>> > > >>>>>> / # hostname > > >>>>>> 172.18.3.19 > > >>>>>> / # > > >>>>>> / # cat /proc/sys/kernel/hostname > > >>>>>> 172.18.3.19 > > >>>>>> / # > > >>>>>> / # > > >>>>>> / # ifconfig -a > > >>>>>> lo Link encap:Local Loopback > > >>>>>> inet addr:127.0.0.1 Mask:255.0.0.0 > > >>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1 > > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > > >>>>>> collisions:0 txqueuelen:0 > > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) > > >>>>>> > > >>>>>> tun0 Link encap:UNSPEC HWaddr > > >>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 > > >>>>>> inet addr:192.168.1.64 P-t-P:192.168.1.254 > > >>>>>> Mask:255.255.255.255 > > >>>>>> UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1 > > >>>>>> RX packets:2662 errors:0 dropped:0 overruns:0 frame:0 > > >>>>>> TX packets:1772 errors:0 dropped:0 overruns:0 carrier:0 > > >>>>>> collisions:0 txqueuelen:500 > > >>>>>> RX bytes:140206 (136.9 KiB) TX bytes:125412 (122.4 KiB) > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> =======================With zepto-vn-eval/mosatest profile > > >>>>>> =============================== > > >>>>>> > > >>>>>> /etc # hostname > > >>>>>> (none) > > >>>>>> /etc # > > >>>>>> /etc # cat 
/proc/sys/kernel/hostname > > >>>>>> (none) > > >>>>>> /etc # > > >>>>>> /etc # ifconfig -a > > >>>>>> eth0 Link encap:Ethernet HWaddr 00:80:46:00:00:00 > > >>>>>> BROADCAST MULTICAST MTU:1500 Metric:1 > > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > > >>>>>> collisions:0 txqueuelen:1000 > > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) > > >>>>>> > > >>>>>> eth1 Link encap:Ethernet HWaddr 00:80:47:00:00:00 > > >>>>>> BROADCAST MULTICAST MTU:1500 Metric:1 > > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > > >>>>>> collisions:0 txqueuelen:1000 > > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) > > >>>>>> > > >>>>>> lo Link encap:Local Loopback > > >>>>>> inet addr:127.0.0.1 Mask:255.0.0.0 > > >>>>>> inet6 addr: ::1/128 Scope:Host > > >>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1 > > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > > >>>>>> collisions:0 txqueuelen:0 > > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) > > >>>>>> > > >>>>>> sit0 Link encap:IPv6-in-IPv4 > > >>>>>> NOARP MTU:1480 Metric:1 > > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > > >>>>>> collisions:0 txqueuelen:0 > > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) > > >>>>>> > > >>>>>> tun0 Link encap:UNSPEC HWaddr > > >>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 > > >>>>>> inet addr:192.168.1.64 P-t-P:192.168.1.254 > > >>>>>> Mask:255.255.255.255 > > >>>>>> UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1 > > >>>>>> RX packets:965 errors:0 dropped:0 overruns:0 frame:0 > > >>>>>> TX packets:627 errors:0 dropped:0 overruns:0 carrier:0 > > >>>>>> collisions:0 txqueuelen:500 > > >>>>>> RX bytes:50984 (49.7 KiB) TX bytes:50530 (49.3 KiB) > > >>>>>> > > >>>> -- > > >>>> Michael Wilde > > 
> > >>>> Computation Institute, University of Chicago
> > >>>> Mathematics and Computer Science Division
> > >>>> Argonne National Laboratory
> > >>>>
> > >>>> _______________________________________________
> > >>>> Swift-devel mailing list
> > >>>> Swift-devel at ci.uchicago.edu
> > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > >>>>
> > >>>>
> > >>>>
> > >>>> _______________________________________________
> > >>>> Swift-devel mailing list
> > >>>> Swift-devel at ci.uchicago.edu
> > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Sun Mar 4 19:08:13 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 4 Mar 2012 19:08:13 -0600 (CST)
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <1330908865.58128.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID: <380186591.59849.1330909693755.JavaMail.root@zimbra.anl.gov>

Hi Emalayan,

I don't see anything wrong with what you pasted here. But I'd like to see BG_PSETORG for each node.

I would not trust the answers for any job with a node count that is not a multiple of 64 nodes (i.e., some number of complete psets).

I suspect the system is always going to round up your allocation to some number of psets. It just might limit the number of nodes within the pset on which jobs are started.

Feel free to text me on Skype to discuss.
- Mike ----- Original Message ----- > From: "Emalayan Vairavanathan" > To: "ZHAO ZHANG" > Cc: "Michael Wilde" , "Justin M Wozniak" , swift-devel at ci.uchicago.edu > Sent: Sunday, March 4, 2012 6:54:25 PM > Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor > Zhao, Thank you, I think this is not going to work with allocations > with fraction of PSET. Please see below (I have pasted personality.sh > with 2 node ) > > > May be we need to get it from other sources. Or may be swift can pass > this information to workers once the connection has been established. > > > Regards > Emalayan > > > > BG_UCI=68021700 > BG_LOCATION=R00-M0-N08-J23 > BG_MAC=12:01:13:80:00:00 > BG_IP=172.18.1.19 > BG_NETMASK=255.240.0.0 > BG_BROADCAST=172.32.255.255 > BG_GATEWAY=172.16.3.5 > BG_MTU=9000 > BG_FS=172.17.3.1 > BG_EXPORTDIR="/bgsys" > BG_SIMULATION=0 > BG_PSETNUM=0 > BG_NUMPSETS=1 > BG_NODESINPSET=64 > BG_XSIZE=4 > BG_YSIZE=4 > BG_ZSIZE=4 > BG_VERBOSE=0 > BG_PSETSIZE="4 4 4" > BG_PSETORG="0 0 0" > BG_CLOCKHZ=850 > BG_GLINTS=1 > BG_ISTORUS="XYZ" > BG_BLOCKID="ANL-R00-M0-N08-64" > BG_SN=172.17.3.1 > BG_IS_IO_NODE=0 > BG_RANK_IN_PSET=64 > BG_RANK=0 > BG_IP_OVER_COL=0 > BG_IP_OVER_TOR=0 > BG_IP_OVER_COL_VC=0 > BG_CIO_MODE=FULL > BG_BGSYS_FS_TYPE=NFSv3 > BG_HTC_MODE=0 > > > > > > > From: ZHAO ZHANG > To: Emalayan Vairavanathan > Cc: Michael Wilde ; Justin M Wozniak > ; "swift-devel at ci.uchicago.edu" > > Sent: Sunday, 4 March 2012 4:28 PM > Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor > > > > Hi, > > I am attaching one personality.sh file of a 4096 CN allocation. From > that you can see, there are couple of ways to calculate the > allocation size: > 1) BG_BLOCKID="ANL-R10-R13-4096" > 2) Multiplication of BG_XSIZE, BG_YSIZE, BG_ZSIZE > > I am not quite sure how it will work on allocation that is smaller > than 64, you could simply give it a check. 
> > best > zhao > > BG_UCI=68801700 > BG_LOCATION=R10-M0-N00-J23 > BG_MAC=00:00:00:00:00:00 > BG_IP=0.0.0.0 > BG_NETMASK=255.255.255.112 > BG_BROADCAST=0.0.0.0 > BG_GATEWAY=0.0.0.0 > BG_MTU=1536 > BG_FS=0.0.0.0 > BG_EXPORTDIR="" > BG_SIMULATION=0 > BG_PSETNUM=0 > BG_NUMPSETS=64 > BG_NODESINPSET=64 > BG_XSIZE=8 > BG_YSIZE=16 > BG_ZSIZE=32 > BG_VERBOSE=0 > BG_PSETSIZE="4 4 4" > BG_PSETORG="0 0 0" > BG_CLOCKHZ=850 > BG_GLINTS=1 > BG_ISTORUS="" > BG_BLOCKID="ANL-R10-R13-4096" > BG_SN=0.0.0.0 > BG_IS_IO_NODE=0 > BG_RANK_IN_PSET=64 > BG_RANK=0 > BG_IP_OVER_COL=0 > BG_IP_OVER_TOR=0 > BG_IP_OVER_COL_VC=0 > BG_CIO_MODE=FULL > BG_BGSYS_FS_TYPE=NFSv3 > BG_HTC_MODE=0 > > > On 3/4/2012 6:23 PM, Emalayan Vairavanathan wrote: > > > > Zhao, Thank you very much for the answers. > > > One more question: :) > > > Do you know how I can calculate this ? Does this method works > regardless of the number of nodes allocated (even with any fraction of > pset ) ? > > > Thank you > Emalayan > > > > > > From: ZHAO ZHANG > To: Emalayan Vairavanathan > Cc: Michael Wilde ; Justin M Wozniak > ; "swift-devel at ci.uchicago.edu" > > Sent: Sunday, 4 March 2012 4:06 PM > Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor > > > > Hi, Emalayan > > On 3/4/2012 6:00 PM, Emalayan Vairavanathan wrote: > > > > Mike, that sounds like a good idea. > > > Zhao , In addition to Mike's questions I have two more questions. > > > 1) Is it possible to get/ calculate the MAX_RANK / number of nodes in > an allocation from persoanlity.h or some other data structure ? Yes, > you could calculate the MAX_RANK from personality.sh. > > > > > > > 2) Which interface should be configured to have Tours ? (Does this > matter at all ?) > In your scripts you are configuring eth1. But in > http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages tun1 is > configured. To use the torus network, there are two ways. One is to > use the 12.x.y.z+1 interface, which we have to configure ourselves. 
> The other way is to use the "ipfwd.sh", aka the 10.128 interface. The > drawback of the second interface is it takes up one core > for polling, and there is some scalability issue beyond 2K compute > nodes as far as I remember. Mosa could use either of them. > > best > zhao > > > > > > > Thank you > Emalayan > > > > > > > From: Michael Wilde > To: ZHAO ZHANG ; Justin M Wozniak > > Cc: Emalayan Vairavanathan ; > swift-devel at ci.uchicago.edu > Sent: Sunday, 4 March 2012 2:33 PM > Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor > > Zhao, with this procedure do you get consecutive host IP addresses > starting from 0.0 through 640*64 in the two low order octets? > > In other words, does your just do what this page describes under "IP > over Torus": > > http://wiki.mcs.anl.gov/zeptoos/index.php/Other_Packages > > Is the "ipfwd.sh" script mentioned there still needed, or does that > now happen automatically? > > If so, perhaps we can greatly simplify the Mosa startup: we need only > pass the max rank of the running job, and Mosa will know that it can > use 12.128.0.0 for example. Then we dont need any manual intervention, > nor complicated/brittle file-waiting logic. > > Zhao, I dont understand why your example is using the 12.0.0.0 network > vs the example on the page above which uses 10.128.0.0. Can you help > me understand what is going on here? Is the "IP Over Torus" info on > the ZeptoOS wiki outdated? Or does it describe a different technique? > > Justin, have you also mastered similar techniques for JETS? Do we need > help form the ZeptoOS team on this? 
> > Thanks, > > - Mike > > > > ----- Original Message ----- > > From: "ZHAO ZHANG" < zhaozhang at uchicago.edu > > > To: "Michael Wilde" < wilde at mcs.anl.gov > > > Cc: "Emalayan Vairavanathan" < svemalayan at yahoo.com >, > > swift-devel at ci.uchicago.edu > > Sent: Sunday, March 4, 2012 1:17:18 PM > > Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in > > Surveyor > > Yes, each compute node needs to run this script to bring up the > > network > > interface. > > > > zhao > > > > On 3/4/2012 12:53 PM, Michael Wilde wrote: > > > Thanks, Zhao. Does this need to run on each node at startup? > > > > > > If so should this logic be integrated into the worker startup > > > script, Jon, Justin, Emalayan? > > > > > > Ive not looked at the current scripts much; I would think that all > > > the BG/P specific logic of enabling the torus network and finding > > > each node's IP address on the torus should be done in the init > > > script rather than in the worker. > > > > > > - Mike > > > > > > ----- Original Message ----- > > >> From: "ZHAO ZHANG"< zhaozhang at uchicago.edu > > > >> To: "Michael Wilde"< wilde at mcs.anl.gov > > > >> Cc: "Emalayan Vairavanathan"< svemalayan at yahoo.com >, > > >> swift-devel at ci.uchicago.edu > > >> Sent: Sunday, March 4, 2012 12:18:28 PM > > >> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in > > >> Surveyor > > >> Hi, Mike > > >> > > >> With 192.168.1.*, we could only access the tree network. In order > > >> to > > >> use > > >> the torus network, we need to use the 12.x.y.z+1 ip address. (x, > > >> y, > > >> z > > >> here is the coordinates of the compute nodes). > > >> The code below could bring the torus ip address up. 
> > >> > > >> IP="" > > >> set_torus_ip() > > >> { > > >> x=$1 > > >> y=$2 > > >> z=$3 > > >> z=`expr $3 + 1` > > >> ifconfig eth1 12.$x.$y.$z netmask 255.0.0.0 mtu 8996 -arp > > >> IP=12.$x.$y.$z > > >> } > > >> BG_PSETORG=`cat /proc/personality.sh | grep BG_PSETORG | cut -d > > >> '"' > > >> -f > > >> 2` > > >> echo ${BG_PSETORG}>> /dev/shm/localip > > >> set_torus_ip $BG_PSETORG > > >> > > >> best > > >> zhao > > >> > > >> On 3/4/2012 10:24 AM, Michael Wilde wrote: > > >>> Zhao, > > >>> > > >>> Can you tell us if the nodes on the torus network are accessed > > >>> over > > >>> the 192.168 network? I just realized they cant all be on the > > >>> 192.168.1 subnet, so I hope I suggested the right network here. > > >>> > > >>> Thanks, > > >>> > > >>> - Mike > > >>> > > >>> ----- Original Message ----- > > >>>> From: "Emalayan Vairavanathan"< svemalayan at yahoo.com > > > >>>> To: swift-devel at ci.uchicago.edu > > >>>> Sent: Sunday, March 4, 2012 1:40:53 AM > > >>>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in > > >>>> Surveyor > > >>>> Thank you very much Mike. I agree with you suggestion. I can do > > >>>> that > > >>>> in worker.pl. > > >>>> > > >>>> > > >>>> Thank you > > >>>> Emalayan > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> From: Michael Wilde< wilde at mcs.anl.gov > > > >>>> To: emalayan at ece.ubc.ca > > >>>> Cc: swift-devel< swift-devel at ci.uchicago.edu > > > >>>> Sent: Saturday, 3 March 2012 7:39 PM > > >>>> Subject: Re: [Swift-devel] [ZeptoOS] hostname returns none in > > >>>> Surveyor > > >>>> > > >>>> Emalayan, > > >>>> > > >>>> I wasnt paying much attention to the actual IP address returned > > >>>> by > > >>>> hostname in the zeptoos profile. > > >>>> > > >>>> Since these are the addresses that Mosa will communicate over, > > >>>> I > > >>>> think > > >>>> you *do* want them to be the 192.168.1.* addresses of the nodes > > >>>> on > > >>>> the > > >>>> torus network (in other words tun0). 
> > >>>> > > >>>> So, since both profiles return 192.168.1.64 for the tun0 IP, I > > >>>> think > > >>>> thats what you should use. So try replacing `hostname` in > > >>>> worker.pl > > >>>> with something like: > > >>>> > > >>>> `ifconfig | grep 192.168 | sed -e 's/^inet addr://' -e 's/ > > >>>> .*//'` > > >>>> > > >>>> You may have to adapt this a bit to meet your needs. Im > > >>>> assuming > > >>>> that > > >>>> the only code that will uses these IPs is MosaStore. > > >>>> > > >>>> - Mike > > >>>> > > >>>> > > >>>> ----- Original Message ----- > > >>>>> From: "Kazutomo Yoshii"< kazutomo at mcs.anl.gov > > > >>>>> To: zeptoos at lists.mcs.anl.gov > > >>>>> Sent: Saturday, March 3, 2012 8:52:00 PM > > >>>>> Subject: Re: [ZeptoOS] hostname returns none in Surveyor > > >>>>> Hi Emalayan, > > >>>>> > > >>>>> The zeptoos profile returns the IP address of associated I/O > > >>>>> node, > > >>>>> which is kind of wrong in my opinion (influence of IBM CNK). > > >>>>> ifconfig on compute nodes returns CN's IP address, which is > > >>>>> correct. > > >>>>> e.g. tun0 192.168.1.64 > > >>>>> > > >>>>> If you want to find associated ION's IP address from CNs, > > >>>>> do something like this. > > >>>>> > > >>>>> $ grep BG_IP= /proc/personality.sh > > >>>>> > > >>>>> - kaz > > >>>>> > > >>>>> On 03/03/2012 08:25 PM, Emalayan Vairavanathan wrote: > > >>>>>> Hi All, > > >>>>>> > > >>>>>> I am trying to run some experiments in Surveyor. The software > > >>>>>> I > > >>>>>> am > > >>>>>> using > > >>>>>> gets the IP-address of compute-nodes using hostname command. > > >>>>>> > > >>>>>> With zepto-vn-eval/mosatest profile hostname command returns > > >>>>>> none. > > >>>>>> But with zeptoos profile hostname returns the correct IP > > >>>>>> address. 
> > >>>>>> > > >>>>>> Is this due to some configuration issues in > > >>>>>> zepto-vn-eval/mosatest > > >>>>>> profile?As a workaround I tired to use ifconfig with both > > >>>>>> profiles, > > >>>>>> but > > >>>>>> it seems ifconfig is not returning the correct IP address. > > >>>>>> > > >>>>>> Is there any command / files which I can used to retrieve the > > >>>>>> hostname > > >>>>>> on compute nodes? I have pasted the console output with both > > >>>>>> profiles > > >>>>>> below. Please let me know if you need more details. > > >>>>>> > > >>>>>> Thank you > > >>>>>> Emalayan > > >>>>>> > > >>>>>> > > >>>>>> =======================With zeptoos profile > > >>>>>> =============================== > > >>>>>> > > >>>>>> / # hostname > > >>>>>> 172.18.3.19 > > >>>>>> / # > > >>>>>> / # cat /proc/sys/kernel/hostname > > >>>>>> 172.18.3.19 > > >>>>>> / # > > >>>>>> / # > > >>>>>> / # ifconfig -a > > >>>>>> lo Link encap:Local Loopback > > >>>>>> inet addr:127.0.0.1 Mask:255.0.0.0 > > >>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1 > > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > > >>>>>> collisions:0 txqueuelen:0 > > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) > > >>>>>> > > >>>>>> tun0 Link encap:UNSPEC HWaddr > > >>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 > > >>>>>> inet addr:192.168.1.64 P-t-P:192.168.1.254 > > >>>>>> Mask:255.255.255.255 > > >>>>>> UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1 > > >>>>>> RX packets:2662 errors:0 dropped:0 overruns:0 frame:0 > > >>>>>> TX packets:1772 errors:0 dropped:0 overruns:0 carrier:0 > > >>>>>> collisions:0 txqueuelen:500 > > >>>>>> RX bytes:140206 (136.9 KiB) TX bytes:125412 (122.4 KiB) > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> =======================With zepto-vn-eval/mosatest profile > > >>>>>> =============================== > > >>>>>> > > >>>>>> /etc # hostname > > >>>>>> (none) > > >>>>>> /etc # > > >>>>>> /etc # cat 
/proc/sys/kernel/hostname > > >>>>>> (none) > > >>>>>> /etc # > > >>>>>> /etc # ifconfig -a > > >>>>>> eth0 Link encap:Ethernet HWaddr 00:80:46:00:00:00 > > >>>>>> BROADCAST MULTICAST MTU:1500 Metric:1 > > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > > >>>>>> collisions:0 txqueuelen:1000 > > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) > > >>>>>> > > >>>>>> eth1 Link encap:Ethernet HWaddr 00:80:47:00:00:00 > > >>>>>> BROADCAST MULTICAST MTU:1500 Metric:1 > > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > > >>>>>> collisions:0 txqueuelen:1000 > > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) > > >>>>>> > > >>>>>> lo Link encap:Local Loopback > > >>>>>> inet addr:127.0.0.1 Mask:255.0.0.0 > > >>>>>> inet6 addr: ::1/128 Scope:Host > > >>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1 > > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > > >>>>>> collisions:0 txqueuelen:0 > > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) > > >>>>>> > > >>>>>> sit0 Link encap:IPv6-in-IPv4 > > >>>>>> NOARP MTU:1480 Metric:1 > > >>>>>> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > > >>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > > >>>>>> collisions:0 txqueuelen:0 > > >>>>>> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) > > >>>>>> > > >>>>>> tun0 Link encap:UNSPEC HWaddr > > >>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 > > >>>>>> inet addr:192.168.1.64 P-t-P:192.168.1.254 > > >>>>>> Mask:255.255.255.255 > > >>>>>> UP POINTOPOINT RUNNING NOARP MULTICAST MTU:65535 Metric:1 > > >>>>>> RX packets:965 errors:0 dropped:0 overruns:0 frame:0 > > >>>>>> TX packets:627 errors:0 dropped:0 overruns:0 carrier:0 > > >>>>>> collisions:0 txqueuelen:500 > > >>>>>> RX bytes:50984 (49.7 KiB) TX bytes:50530 (49.3 KiB) > > >>>>>> > > >>>> -- > > >>>> Michael Wilde > > 
>>>> Computation Institute, University of Chicago
> > >>>> Mathematics and Computer Science Division
> > >>>> Argonne National Laboratory
> > >>>>
> > >>>> _______________________________________________
> > >>>> Swift-devel mailing list
> > >>>> Swift-devel at ci.uchicago.edu
> > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> > --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From zhaozhang at uchicago.edu Sun Mar 4 19:58:13 2012
From: zhaozhang at uchicago.edu (ZHAO ZHANG)
Date: Sun, 04 Mar 2012 19:58:13 -0600
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <1330908865.58128.YahooMailNeo@web39508.mail.mud.yahoo.com>
References: <4F53BFBE.3020205@uchicago.edu> <888718346.59715.1330900432048.JavaMail.root@zimbra.anl.gov> <1330905648.31137.YahooMailNeo@web39508.mail.mud.yahoo.com> <4F540390.8010505@uchicago.edu> <1330906988.77015.YahooMailNeo@web39502.mail.mud.yahoo.com> <4F540898.5060402@uchicago.edu> <1330908865.58128.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID: <4F541DB5.2050302@uchicago.edu>

Hi, Emalayan

As Mike said in another email, a run on an allocation that is not a multiple of 64 compute nodes will still be charged as a multiple of 64 compute nodes. That means that even if we could run something on 32 compute nodes, the system would still allocate us 64 and charge our account for the core hours. So we could set 64 compute nodes as our smallest test case.

zhao

On 3/4/2012 6:54 PM, Emalayan Vairavanathan wrote:
> Zhao, Thank you. I think this is not going to work with allocations
> that are a fraction of a PSET. Please see below (I have pasted
> personality.sh with 2 nodes).
>
> Maybe we need to get it from other sources. Or maybe Swift can pass
> this information to the workers once the connection has been established.
>
> Regards
> Emalayan
>
> BG_UCI=68021700
> BG_LOCATION=R00-M0-N08-J23
> BG_MAC=12:01:13:80:00:00
> BG_IP=172.18.1.19
> BG_NETMASK=255.240.0.0
> BG_BROADCAST=172.32.255.255
> BG_GATEWAY=172.16.3.5
> BG_MTU=9000
> BG_FS=172.17.3.1
> BG_EXPORTDIR="/bgsys"
> BG_SIMULATION=0
> BG_PSETNUM=0
> BG_NUMPSETS=1
> BG_NODESINPSET=64
> BG_XSIZE=4
> BG_YSIZE=4
> BG_ZSIZE=4
> BG_VERBOSE=0
> BG_PSETSIZE="4 4 4"
> BG_PSETORG="0 0 0"
> BG_CLOCKHZ=850
> BG_GLINTS=1
> BG_ISTORUS="XYZ"
> BG_BLOCKID="ANL-R00-M0-N08-64"
> BG_SN=172.17.3.1
> BG_IS_IO_NODE=0
> BG_RANK_IN_PSET=64
> BG_RANK=0
> BG_IP_OVER_COL=0
> BG_IP_OVER_TOR=0
> BG_IP_OVER_COL_VC=0
> BG_CIO_MODE=FULL
> BG_BGSYS_FS_TYPE=NFSv3
> BG_HTC_MODE=0
>
> ------------------------------------------------------------------------
> *From:* ZHAO ZHANG
> *To:* Emalayan Vairavanathan
> *Cc:* Michael Wilde; Justin M Wozniak; "swift-devel at ci.uchicago.edu"
> *Sent:* Sunday, 4 March 2012 4:28 PM
> *Subject:* Re: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
>
> Hi,
>
> I am attaching one personality.sh file of a 4096 CN allocation. From
> that you can see there are a couple of ways to calculate the
> allocation size:
> 1) BG_BLOCKID="ANL-R10-R13-4096"
> 2) Multiplication of BG_XSIZE, BG_YSIZE, BG_ZSIZE
>
> I am not quite sure how it will work on an allocation that is smaller
> than 64; you could simply give it a check.
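
For reference, the two calculations Zhao lists can be sketched in shell. This is only an illustration under assumptions: the helper names (get_field, alloc_size_from_dims, alloc_size_from_blockid) are made up here and are not part of ZeptoOS or Swift, and the file is assumed to hold plain KEY=value lines like the dumps in this thread. Going by Emalayan's 2-node dump, both methods would report the size of the booted block (64) rather than the number of nodes requested.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of ZeptoOS or Swift.

# Print the value of KEY ($1) from a personality file ($2) of KEY=value
# lines, with any double quotes removed.
get_field() {
    grep "^$1=" "$2" | head -n 1 | cut -d= -f2- | tr -d '"'
}

# Method 2 from the thread: BG_XSIZE * BG_YSIZE * BG_ZSIZE.
alloc_size_from_dims() {
    x=$(get_field BG_XSIZE "$1")
    y=$(get_field BG_YSIZE "$1")
    z=$(get_field BG_ZSIZE "$1")
    echo $(( x * y * z ))
}

# Method 1 from the thread: the number after the last '-' in BG_BLOCKID,
# e.g. ANL-R10-R13-4096 -> 4096.
alloc_size_from_blockid() {
    get_field BG_BLOCKID "$1" | sed 's/.*-//'
}

# Example use on a compute node (path as used in this thread):
#   alloc_size_from_dims /proc/personality.sh
```

Parsing the file with grep rather than sourcing it keeps the sketch from executing anything the file might contain; sourcing may be just as reasonable on a real node.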

From wilde at mcs.anl.gov Mon Mar 5 08:17:42 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 5 Mar 2012 08:17:42 -0600 (CST)
Subject: [Swift-devel] Configuring Swift to access MosaStore
In-Reply-To: <00ed01ccf8d4$ac72c660$05585320$@gmail.com>
Message-ID: <776537591.60555.1330957062994.JavaMail.root@zimbra.anl.gov>

was: Re: [Swift-devel] coasters-hosts.pl script

Jon, can you create a demo script that shows how to configure a Swift run to use MosaStore.

The following approach may work:

- Assume MosaStore will be mounted as /mosa to all workers

- Simulate this with a localhost run, using /tmp/mosa, then do same with *1* worker, N jobs per node (eg 4 on BG/P, 8 on PADS, 2 on Beagle).

- Set CDM direct mode for all paths starting with [/tmp]/mosa.
  You might need to work through some of the issues with CDM direct where accesses need to match both /tmp/mosa and file:///tmp/mosa (I *think*)

- Map some temporary output-to-input files to /tmp/mosa; create a multi-level "catsncats"-like workflow to exercise it; the recent ParameterSweep example, perhaps extended to do N levels of fan-in/fan-out and pass-N, might be a good test.

- See if you can get _concurrent placed on /tmp/mosa

I think some of these tests would be a great test case for Swift/Turbine as well.

You can do this in stages; the simple test of mapping CDM-direct files to /tmp/mosa should give Emalayan an initial test case to run once Mosa is ready on the BG/P.

- Mike

----- Original Message -----
> From: "Matei Ripeanu"
> To: mosastore at googlegroups.com, "Jonathan Monette" , "Justin M Wozniak"
> Cc: swift-devel at ci.uchicago.edu, emalayan at ece.ubc.ca
> Sent: Friday, March 2, 2012 6:29:17 PM
> Subject: Re: [Swift-devel] coasters-hosts.pl script
> Indeed this is good news! Thank you.
>
> Our next task, I think, will be to figure out how to configure Swift
> so that the headnode (where Swift runs) will not require any access to
> intermediate storage (MosaStore). Only the worker nodes will have
> access to intermediate storage. This is to work around the one-way
> headnode-to-worker-node connectivity issue.
>
> Any guidance on how to get this configuration would be much
> appreciated.
>
> Thank you again,
>
> -Matei
>
> From: mosastore at googlegroups.com [mailto:mosastore at googlegroups.com]
> On Behalf Of Emalayan Vairavanathan
> Sent: March-02-12 2:32 PM
> To: Jonathan Monette; Justin M Wozniak
> Cc: swift-devel at ci.uchicago.edu Devel; emalayan at ece.ubc.cais ;
> MosaStore
> Subject: Re: [Swift-devel] coasters-hosts.pl script
>
> Thank you Jon and Justin.
>
> This is great news. I will get back to you if I have questions.
>
> Regards
>
> Emalayan
>
> From: Jonathan Monette < jonmon at mcs.anl.gov >
> To: Justin M Wozniak < wozniak at mcs.anl.gov >
> Cc: " swift-devel at ci.uchicago.edu Devel " <
> swift-devel at ci.uchicago.edu >; emalayan at ece.ubc.ca
> Sent: Friday, 2 March 2012 2:21 PM
> Subject: Re: [Swift-devel] coasters-hosts.pl script
>
> Emalayan,
> We believe we have fixed the issue. You can copy the new
> coasters-hosts.pl script from
> ~jonmon/surveyor/worker-init-test/coasters-hosts.pl
>
> This script reads the worker logs located in the logs directory. The
> steps to run are as follows:
> start-coaster-service
>
> ./coasters-hosts.pl logs/worker-*.log > worker-hosts.txt
>
> You MUST clean out the worker logs after you before you start a new
> coaster service to make sure the script searches the right worker log
> files. This may not be ideal at the moment but this will help get you
> started. If you have any other questions feel free to ask. We will
> need to update the mosaswift site with the new information, we will do
> this soon.
>
> On Mar 2, 2012, at 11:26 AM, Jonathan Monette wrote:
>
> > Can we match this line: 2012/03/02 17:16:04.712 INFO - Running on
> > node 172.18.1.83 from the worker log,
> > instead of this line: 2012-03-02 17:21:25,214+0000 DEBUG Cpu worker
> > started: block=2012.0302.171344.704 host=172.18.1.83 id=0 from the
> > cps log?
> >
> > They both provide the same ip addresses. And the worker log always
> > has that ip address before the cps log does.
> >
> > On Mar 2, 2012, at 11:15 AM, Jonathan Monette wrote:
> >
> >> That fix still did not work. I had moved it to the same spot. It is
> >> still waiting for the worker-init.pl script to finish before the ip
> >> addresses are printed to the cps log. Those ip addresses are what
> >> is needed by the coaster-hosts.pl script to finish. If I create an
> >> empty file for the coaster-host.pl script to read, then the work
> >> continues and the ip addresses show up in the cps log.
> >>
> >> Why is log4j waiting to add those lines to the cps log after the
> >> worker-init.pl script is finished?
> >>
> >> On Mar 2, 2012, at 11:05 AM, Jonathan Monette wrote:
> >>
> >>> Thanks, in my copy I thought I had moved the reconnect to before
> >>> the init-cmd and it still wasn't working. I will test with your
> >>> change. I just verified that it was indeed waiting for the
> >>> worker-init.pl script to finish. I created an empty file for the
> >>> script to read and it finished connecting and the ip addresses I
> >>> needed were added to the cps log. I will also be testing your fix.
> >>>
> >>> On Mar 2, 2012, at 11:01 AM, Justin M Wozniak wrote:
> >>>
> >>>>
> >>>> Yes- I must have tested this with a different log file. I just
> >>>> checked in and installed in ~wozniak/Public a fix for this that
> >>>> launches WORKER_INIT_CMD after the reconnect(). I am a little
> >>>> worried about time outs but it works so far. I will continue
> >>>> testing...
> >>>> Justin
> >>>>
> >>>> On Thu, 1 Mar 2012, Jonathan Monette wrote:
> >>>>
> >>>>> Justin,
> >>>>> So I have been trying to help Emalayan get the host list file
> >>>>> for the worker-init.pl script. It seems the cps log file is not
> >>>>> providing the ip addresses for the coasters-hosts.pl script. I
> >>>>> thought this was maybe because we did not have the correct log4j
> >>>>> setting set but we have the Coaster service Cpu set to DEBUG. So
> >>>>> for some reason the workers are not connecting to the service.
> >>>>> When I comment out the export WORKER_ENVIRONEMTN="?" line in the
> >>>>> coaster-service.conf file I see the workers connect and the cps
> >>>>> log file shows there ip addresses. However when setting this
> >>>>> line it seems they are not connecting.
> >>>>>
> >>>>> Emalayan thought there might be some sort of circular dependency
> >>>>> going with the host-list file and the worker. The worker
> >>>>> requires the host-list file so that it can run the
> >>>>> worker-init.pl script and then connect but the host-list file
> >>>>> cannot be generated because the workers cannot connect. I
> >>>>> noticed in your swift-test directory the cps files did have the
> >>>>> ip addresses set and coasters-hosts.pl found the ip addresses
> >>>>> and reported them. Did you try that test with setting the
> >>>>> WORKER_ENVIRONMENT variable in the coaster-service.conf file?
> >>>>> Any idea what may be happening? The job is running when looking
> >>>>> under cqstat.
> >>>>>
> >>>>> A side note: At the mosaswift site, your example talks about
> >>>>> running the coasters-hosts.pl on the cps log but the example you
> >>>>> provide runs it on logs/coasters.log. This may need to be
> >>>>> changed. Also, should provide the log4j setting that is required
> >>>>> to generate the Cpu line with the worker ip address just to
> >>>>> clarify that this line should be set for this script to work.
> >>>>>
> >>>>> For reference, this line:
> >>>>> log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG
> >>>>
> >>>> --
> >>>> Justin M Wozniak
> >>>
> >>> _______________________________________________
> >>> Swift-devel mailing list
> >>> Swift-devel at ci.uchicago.edu
> >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
> --
> You received this message because you are subscribed to the Google
> Groups "MosaStore" group.
> To post to this group, send email to mosastore at googlegroups.com .
> To unsubscribe from this group, send email to
> mosastore+unsubscribe at googlegroups.com .
> For more options, visit this group at
> http://groups.google.com/group/mosastore?hl=en .

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wozniak at mcs.anl.gov Mon Mar 5 08:31:36 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 5 Mar 2012 08:31:36 -0600 (CST)
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To: <888718346.59715.1330900432048.JavaMail.root@zimbra.anl.gov>
References: <888718346.59715.1330900432048.JavaMail.root@zimbra.anl.gov>
Message-ID:

On Sun, 4 Mar 2012, Michael Wilde wrote:

> Justin, have you also mastered similar techniques for JETS? Do we need
> help form the ZeptoOS team on this?

Yes, in JETS, the workers have to both 1) connect to each other and 2)
connect to the service on the login node. I did this with the 12.***
network as Zhao described.

Justin

--
Justin M Wozniak

From jonmon at mcs.anl.gov Mon Mar 5 09:14:03 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 5 Mar 2012 09:14:03 -0600
Subject: [Swift-devel] Configuring Swift to access MosaStore
In-Reply-To: <776537591.60555.1330957062994.JavaMail.root@zimbra.anl.gov>
References: <776537591.60555.1330957062994.JavaMail.root@zimbra.anl.gov>
Message-ID:

Yea. I will get demo scripts together for the mosa tests.

From wilde at mcs.anl.gov Mon Mar 5 10:02:03 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 5 Mar 2012 10:02:03 -0600 (CST)
Subject: [Swift-devel] [ZeptoOS] hostname returns none in Surveyor
In-Reply-To:
Message-ID: <866379589.61227.1330963323480.JavaMail.root@zimbra.anl.gov>

> Yes, in JETS, the workers have to both 1) connect to each other and 2)
> connect to the service on the login node. I did this with the 12.***
> network as Zhao described.

I assume you used something other than the 12.*** net to connect to the login host? What are the guidelines for how to reach the login host from the server?

Let's augment the sites guide page for BG/P to fully describe the networking and init scripts for the BG/P, as they pertain to Swift.

I could also see some benefits, especially for ExM experiments, of running various things on the IOPs, which the ZeptoOS init script lets us do. So we should document how to do that and what the networking issues are.

- Mike

From svemalayan at yahoo.com Mon Mar 5 13:34:34 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Mon, 5 Mar 2012 11:34:34 -0800 (PST)
Subject: [Swift-devel] Configuring Swift to access MosaStore
In-Reply-To:
References: <776537591.60555.1330957062994.JavaMail.root@zimbra.anl.gov>
Message-ID: <1330976074.2431.YahooMailNeo@web39502.mail.mud.yahoo.com>

Thank you Jon.

Yesterday I successfully ran Mosa (on our cluster) in cdm-direct mode with the help of the swift-user manual and the scripts available in /cog/modules/swift/tests/cdm/absolute.

It would be useful if you can develop a simple test case. I can double check with my test case.

Thank you
Emalayan
URL: From svemalayan at yahoo.com Mon Mar 5 17:04:57 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Mon, 5 Mar 2012 15:04:57 -0800 (PST) Subject: [Swift-devel] Configuring Swift to access MosaStore In-Reply-To: <1330976074.2431.YahooMailNeo@web39502.mail.mud.yahoo.com> References: <776537591.60555.1330957062994.JavaMail.root@zimbra.anl.gov> <1330976074.2431.YahooMailNeo@web39502.mail.mud.yahoo.com> Message-ID: <1330988697.87200.YahooMailNeo@web39507.mail.mud.yahoo.com> Hi Jon, I just figured out that my catsn script is not working if Mosa is not mounted on the node where swift runs. I am not sure whether this is a configuration issue? or some problem with my cdm-rules / site-files / catsn script. Anyway without further debugging I am going to wait for the setup you are going to provide. Please let me know if I can help you with this. Thank you Emalayan ________________________________ From: Emalayan Vairavanathan To: Jonathan Monette ; Michael Wilde Cc: "emalayan at ece.ubc.ca" ; "matei at ece.ubc.ca" ; "swift-devel at ci.uchicago.edu" ; "mosastore at googlegroups.com" ; Jonathan Monette Sent: Monday, 5 March 2012 11:34 AM Subject: Re: [Swift-devel] Configuring Swift to access MosaStore Thank you Jon. Yesterday I successfully run Mosa? (on our cluster) with cdm-direct mode with the help of swift-user manual and the scripts available in?/cog/modules/swift/tests/cdm/absolute. It would be useful if you can develop a simple test case. I can double check with my test case. Thank you Emalayan ________________________________ From: Jonathan Monette To: Michael Wilde Cc: "emalayan at ece.ubc.ca" ; "matei at ece.ubc.ca" ; "swift-devel at ci.uchicago.edu" ; "mosastore at googlegroups.com" ; Jonathan Monette Sent: Monday, 5 March 2012 7:14 AM Subject: Re: [Swift-devel] Configuring Swift to access MosaStore Yea. I will get demo scripts together for the mosa tests. 
On Mar 5, 2012, at 8:17, Michael Wilde wrote: > was: Re: [Swift-devel] coasters-hosts.pl script > > Jon, can you create a demo script that shows how to configure a Swift run to use MosaStore. The following approach may work: > > - Assume MosaStore will be mounted as /mosa to all workers > > - Simulate this with a localhost run, using /tmp/mosa, then do same with *1* worker, N jobs per node (eg 4 on BG/P, 8 on PADS, 2 on Beagle). > > - Set CDM direct mode for all paths starting with [/tmp]/mosa. You might need to work through some of the issues with CDM direct where accesses need to match both /tmp/mosa and file:///tmp/mosa (I *think*) > > - Map some temporary output-to-input files to /tmp/mosa; create a multi-level "catsncats"-like workflow to exercise it; the recent ParameterSweep example, perhaps extended to do N levels of fan-in/fan-out and pass-N might be a good test. > > - see if you can get _concurrent to get placed on /tmp/mosa > > I think some of these tests would be a great test case for Swift/Turbine as well. > > You can do this is stages; the simple test of mapping CDM-direct files to /tmp/mosa should give Emalayan an initial test case to run once Mosa is ready on the BG/P. > > - Mike > > > ----- Original Message ----- >> From: "Matei Ripeanu" >> To: mosastore at googlegroups.com, "Jonathan Monette" , "Justin M Wozniak" >> Cc: swift-devel at ci.uchicago.edu, emalayan at ece.ubc.ca >> Sent: Friday, March 2, 2012 6:29:17 PM >> Subject: Re: [Swift-devel] coasters-hosts.pl script >> Indeed this is good news! Thank you. >> >> >> >> Our next task, I think, will be to figure out how to configure Swift >> so that the headnode (where Swift runs) will not require any access to >> intermediate storage (MosaStore). Only the worker nodes will have >> access to intermediate storage. This is to go around the one way >> headnode-worker node connectivity issue. >> >> >> >> Any guidance on how to get this configuration would be much >> appreciated. 
>> >> >> >> Thank you again, >> >> >> >> -Matei >> >> >> >> >> >> From: mosastore at googlegroups.com [mailto:mosastore at googlegroups.com] >> On Behalf Of Emalayan Vairavanathan >> Sent: March-02-12 2:32 PM >> To: Jonathan Monette; Justin M Wozniak >> Cc: swift-devel at ci.uchicago.edu Devel; emalayan at ece.ubc.cais ; >> MosaStore >> Subject: Re: [Swift-devel] coasters-hosts.pl script >> >> >> >> >> >> Thank you Jon and Justin. >> >> >> >> >> >> This is a great news. I will get back to you if I have questions. >> >> >> >> >> >> Regards >> >> >> Emalayan >> >> >> >> >> >> >> >> >> >> From: Jonathan Monette < jonmon at mcs.anl.gov > >> To: Justin M Wozniak < wozniak at mcs.anl.gov > >> Cc: " swift-devel at ci.uchicago.edu Devel " < >> swift-devel at ci.uchicago.edu >; emalayan at ece.ubc.ca >> Sent: Friday, 2 March 2012 2:21 PM >> Subject: Re: [Swift-devel] coasters-hosts.pl script >> >> >> Emalayan, >> We believe we have fixed the issue. You can copy the new >> coasters-hosts.pl script from >> ~jonmon/surveyor/worker-init-test/coasters-hosts.pl >> >> This script reads the worker logs located in the logs directory. The >> steps to run are as follows: >> start-coaster-service >> >> ./coasters-hosts.pl logs/worker-*.log > worker-hosts.txt >> >> You MUST clean out the worker logs after you before you start a new >> coaster service to make sure the script searches the right worker log >> files. This may not be ideal at the moment but this will help get you >> started. If you have any other questions feel free to ask. We will >> need to update the mosaswift site with the new information, we will do >> this soon. >> >> On Mar 2, 2012, at 11:26 AM, Jonathan Monette wrote: >> >>> Can we match this line: 2012/03/02 17:16:04.712 INFO - Running on >>> node 172.18.1.83 from the worker log, >>> instead of this line: 2012-03-02 17:21:25,214+0000 DEBUG Cpu worker >>> started: block=2012.0302.171344.704 host=172.18.1.83 id=0 from the >>> cps log? 
>>> >>> They both provide the same ip addresses. And the worker log always >>> has that ip address before the cps log does. >>> >>> On Mar 2, 2012, at 11:15 AM, Jonathan Monette wrote: >>> >>>> That fix still did not work. I had moved it to the same spot. It is >>>> still waiting for the worker-init.pl script to finish before the ip >>>> addresses are printed to the cps log. Those ip addresses are what >>>> is needed by the coaster-hosts.pl script to finish. If I create an >>>> empty file for the coaster-host.pl script to read, then the work >>>> continues and the ip addresses show up in the cps log. >>>> >>>> Why is log4j waiting to add those lines to the cps log after the >>>> worker-init.pl script is finished? >>>> >>>> On Mar 2, 2012, at 11:05 AM, Jonathan Monette wrote: >>>> >>>>> Thanks, in my copy I thought I had moved the reconnect to before >>>>> the init-cmd and it still wasn't working. I will test with your >>>>> change. I just verified that it was indeed waiting for the >>>>> worker-init.pl script to finish. I created an empty file for the >>>>> script to read and it finished connecting and the ip addresses I >>>>> needed were added to the cps log. I will also be testing your fix. >>>>> >>>>> On Mar 2, 2012, at 11:01 AM, Justin M Wozniak wrote: >>>>> >>>>>> >>>>>> Yes- I must have tested this with a different log file. I just >>>>>> checked in and installed in ~wozniak/Public a fix for this that >>>>>> launches WORKER_INIT_CMD after the reconnect(). I am a little >>>>>> worried about time outs but it works so far. I will continue >>>>>> testing... >>>>>> Justin >>>>>> >>>>>> On Thu, 1 Mar 2012, Jonathan Monette wrote: >>>>>> >>>>>>> Justin, >>>>>>> So I have been trying to help Emalayan get the host list file >>>>>>> for the worker-init.pl script. It seems the cps log file is not >>>>>>> providing the ip addresses for the coasters-hosts.pl script. 
I >>>>>>> thought this was maybe because we did not have the correct log4j >>>>>>> setting set but we have the Coaster service Cpu set to DEBUG. So >>>>>>> for some reason the workers are not connecting to the service. >>>>>>> When I comment out the export WORKER_ENVIRONEMTN="?" line in the >>>>>>> coaster-service.conf file I see the workers connect and the cps >>>>>>> log file shows there ip addresses. However when setting this >>>>>>> line it seems they are not connecting. >>>>>>> >>>>>>> Emalayan thought there might be some sort of circular dependency >>>>>>> going with the host-list file and the worker. The worker >>>>>>> requires the host-list file so that it can run the >>>>>>> worker-init.pl script and then connect but the host-list file >>>>>>> cannot be generated because the workers cannot connect. I >>>>>>> noticed in your swift-test directory the cps files did have the >>>>>>> ip addresses set and coasters-hosts.pl found the ip addresses >>>>>>> and reported them. Did you try that test with setting the >>>>>>> WORKER_ENVIRONMENT variable in the coaster-service.conf file? >>>>>>> Any idea what may be happening? The job is running when looking >>>>>>> under cqstat. >>>>>>> >>>>>>> A side note: At the mosaswift site, your example talks about >>>>>>> running the coasters-hosts.pl on the cps log but the example you >>>>>>> provide runs it on logs/coasters.log. This may need to be >>>>>>> changed. Also, should provide the log4j setting that is required >>>>>>> to generate the Cpu line with the worker ip address just to >>>>>>> clarify that this line should be set for this script to work. 
>>>>>>>
>>>>>>> For reference, this line:
>>>>>>> log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG
>>>>>>
>>>>>> --
>>>>>> Justin M Wozniak
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
You received this message because you are subscribed to the Google Groups "MosaStore" group.
To post to this group, send email to mosastore at googlegroups.com.
To unsubscribe from this group, send email to mosastore+unsubscribe at googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mosastore?hl=en.
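The thread above describes the revised coasters-hosts.pl as reading the worker logs and picking up the "Running on node <ip>" line instead of the "Cpu worker started" line in the cps log. The actual script is Perl and is not reproduced in this thread; what follows is only a minimal Python sketch of the same idea, assuming the worker-log format matches the single sample line quoted above. The names extract_worker_hosts and read_lines are hypothetical.

```python
import re

# Matches the worker-log line quoted in the thread, e.g.
#   2012/03/02 17:16:04.712 INFO - Running on node 172.18.1.83
# The format is an assumption based on that one sample.
NODE_LINE = re.compile(r"Running on node (\d{1,3}(?:\.\d{1,3}){3})\b")


def extract_worker_hosts(lines):
    """Return the unique worker IP addresses, in first-seen order."""
    hosts = []
    seen = set()
    for line in lines:
        match = NODE_LINE.search(line)
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            hosts.append(match.group(1))
    return hosts


def read_lines(paths):
    """Yield every line from each log file in turn."""
    for path in paths:
        with open(path, errors="replace") as handle:
            yield from handle


# Hypothetical usage, mirroring
#   ./coasters-hosts.pl logs/worker-*.log > worker-hosts.txt
#
#   import glob
#   for host in extract_worker_hosts(read_lines(sorted(glob.glob("logs/worker-*.log")))):
#       print(host)
```

As the thread notes, stale worker logs from an earlier coaster service would still be picked up by a glob like this, so the logs directory has to be cleaned before each new run.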
From jonmon at mcs.anl.gov Mon Mar 5 17:07:08 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 5 Mar 2012 17:07:08 -0600
Subject: [Swift-devel] Configuring Swift to access MosaStore
In-Reply-To: <1330976074.2431.YahooMailNeo@web39502.mail.mud.yahoo.com>
References: <776537591.60555.1330957062994.JavaMail.root@zimbra.anl.gov>
 <1330976074.2431.YahooMailNeo@web39502.mail.mud.yahoo.com>
Message-ID: <80C56715-3B53-49DD-B9FB-A1FF08FCB9FF@mcs.anl.gov>

If you could provide the setup you were using, that would be great. I can fill in anything missing and do my tests to verify.

On Mar 5, 2012, at 13:34, Emalayan Vairavanathan wrote:

> Thank you Jon.
>
> Yesterday I successfully ran Mosa (on our cluster) with cdm-direct mode with the help of swift-user manual and the scripts available in /cog/modules/swift/tests/cdm/absolute.
>
> It would be useful if you can develop a simple test case. I can double check with my test case.
>
> Thank you
> Emalayan
>
> From: Jonathan Monette
> To: Michael Wilde
> Cc: "emalayan at ece.ubc.ca" ; "matei at ece.ubc.ca" ; "swift-devel at ci.uchicago.edu" ; "mosastore at googlegroups.com" ; Jonathan Monette
> Sent: Monday, 5 March 2012 7:14 AM
> Subject: Re: [Swift-devel] Configuring Swift to access MosaStore
>
> Yea. I will get demo scripts together for the mosa tests.
>
> On Mar 5, 2012, at 8:17, Michael Wilde wrote:
>
> > was: Re: [Swift-devel] coasters-hosts.pl script
> >
> > Jon, can you create a demo script that shows how to configure a Swift run to use MosaStore. The following approach may work:
> >
> > - Assume MosaStore will be mounted as /mosa to all workers
> >
> > - Simulate this with a localhost run, using /tmp/mosa, then do same with *1* worker, N jobs per node (eg 4 on BG/P, 8 on PADS, 2 on Beagle).
> >
> > - Set CDM direct mode for all paths starting with [/tmp]/mosa. You might need to work through some of the issues with CDM direct where accesses need to match both /tmp/mosa and file:///tmp/mosa (I *think*)
> >
> > - Map some temporary output-to-input files to /tmp/mosa; create a multi-level "catsncats"-like workflow to exercise it; the recent ParameterSweep example, perhaps extended to do N levels of fan-in/fan-out and pass-N might be a good test.
> >
> > - see if you can get _concurrent to get placed on /tmp/mosa
> >
> > I think some of these tests would be a great test case for Swift/Turbine as well.
> >
> > You can do this in stages; the simple test of mapping CDM-direct files to /tmp/mosa should give Emalayan an initial test case to run once Mosa is ready on the BG/P.
> >
> > - Mike
> >
> >
> > ----- Original Message -----
> >> From: "Matei Ripeanu"
> >> To: mosastore at googlegroups.com, "Jonathan Monette" , "Justin M Wozniak"
> >> Cc: swift-devel at ci.uchicago.edu, emalayan at ece.ubc.ca
> >> Sent: Friday, March 2, 2012 6:29:17 PM
> >> Subject: Re: [Swift-devel] coasters-hosts.pl script
> >> Indeed this is good news! Thank you.
> >>
> >> Our next task, I think, will be to figure out how to configure Swift
> >> so that the headnode (where Swift runs) will not require any access to
> >> intermediate storage (MosaStore). Only the worker nodes will have
> >> access to intermediate storage. This is to go around the one way
> >> headnode-worker node connectivity issue.
> >>
> >> Any guidance on how to get this configuration would be much
> >> appreciated.
> >>
> >> Thank you again,
> >>
> >> -Matei
-------------- next part --------------
An HTML attachment was scrubbed...
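Mike's quoted note above suggests, tentatively, that a CDM direct rule for MosaStore may need to match a file name both as /tmp/mosa/... and as file:///tmp/mosa/.... As an illustration only, and assuming the name patterns behave like ordinary regular expressions (the thread does not confirm this), a single pattern can cover both forms. MOSA_PREFIX and is_under_mosa are hypothetical names and are not part of Swift.

```python
import re

# Accept the plain path form and the file:// URL form named in the thread.
# The trailing (?:/|$) keeps look-alike directories such as /tmp/mosaic out.
MOSA_PREFIX = re.compile(r"^(?:file://)?/tmp/mosa(?:/|$)")


def is_under_mosa(name):
    """Return True if name refers to /tmp/mosa or something beneath it."""
    return MOSA_PREFIX.match(name) is not None


for candidate in (
    "/tmp/mosa/stage1/out.0001",
    "file:///tmp/mosa/stage1/out.0001",
    "/tmp/mosaic/out.0001",
    "/home/user/tmp/mosa/out.0001",
):
    print(candidate, is_under_mosa(candidate))
```

Anchoring the pattern at the start of the name is a deliberate choice here, so that only the intended mount point is treated as direct; whether CDM anchors its patterns the same way would need to be checked against the CDM documentation.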
From svemalayan at yahoo.com Mon Mar 5 17:25:05 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Mon, 5 Mar 2012 15:25:05 -0800 (PST)
Subject: [Swift-devel] Configuring Swift to access MosaStore
In-Reply-To: <80C56715-3B53-49DD-B9FB-A1FF08FCB9FF@mcs.anl.gov>
References: <776537591.60555.1330957062994.JavaMail.root@zimbra.anl.gov>
 <1330976074.2431.YahooMailNeo@web39502.mail.mud.yahoo.com>
 <80C56715-3B53-49DD-B9FB-A1FF08FCB9FF@mcs.anl.gov>
Message-ID: <1330989905.48792.YahooMailNeo@web39503.mail.mud.yahoo.com>

Please find the attached setup.

Thank you
Emalayan

--
You received this message because you are subscribed to the Google Groups "MosaStore" group.
To post to this group, send email to mosastore at googlegroups.com.
To unsubscribe from this group, send email to mosastore+unsubscribe at googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mosastore?hl=en.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: catsn.tar
Type: application/x-tar
Size: 20480 bytes
Desc: not available

From wilde at mcs.anl.gov Mon Mar 5 18:43:12 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 5 Mar 2012 18:43:12 -0600 (CST)
Subject: [Swift-devel] Need CDM Direct documentation - Re: Configuring Swift to access MosaStore
In-Reply-To: <1330989905.48792.YahooMailNeo@web39503.mail.mud.yahoo.com>
Message-ID: <1006061317.64594.1330994592375.JavaMail.root@zimbra.anl.gov>

Hi Ketan, Justin, or anyone else who has tried this recently:

Could you point us to the documentation that you wrote on how to use CDM direct to obtain simple access to literal, untranslated, full path names?

As I recall there were some subtleties on how to specify the name patterns, including matching both swiftwrap-observed names and file:// names in vdl-int.k. Or is it simpler than I recall?

I have checked out the CDM "absolute" test that Emalayan mentioned below. Does that do a complete test of references to absolute names? And for only names below, say, /tmp/mosa?

- Mike

----- Original Message -----
> From: "Emalayan Vairavanathan"
> To: "Jonathan Monette"
> Cc: "Michael Wilde" , emalayan at ece.ubc.ca, matei at ece.ubc.ca, swift-devel at ci.uchicago.edu, "Jonathan Monette" , "MosaStore"
> Sent: Monday, March 5, 2012 5:25:05 PM
> Subject: Re: [Swift-devel] Configuring Swift to access MosaStore
> Please find the attached setup.
> > > Thank you > Emalayan > > > > > > From: Jonathan Monette > To: Emalayan Vairavanathan > Cc: Michael Wilde ; "emalayan at ece.ubc.ca" > ; "matei at ece.ubc.ca" ; > "swift-devel at ci.uchicago.edu" ; > "mosastore at googlegroups.com" ; Jonathan > Monette > Sent: Monday, 5 March 2012 3:07 PM > Subject: Re: [Swift-devel] Configuring Swift to access MosaStore > > > > > If you could provide the set up you were using that would be great. I > can fill in anything missing an do my tests to verify. > > On Mar 5, 2012, at 13:34, Emalayan Vairavanathan < > svemalayan at yahoo.com > wrote: > > > > > > > > Thank you Jon. > > > > Yesterday I successfully run Mosa (on our cluster) with cdm-direct > mode with the help of swift-user manual and the scripts available in > /cog/modules/swift/tests/cdm/absolute . > > > It would be useful if you can develop a simple test case. I can double > check with my test case. > > > Thank you > Emalayan > > > > > > > From: Jonathan Monette < jonmon at mcs.anl.gov > > To: Michael Wilde < wilde at mcs.anl.gov > > Cc: " emalayan at ece.ubc.ca " < emalayan at ece.ubc.ca >; " > matei at ece.ubc.ca " < matei at ece.ubc.ca >; " swift-devel at ci.uchicago.edu > " < swift-devel at ci.uchicago.edu >; " mosastore at googlegroups.com " < > mosastore at googlegroups.com >; Jonathan Monette < jon.monette at gmail.com > > > Sent: Monday, 5 March 2012 7:14 AM > Subject: Re: [Swift-devel] Configuring Swift to access MosaStore > > Yea. I will get demo scripts together for the mosa tests. > > On Mar 5, 2012, at 8:17, Michael Wilde < wilde at mcs.anl.gov > wrote: > > > was: Re: [Swift-devel] coasters-hosts.pl script > > > > Jon, can you create a demo script that shows how to configure a > > Swift run to use MosaStore. 
The following approach may work: > > > > - Assume MosaStore will be mounted as /mosa to all workers > > > > - Simulate this with a localhost run, using /tmp/mosa, then do same > > with *1* worker, N jobs per node (eg 4 on BG/P, 8 on PADS, 2 on > > Beagle). > > > > - Set CDM direct mode for all paths starting with [/tmp]/mosa. You > > might need to work through some of the issues with CDM direct where > > accesses need to match both /tmp/mosa and file:///tmp/mosa (I > > *think*) > > > > - Map some temporary output-to-input files to /tmp/mosa; create a > > multi-level "catsncats"-like workflow to exercise it; the recent > > ParameterSweep example, perhaps extended to do N levels of > > fan-in/fan-out and pass-N might be a good test. > > > > - see if you can get _concurrent to get placed on /tmp/mosa > > > > I think some of these tests would be a great test case for > > Swift/Turbine as well. > > > > You can do this is stages; the simple test of mapping CDM-direct > > files to /tmp/mosa should give Emalayan an initial test case to run > > once Mosa is ready on the BG/P. > > > > - Mike > > > > > > ----- Original Message ----- > >> From: "Matei Ripeanu" < matei.ripeanu at gmail.com > > >> To: mosastore at googlegroups.com , "Jonathan Monette" < > >> jonmon at mcs.anl.gov >, "Justin M Wozniak" < wozniak at mcs.anl.gov > > >> Cc: swift-devel at ci.uchicago.edu , emalayan at ece.ubc.ca > >> Sent: Friday, March 2, 2012 6:29:17 PM > >> Subject: Re: [Swift-devel] coasters-hosts.pl script > >> Indeed this is good news! Thank you. > >> > >> > >> > >> Our next task, I think, will be to figure out how to configure > >> Swift > >> so that the headnode (where Swift runs) will not require any access > >> to > >> intermediate storage (MosaStore). Only the worker nodes will have > >> access to intermediate storage. This is to go around the one way > >> headnode-worker node connectivity issue. 
> >> > >> > >> > >> Any guidance on how to get this configuration would be much > >> appreciated. > >> > >> > >> > >> Thank you again, > >> > >> > >> > >> -Matei > >> > >> > >> > >> > >> > >> From: mosastore at googlegroups.com [mailto: > >> mosastore at googlegroups.com ] > >> On Behalf Of Emalayan Vairavanathan > >> Sent: March-02-12 2:32 PM > >> To: Jonathan Monette; Justin M Wozniak > >> Cc: swift-devel at ci.uchicago.edu Devel; emalayan at ece.ubc.cais ; > >> MosaStore > >> Subject: Re: [Swift-devel] coasters-hosts.pl script > >> > >> > >> > >> > >> > >> Thank you Jon and Justin. > >> > >> > >> > >> > >> > >> This is a great news. I will get back to you if I have questions. > >> > >> > >> > >> > >> > >> Regards > >> > >> > >> Emalayan > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> From: Jonathan Monette < jonmon at mcs.anl.gov > > >> To: Justin M Wozniak < wozniak at mcs.anl.gov > > >> Cc: " swift-devel at ci.uchicago.edu Devel " < > >> swift-devel at ci.uchicago.edu >; emalayan at ece.ubc.ca > >> Sent: Friday, 2 March 2012 2:21 PM > >> Subject: Re: [Swift-devel] coasters-hosts.pl script > >> > >> > >> Emalayan, > >> We believe we have fixed the issue. You can copy the new > >> coasters-hosts.pl script from > >> ~jonmon/surveyor/worker-init-test/coasters-hosts.pl > >> > >> This script reads the worker logs located in the logs directory. > >> The > >> steps to run are as follows: > >> start-coaster-service > >> > >> ./coasters-hosts.pl logs/worker-*.log > worker-hosts.txt > >> > >> You MUST clean out the worker logs after you before you start a new > >> coaster service to make sure the script searches the right worker > >> log > >> files. This may not be ideal at the moment but this will help get > >> you > >> started. If you have any other questions feel free to ask. We will > >> need to update the mosaswift site with the new information, we will > >> do > >> this soon. 
> >>
> >> On Mar 2, 2012, at 11:26 AM, Jonathan Monette wrote:
> >>
> >>> Can we match this line: 2012/03/02 17:16:04.712 INFO - Running on
> >>> node 172.18.1.83 from the worker log,
> >>> instead of this line: 2012-03-02 17:21:25,214+0000 DEBUG Cpu
> >>> worker started: block=2012.0302.171344.704 host=172.18.1.83 id=0
> >>> from the cps log?
> >>>
> >>> They both provide the same ip addresses. And the worker log always
> >>> has that ip address before the cps log does.
> >>>
> >>> On Mar 2, 2012, at 11:15 AM, Jonathan Monette wrote:
> >>>
> >>>> That fix still did not work. I had moved it to the same spot. It
> >>>> is still waiting for the worker-init.pl script to finish before
> >>>> the ip addresses are printed to the cps log. Those ip addresses
> >>>> are what is needed by the coaster-hosts.pl script to finish. If I
> >>>> create an empty file for the coaster-host.pl script to read, then
> >>>> the work continues and the ip addresses show up in the cps log.
> >>>>
> >>>> Why is log4j waiting to add those lines to the cps log after the
> >>>> worker-init.pl script is finished?
> >>>>
> >>>> On Mar 2, 2012, at 11:05 AM, Jonathan Monette wrote:
> >>>>
> >>>>> Thanks, in my copy I thought I had moved the reconnect to before
> >>>>> the init-cmd and it still wasn't working. I will test with your
> >>>>> change. I just verified that it was indeed waiting for the
> >>>>> worker-init.pl script to finish. I created an empty file for the
> >>>>> script to read and it finished connecting and the ip addresses I
> >>>>> needed were added to the cps log. I will also be testing your
> >>>>> fix.
> >>>>>
> >>>>> On Mar 2, 2012, at 11:01 AM, Justin M Wozniak wrote:
> >>>>>
> >>>>>>
> >>>>>> Yes- I must have tested this with a different log file. I just
> >>>>>> checked in and installed in ~wozniak/Public a fix for this that
> >>>>>> launches WORKER_INIT_CMD after the reconnect(). I am a little
> >>>>>> worried about timeouts but it works so far. I will continue
> >>>>>> testing...
> >>>>>> Justin
> >>>>>>
> >>>>>> On Thu, 1 Mar 2012, Jonathan Monette wrote:
> >>>>>>
> >>>>>>> Justin,
> >>>>>>> So I have been trying to help Emalayan get the host list file
> >>>>>>> for the worker-init.pl script. It seems the cps log file is
> >>>>>>> not providing the ip addresses for the coasters-hosts.pl
> >>>>>>> script. I thought this was maybe because we did not have the
> >>>>>>> correct log4j setting set but we have the Coaster service Cpu
> >>>>>>> set to DEBUG. So for some reason the workers are not
> >>>>>>> connecting to the service. When I comment out the export
> >>>>>>> WORKER_ENVIRONEMTN="?" line in the coaster-service.conf file I
> >>>>>>> see the workers connect and the cps log file shows their ip
> >>>>>>> addresses. However when setting this line it seems they are
> >>>>>>> not connecting.
> >>>>>>>
> >>>>>>> Emalayan thought there might be some sort of circular
> >>>>>>> dependency going with the host-list file and the worker. The
> >>>>>>> worker requires the host-list file so that it can run the
> >>>>>>> worker-init.pl script and then connect but the host-list file
> >>>>>>> cannot be generated because the workers cannot connect. I
> >>>>>>> noticed in your swift-test directory the cps files did have
> >>>>>>> the ip addresses set and coasters-hosts.pl found the ip
> >>>>>>> addresses and reported them. Did you try that test with
> >>>>>>> setting the WORKER_ENVIRONMENT variable in the
> >>>>>>> coaster-service.conf file? Any idea what may be happening?
> >>>>>>> The job is running when looking under cqstat.
> >>>>>>>
> >>>>>>> A side note: At the mosaswift site, your example talks about
> >>>>>>> running the coasters-hosts.pl on the cps log but the example
> >>>>>>> you provide runs it on logs/coasters.log. This may need to be
> >>>>>>> changed. Also, it should provide the log4j setting that is
> >>>>>>> required to generate the Cpu line with the worker ip address
> >>>>>>> just to clarify that this line should be set for this script
> >>>>>>> to work.
> >>>>>>>
> >>>>>>> For reference, this line:
> >>>>>>> log4j.logger.org.globus.cog.abstraction.coaster.service.job.manager.Cpu=DEBUG
> >>>>>>
> >>>>>> --
> >>>>>> Justin M Wozniak
> >>>>>
> >>>>> _______________________________________________
> >>>>> Swift-devel mailing list
> >>>>> Swift-devel at ci.uchicago.edu
> >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >>>>
> >>>> _______________________________________________
> >>>> Swift-devel mailing list
> >>>> Swift-devel at ci.uchicago.edu
> >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >>>
> >>> _______________________________________________
> >>> Swift-devel mailing list
> >>> Swift-devel at ci.uchicago.edu
> >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >>
> >>
> >> --
> >> You received this message because you are subscribed to the Google
> >> Groups "MosaStore" group.
> >> To post to this group, send email to mosastore at googlegroups.com .
> >> To unsubscribe from this group, send email to
> >> mosastore+unsubscribe at googlegroups.com .
> >> For more options, visit this group at
> >> http://groups.google.com/group/mosastore?hl=en .
> >> _______________________________________________
> >> Swift-devel mailing list
> >> Swift-devel at ci.uchicago.edu
> >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "MosaStore" group.
> To post to this group, send email to mosastore at googlegroups.com.
> To unsubscribe from this group, send email to
> mosastore+unsubscribe at googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/mosastore?hl=en.

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From ketancmaheshwari at gmail.com Mon Mar 5 18:52:47 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 5 Mar 2012 18:52:47 -0600
Subject: [Swift-devel] Need CDM Direct documentation - Re: Configuring Swift to access MosaStore
In-Reply-To: <1006061317.64594.1330994592375.JavaMail.root@zimbra.anl.gov>
References: <1330989905.48792.YahooMailNeo@web39503.mail.mud.yahoo.com> <1006061317.64594.1330994592375.JavaMail.root@zimbra.anl.gov>
Message-ID:

Not really following this thread; here is my input:

I know that we (Justin and me) had to add a couple lines in the ~cdmlib.sh
in order to make it work correctly for the SCEC workflow. I have the
patched version somewhere in my dir, will dig it up.

In addition, I also observed that when the "* default " line is present in
the cdm files, it somehow did not work for me, so I had to remove it.

I did do some tests on combinations of absolute, relative, path in swift
script and the same in the cdm directory but do not recall results (nothing
unexpected happened).

I wrote some documentation on the CDM section, will dig up and get back.


On Mon, Mar 5, 2012 at 6:43 PM, Michael Wilde wrote:

> Hi Ketan, Justin, or anyone else who has tried this recently:
>
> Could you point us to the documentation that you wrote on how to use CDM
> direct to obtain simple access to literal, untranslated, full path names?
>
> As I recall there were some subtleties on how to specify the name
> patterns, including matching both swiftwrap-observed names and file://
> names in vdl-int.k.
>
> Or is it simpler than I recall?
>
> I have checked out the CDM "absolute" test that Emalayan mentioned below.
> Does that do a complete test of references to absolute names? And for only
> names below say /tmp/mosa?
>
> - Mike
>
>
>
> ----- Original Message -----
> > From: "Emalayan Vairavanathan"
> > To: "Jonathan Monette"
> > Cc: "Michael Wilde" , emalayan at ece.ubc.ca,
> > matei at ece.ubc.ca, swift-devel at ci.uchicago.edu, "Jonathan
> > Monette" , "MosaStore" < mosastore at googlegroups.com>
> > Sent: Monday, March 5, 2012 5:25:05 PM
> > Subject: Re: [Swift-devel] Configuring Swift to access MosaStore
> > Please find the attached setup.
> >
> >
> > Thank you
> > Emalayan
> >
> >
> > From: Jonathan Monette
> > To: Emalayan Vairavanathan
> > Cc: Michael Wilde ; "emalayan at ece.ubc.ca"
> > ; "matei at ece.ubc.ca" ;
> > "swift-devel at ci.uchicago.edu" ;
> > "mosastore at googlegroups.com" ; Jonathan
> > Monette
> > Sent: Monday, 5 March 2012 3:07 PM
> > Subject: Re: [Swift-devel] Configuring Swift to access MosaStore
> >
> >
> > If you could provide the set up you were using that would be great. I
> > can fill in anything missing and do my tests to verify.
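
On the pattern question raised above (one rule having to match both the plain path and its file:// form), here is a small Python sketch of a single regular expression that accepts both spellings. It only illustrates the matching itself; whether CDM applies its patterns this way is not confirmed anywhere in this thread, and the name is_mosa_path and the optional-host handling are illustrative assumptions.

```python
import re

# Accepts /tmp/mosa[/...] with or without a leading file:// scheme and
# an optional host part (file:///tmp/mosa, file://localhost/tmp/mosa).
# Assumption: the prefix must start the name, so /tmp/mosaic or
# /home/x/tmp/mosa are rejected.
MOSA_PATH = re.compile(r"^(?:file://[^/]*)?/tmp/mosa(?:/.*)?$")


def is_mosa_path(name):
    """True if `name` refers to something under /tmp/mosa
    (hypothetical helper, not part of Swift or CDM)."""
    return MOSA_PATH.match(name) is not None
```

Checking a candidate pattern against both spellings of a few real file names before putting it into a policy file is a cheap way to rule out the mismatch described above.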
> >
> > On Mar 5, 2012, at 13:34, Emalayan Vairavanathan <
> > svemalayan at yahoo.com > wrote:
> >
> >
> > Thank you Jon.
> >
> >
> > Yesterday I successfully ran Mosa (on our cluster) with cdm-direct
> > mode with the help of swift-user manual and the scripts available in
> > /cog/modules/swift/tests/cdm/absolute .
> >
> >
> > It would be useful if you can develop a simple test case. I can double
> > check with my test case.
> >
> >
> > Thank you
> > Emalayan
> >
> >
> > From: Jonathan Monette < jonmon at mcs.anl.gov >
> > To: Michael Wilde < wilde at mcs.anl.gov >
> > Cc: " emalayan at ece.ubc.ca " < emalayan at ece.ubc.ca >; "
> > matei at ece.ubc.ca " < matei at ece.ubc.ca >; " swift-devel at ci.uchicago.edu
> > " < swift-devel at ci.uchicago.edu >; " mosastore at googlegroups.com " <
> > mosastore at googlegroups.com >; Jonathan Monette < jon.monette at gmail.com >
> > Sent: Monday, 5 March 2012 7:14 AM
> > Subject: Re: [Swift-devel] Configuring Swift to access MosaStore
> >
> > Yea. I will get demo scripts together for the mosa tests.
> >
> > On Mar 5, 2012, at 8:17, Michael Wilde < wilde at mcs.anl.gov > wrote:
> >
> > > was: Re: [Swift-devel] coasters-hosts.pl script
> > >
> > > Jon, can you create a demo script that shows how to configure a
> > > Swift run to use MosaStore. The following approach may work:
> > >
> > > - Assume MosaStore will be mounted as /mosa to all workers
> > >
> > > - Simulate this with a localhost run, using /tmp/mosa, then do same
> > > with *1* worker, N jobs per node (eg 4 on BG/P, 8 on PADS, 2 on
> > > Beagle).
> > >
> > > - Set CDM direct mode for all paths starting with [/tmp]/mosa.
> > >> _______________________________________________
> > >> Swift-devel mailing list
> > >> Swift-devel at ci.uchicago.edu
> > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > >
> > > --
> > > Michael Wilde
> > > Computation Institute, University of Chicago
> > > Mathematics and Computer Science Division
> > > Argonne National Laboratory
> > >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "MosaStore" group.
> > To post to this group, send email to mosastore at googlegroups.com.
> > To unsubscribe from this group, send email to
> > mosastore+unsubscribe at googlegroups.com.
> > For more options, visit this group at
> > http://groups.google.com/group/mosastore?hl=en.
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>

--
Ketan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wozniak at mcs.anl.gov Mon Mar 5 19:20:53 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 5 Mar 2012 19:20:53 -0600 (Central Standard Time)
Subject: [Swift-devel] Need CDM Direct documentation - Re: Configuring Swift to access MosaStore
In-Reply-To:
References: <1330989905.48792.YahooMailNeo@web39503.mail.mud.yahoo.com> <1006061317.64594.1330994592375.JavaMail.root@zimbra.anl.gov>
Message-ID:

I think Ketan's case had to do with absolute path names. That is covered
in the user guide:

http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases

I will take a look at Emalayan's case...
	Justin

On Mon, 5 Mar 2012, Ketan Maheshwari wrote:

> Not really following this thread; here is my input:
>
> I know that we (Justin and me) had to add a couple lines in the ~cdmlib.sh
> in order to make it work correctly for the SCEC workflow. I have the
> patched version somewhere in my dir, will dig it up.
>
> In addition, I also observed that when the "* default " line is present in
> the cdm files, it somehow did not work for me, so I had to remove it.
>
> I did do some tests on combinations of absolute, relative, path in swift
> script and the same in the cdm directory but do not recall results (nothing
> unexpected happened).
>
> I wrote some documentation on the CDM section, will dig up and get back.
>
>
> On Mon, Mar 5, 2012 at 6:43 PM, Michael Wilde wrote:
>
>> Hi Ketan, Justin, or anyone else who has tried this recently:
>>
>> Could you point us to the documentation that you wrote on how to use CDM
>> direct to obtain simple access to literal, untranslated, full path names?
>>
>> As I recall there were some subtleties on how to specify the name
>> patterns, including matching both swiftwrap-observed names and file://
>> names in vdl-int.k.
>>
>> Or is it simpler than I recall?
>>
>> I have checked out the CDM "absolute" test that Emalayan mentioned below.
>> Does that do a complete test of references to absolute names? And for only
>> names below say /tmp/mosa?
>>
>> - Mike
>>
>>
>>
>> ----- Original Message -----
>>> From: "Emalayan Vairavanathan"
>>> To: "Jonathan Monette"
>>> Cc: "Michael Wilde" , emalayan at ece.ubc.ca,
>>> matei at ece.ubc.ca, swift-devel at ci.uchicago.edu, "Jonathan
>>> Monette" , "MosaStore" < mosastore at googlegroups.com>
>>> Sent: Monday, March 5, 2012 5:25:05 PM
>>> Subject: Re: [Swift-devel] Configuring Swift to access MosaStore
>>> Please find the attached setup.
-- 
Justin M Wozniak

From svemalayan at yahoo.com Tue Mar 6 14:22:43 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Tue, 6 Mar 2012 12:22:43 -0800 (PST)
Subject: [Swift-devel] Torus rank
Message-ID: <1331065363.62398.YahooMailNeo@web39505.mail.mud.yahoo.com>

Hi Kaz,

I found the following about torus rank at
http://wiki.mcs.anl.gov/zeptoos/index.php/FAQ and want to double check
with you whether this information is still valid.

Torus rank:-

    A torus rank is a number identifying a compute node within a
    whole partition. In a way, it is much "nicer" than a pset rank
    since it is unique within a job and it also starts from 0.
- A shell script can easily calculate it from other fields:

TORUS_RANK=`echo $BG_PSETORG | awk "{print \\$1 + \\$2 * $BG_XSIZE + \
            \\$3 * $BG_XSIZE * $BG_YSIZE}"`

It would be great if you can make sure that the torus rank is unique
within a partition and always starts from zero regardless of the
following.

 - Partition size (fraction of pset / multiple of psets / both)
 - Platform (Surveyor / Intrepid)
 - Kernel profiles.

(Note: I am asking this in order to figure out an easy way to deploy
MosaStore at BG/P at large scale. I did a few tests with different
partition sizes in Surveyor and it seems the torus rank is unique and
always starts from zero.)

Thank you
Emalayan

From zhaozhang at uchicago.edu Tue Mar 6 14:41:45 2012
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 06 Mar 2012 14:41:45 -0600
Subject: [Swift-devel] Torus rank
In-Reply-To: <1331065363.62398.YahooMailNeo@web39505.mail.mud.yahoo.com>
References: <1331065363.62398.YahooMailNeo@web39505.mail.mud.yahoo.com>
Message-ID: <4F567689.8000700@uchicago.edu>

Hi, Emalayan

Emalayan Vairavanathan wrote:
> Hi Kaz,
>
> I found the following about torus rank at
> http://wiki.mcs.anl.gov/zeptoos/index.php/FAQ and want to double check
> with you whether this information is still valid.
>
> Torus rank:-
>
>     A torus rank is a number identifying a compute node within a
>     whole partition. In a way, it is much "nicer" than a pset rank
>     since it is unique within a job and it also starts from 0.
>
> - A shell script can easily calculate it from other fields:
> TORUS_RANK=`echo $BG_PSETORG | awk "{print \\$1 + \\$2 * $BG_XSIZE + \
>             \\$3 * $BG_XSIZE * $BG_YSIZE}"`
>
> It would be great if you can make sure that the torus rank is unique
> within a partition and always starts from zero regardless of the
> following.
>
> - Partition size (fraction of pset / multiple of psets / both)
> - Platform (Surveyor / Intrepid)
> - Kernel profiles.
>
yes, this is true regardless of partition size, platform, and ZeptoOS
kernel profiles (zeptoos, zepto-vn-eval, and the one you use for mosa).

best
zhao
>
> (Note: I am asking this in order to figure out an easy way to deploy
> MosaStore at BG/P at large scale. I did a few tests with different
> partition sizes in Surveyor and it seems the torus rank is unique and
> always starts from zero.)
>
> Thank you
> Emalayan
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>

From svemalayan at yahoo.com Tue Mar 6 15:22:20 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Tue, 6 Mar 2012 13:22:20 -0800 (PST)
Subject: [Swift-devel] Torus rank
In-Reply-To: <4F567689.8000700@uchicago.edu>
References: <1331065363.62398.YahooMailNeo@web39505.mail.mud.yahoo.com>
 <4F567689.8000700@uchicago.edu>
Message-ID: <1331068940.42466.YahooMailNeo@web39506.mail.mud.yahoo.com>

Great. Thank you Zhao.

________________________________
From: Zhao Zhang
To: Emalayan Vairavanathan
Cc: Kazutomo Yoshii; "swift-devel at ci.uchicago.edu"; MosaStore
Sent: Tuesday, 6 March 2012 12:41 PM
Subject: Re: [Swift-devel] Torus rank

Hi, Emalayan

Emalayan Vairavanathan wrote:
> Hi Kaz,
>
> I found the following about torus rank at
> http://wiki.mcs.anl.gov/zeptoos/index.php/FAQ and want to double check
> with you whether this information is still valid.
>
>     Torus rank:-
>
>     A torus rank is a number identifying a compute node within a
>     whole partition. In a way, it is much "nicer" than a pset rank
>     since it is unique within a job and it also starts from 0.
>
> - A shell script can easily calculate it from other fields:
> TORUS_RANK=`echo $BG_PSETORG | awk "{print \\$1 + \\$2 * $BG_XSIZE + \
>            
\\$3 * $BG_XSIZE * $BG_YSIZE}"`
>
> It would be great if you can make sure that the torus rank is unique
> within a partition and always starts from zero regardless of the
> following.
>
>  - Partition size (fraction of pset / multiple of psets / both)
>  - Platform (Surveyor / Intrepid)
>  - Kernel profiles.
>
yes, this is true regardless of partition size, platform, and ZeptoOS
kernel profiles (zeptoos, zepto-vn-eval, and the one you use for mosa).

best
zhao
>
> (Note: I am asking this in order to figure out an easy way to deploy
> MosaStore at BG/P at large scale. I did a few tests with different
> partition sizes in Surveyor and it seems the torus rank is unique and
> always starts from zero.)
>
> Thank you
> Emalayan
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>

From iraicu at cs.iit.edu Wed Mar 7 15:15:31 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Wed, 07 Mar 2012 15:15:31 -0600
Subject: [Swift-devel] CFP: The 9th Int. Conf. on Autonomic Computing (ICAC)
 2012 -- deadline extension to March 16th, 2012
Message-ID: <4F57CFF3.9060107@cs.iit.edu>

CALL FOR PAPERS

The 9th International Conference on Autonomic Computing (ICAC 2012)
September 17-21, 2012.
San Jose, CA, USA
http://icac2012.cs.fiu.edu/

-----------------------------------------------------------------
IMPORTANT DATES

Paper and Poster Submission: March 16, 2012, 11:59pm PST (EXTENDED)
Notification: May 18, 2012
Camera-ready Due: June 8, 2012
-----------------------------------------------------------------

OVERVIEW

ICAC is the leading conference on autonomic computing techniques,
foundations, and applications. Autonomic computing refers to methods
and means for automated management of performance, fault, security,
and configuration with little involvement of users or administrators.
Systems introducing new autonomic features are becoming increasingly
prevalent, motivating research that spans a variety of areas, from
computer systems, networking, software engineering, and data
management to machine learning, control theory, and bio-inspired
computing. ICAC brings together researchers and practitioners across
these disciplines to address multiple facets of adaptation and
self-management in computing systems and applications from different
perspectives.

Autonomic computing solutions are sought for clouds, grids, data
centers, enterprise software, internet services, data services, smart
phones, embedded systems, and sensor networks. In these environments,
resources and applications must be managed to maximize performance and
minimize cost, while maintaining predictable and reliable behavior in
the face of varying workloads, failures, and malicious threats.

Papers are solicited from all areas of autonomic computing, including
(but not limited to):

* End-to-end techniques for management of resources, workloads,
  performance, faults, power/cooling, security, and others.
* Self-managing components, such as server, storage, network
  protocols, or specific application elements, and embedded and mobile
  end systems such as smart phones.
* Decision and analysis techniques and their use, such as machine
  learning, control theory, predictive methods, probability and
  stochastic processes, queuing theory methodologies, emergent
  behavior, rule-based systems, and bio-inspired techniques.
* Monitoring systems for autonomic computing.
* Hypervisor, operating systems, hardware, or application support for
  autonomic computing.
* Novel human interfaces for monitoring and controlling autonomic
  systems.
* Management topics, such as specification and modeling of
  service-level agreements, behavior enforcement and tie-in with IT
  governance.
* Toolkits, frameworks, principles and architectures, from software
  engineering practices and experimental methodologies to agent-based
  techniques and virtualization.
* Fundamental science and theory of self-managing systems:
  understanding, controlling or exploiting system behaviors to enforce
  autonomic properties.
* Applications of autonomic computing and experiences with prototyped
  or deployed systems solving real-world problems in science,
  engineering, business and society.

Papers will be judged on originality, significance, interest,
correctness, clarity and relevance to the broader community. Papers
should report on experiences, measurements, user studies, or other
evaluations, as appropriate. Evaluations of a prototype or large-scale
deployment of systems and applications are expected.

PAPER AND POSTER SUBMISSIONS

Full papers (a maximum of 10 pages in the two-column ACM proceedings
format) and posters (2 pages) are invited on a wide variety of topics
relating to autonomic computing. Submitted papers must be original
work, and may not be under consideration for another conference or
journal. Complete formatting and submission instructions can be found
on the conference web site. Accepted papers and posters will appear in
proceedings distributed at the conference and available electronically.
Relevant top ICAC'12 papers will be invited for "fast-track"
submissions to the ACM Transactions on Autonomous and Adaptive Systems
(TAAS).

INDUSTRY SESSION

One of ICAC's important roles is to bring together researchers and
practitioners from academia and industry. In its industry session,
ICAC helps fulfill this role by presenting an industry viewpoint on
technologies, products, and market needs. The industry session also
addresses current challenges, and opportunities for academic and
corporate research collaborations. We encourage industry leaders,
including entrepreneurs, product developers, architects, managers,
marketers and end users, to submit their papers and posters reflecting
such industry perspectives as part of the regular submission process.

------------------------------------------------------------------
ORGANIZERS

GENERAL CHAIR
Dejan Milojicic, HP Labs

PROGRAM CHAIRS
Dongyan Xu, Purdue University
Vanish Talwar, HP Labs

INDUSTRY CHAIR
Xiaoyun Zhu, VMware

WORKSHOPS CHAIR
Fred Douglis, EMC

POSTERS/DEMO/EXHIBITS CHAIR
Eno Thereska, Microsoft Research

FINANCE CHAIR
Michael Kozuch, Intel

LOCAL ARRANGEMENT CHAIR
Jessica Blaine

PUBLICITY CHAIRS
Daniel Batista, University of São Paulo
Vartan Padaryan, ISP/Russian Academy of Sci.
Ioan Raicu, Illinois Inst. of Technology
Jianfeng Zhan, ICT/Chinese Academy of Sci.
Ming Zhao, Florida Intl. University

PROGRAM COMMITTEE
Tarek Abdelzaher, UIUC
Umesh Bellur, IIT, Bombay
Ken Birman, Cornell University
Rajkumar Buyya, Univ.
of Melbourne
Rocky Chang, Hong Kong Polytechnic University
Yuan Chen, HP Labs
Alva Couch, Tufts University
Peter Dinda, Northwestern University
Fred Douglis, EMC
Renato Figueiredo, University of Florida
Mohamed Hefeeda, Qatar Computing Research Institute
Joe Hellerstein, Google
Geoff Jiang, NEC Labs
Jeff Kephart, IBM Research
Emre Kiciman, Microsoft Research
Fabio Kon, University of São Paulo
Michael Kozuch, Intel
Dejan Milojicic, HP Labs
Klara Nahrstedt, UIUC
Priya Narasimhan, CMU
Manish Parashar, Rutgers University
Ioan Raicu, Illinois Inst. of Technology
Omer Rana, Cardiff University
Masoud Sadjadi, Florida Intl. University
Rick Schlichting, AT&T Labs
Hartmut Schmeck, KIT
Karsten Schwan, Georgia Tech
Onn Shehory, IBM Research
Eno Thereska, Microsoft Research
Xiaoyun Zhu, VMware

-- 
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================

From svemalayan at yahoo.com Wed Mar 7 18:23:09 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Wed, 7 Mar 2012 16:23:09 -0800 (PST)
Subject: [Swift-devel] Torus rank
In-Reply-To: <4F578EF0.602@mcs.anl.gov>
References: <1331065363.62398.YahooMailNeo@web39505.mail.mud.yahoo.com>
 <4F567689.8000700@uchicago.edu>
 <1331068940.42466.YahooMailNeo@web39506.mail.mud.yahoo.com>
 <4F578EF0.602@mcs.anl.gov>
Message-ID: <1331166189.10268.YahooMailNeo@web39506.mail.mud.yahoo.com>

Kaz,

Thank you
very much for the detailed clarification.

Regards
Emalayan

________________________________
From: Kazutomo Yoshii
To: Emalayan Vairavanathan
Cc: "mosastore at googlegroups.com"; "swift-devel at ci.uchicago.edu"
Sent: Wednesday, 7 March 2012 8:38 AM
Subject: Re: [Swift-devel] Torus rank

Thanks! Zhao

This is 99.99% guaranteed. Let me explain why not 100%.

The control system loads the personality (e.g. torus coordinates) into
memory on each node via JTAG when it creates a new partition. No
operating system runs at that point. The boot loader passes the
location of the personality to the operating system (CNK or Linux).

NOTE: the personality information on a node is binary data in the BGP
personality struct; technically _BGP_Personality_t in
/bgsys/drivers/ppcfloor/arch/include/common/bgp_personality.h

Linux creates /proc/personality.sh by reading each member of
_BGP_Personality_t for convenience. CNK doesn't provide this.
If IBM changes _BGP_Personality_t, personality.sh can be corrupted
(this probably won't happen, but we can't guarantee it).
Linux also provides /proc/personality, which is a binary structure, so
you can cast it to _BGP_Personality_t, which is more reliable.

Due to other job pressure, I couldn't maintain all combinations.
However, I'm still updating zepto-vn-eval's compute node Linux for my
kernel work. As for the torus hardware rank, I would definitely notice
if that gets corrupted because the Linux kernel and BGP native MPI
depend heavily on the personality.

Thanks,
Kaz

On 03/06/2012 03:22 PM, Emalayan Vairavanathan wrote:
> Great. Thank you Zhao.
> > ------------------------------------------------------------------------ > *From:* Zhao Zhang > *To:* Emalayan Vairavanathan > *Cc:* Kazutomo Yoshii ; > "swift-devel at ci.uchicago.edu" ; MosaStore > > *Sent:* Tuesday, 6 March 2012 12:41 PM > *Subject:* Re: [Swift-devel] Torus rank > > Hi, Emalayan > > Emalayan Vairavanathan wrote: >> Hi Kaz, >> >> I found the following about tours rank from >> http://wiki.mcs.anl.gov/zeptoos/index.php/FAQ and want to double check >> with you still this information valid. >> >> >>? ? ? Torus rank:- >> >> >>? ? ? A torus rank is a number identifying a compute node within a >>? ? ? whole partition. In a way, it is much "nicer" than a pset rank >>? ? ? since it is unique within a job and it also starts from 0. >> >> - A shell script can easily calculate it from other fields: >> TORUS_RANK=`echo $BG_PSETORG | awk "{print \\$1 + \\$2 * $BG_XSIZE + \ >>? ? ? ? ? ? \\$3 * $BG_XSIZE * $BG_YSIZE}" >> >> It would be great if you can make sure that the tours rank is unique >> within a partition and always starts from zero regardless of the >> following. >> >>? - Partition size (fraction of pset / multiple of psets / both ) >>? - Platform (Surveyor / Intrepid) >>? - Kernel profiles. >> > yes, this is true regardless of partition size, platform, and ZeptoOS > kernel profiles(zeptoos, zepto-vn-eval, and the one you use for mosa). > > best > zhao >> >> (Note: I am asking this in order to figure out a easy way to deploy >> MosaStore at BG/P at large scale. I did few tests with different >> partition size in Surveyor and it seems the tours rank is unique and >> always starts from zero.) 
>>
>> Thank you
>> Emalayan
>>

From wilde at mcs.anl.gov Thu Mar 8 09:21:43 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 8 Mar 2012 09:21:43 -0600 (CST)
Subject: [Swift-devel] sites.xml for ranger sge coasters
In-Reply-To: <561683211.5757.1331217900880.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <594057048.75488.1331220103100.JavaMail.root@zimbra.anl.gov>

David, thanks for addressing this problem. Does it affect any of the
other local providers: pbs, condor, (sge), cobalt?

(I need to do some cobalt runs on Eureka for a user today, so I hope
that provider is OK).

You should describe the issue and fix on swift-devel.

We should start a convention where we can document known issues for
releases, so that users don't have to discover these bugs on their own.
Can you make an action item to propose and start such a place
(probably crosslinked to both Downloads and Documentation)? Not urgent
for today, but next week would be good.

Thanks,

- Mike

----- Original Message -----
> From: "David Kelly"
> To: "Ketan Maheshwari"
> Cc: "Michael Wilde"
> Sent: Thursday, March 8, 2012 8:45:00 AM
> Subject: Re: sites.xml for ranger sge coasters
> 0.93 is frozen, but I committed the same change to 0.93.1 this
> morning.
>
> ----- Original Message -----
> > From: "Ketan Maheshwari"
> > To: "David Kelly"
> > Cc: "Michael Wilde"
> > Sent: Wednesday, March 7, 2012 6:39:09 PM
> > Subject: Re: sites.xml for ranger sge coasters
> > is it committed in 0.93 too?
> >
> > On Wed, Mar 7, 2012 at 6:10 PM, David Kelly
> > < davidk at ci.uchicago.edu > wrote:
> >
> > I submitted a fix to trunk for the SGE provider. The submit script
> > was wrong - it started one worker per core, rather than one worker
> > per host. (Oddly it's been like that for years without anybody
> > noticing).
> > I ran a few sleep/hostname tests and it seems to be working. Can you
> > please give it a try?
> >
> > Below is the sites.xml I used for my test:
> >
> > 5
> > 600
> > 16
> > 1
> > 3
> > 16way
> > 3
> > development
> > 0.4799
> > 10000
> > TG-DBS080004N
> > /share/home/01503/davidkel/swiftwork
> >
> > Thanks,
> > David
> >
> > --
> > Ketan

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hockyg at uchicago.edu Thu Mar 8 09:34:20 2012
From: hockyg at uchicago.edu (Glen Hocky)
Date: Thu, 8 Mar 2012 10:34:20 -0500
Subject: [Swift-devel] sites.xml for ranger sge coasters
In-Reply-To: <594057048.75488.1331220103100.JavaMail.root@zimbra.anl.gov>
References: <561683211.5757.1331217900880.JavaMail.root@zimbra-mb2.anl.gov>
 <594057048.75488.1331220103100.JavaMail.root@zimbra.anl.gov>
Message-ID:

I noticed a problem like this when I was working with Ketan in July
(checked my email, it was 7/13 +/- 1 day) using his scripts for
starting blocks of persistent coasters on ranger. Perhaps he has
records of that or can try to reproduce the same thing we saw then. (I
probably have the scripts we were using somewhere as well).

On Thu, Mar 8, 2012 at 10:21 AM, Michael Wilde wrote:

> David, thanks for addressing this problem.
Does it affect any of the > other local providers: pbs, condor, (sge), cobalt? > > (I need to do some cobalt runs on Eureka for a user today, so I hope that > provider is OK). > > You should describe the issue and fix on swift-devel. > > We should start a convention where we can document known issues for > releases, so that users dont have to discover these bugs on their own. Can > you make an action item to propose and start such a place (probably > crosslinked to both Downloads and Documentation). Not urgent for today, but > next week would be good. > > Thanks, > > - Mike > > > > ----- Original Message ----- > > From: "David Kelly" > > To: "Ketan Maheshwari" > > Cc: "Michael Wilde" > > Sent: Thursday, March 8, 2012 8:45:00 AM > > Subject: Re: sites.xml for ranger sge coasters > > 0.93 is frozen, but I committed the same change to 0.93.1 this > > morning. > > > > ----- Original Message ----- > > > From: "Ketan Maheshwari" > > > To: "David Kelly" > > > Cc: "Michael Wilde" > > > Sent: Wednesday, March 7, 2012 6:39:09 PM > > > Subject: Re: sites.xml for ranger sge coasters > > > is it committed in 0.93 too? > > > > > > > > > On Wed, Mar 7, 2012 at 6:10 PM, David Kelly < davidk at ci.uchicago.edu > > > > > > > wrote: > > > > > > > > > I submitted a fix to trunk for the SGE provider. The submit script > > > was > > > wrong - it started one worker per core, rather than one worker per > > > host. (Oddly it's been like that for years without anybody > > > noticing). > > > I ran a few sleep/hostname tests and it seems to be working. Can you > > > please give it a try? 
> > >
> > > Below is the sites.xml I used for my test:
> > >
> > > 5
> > > 600
> > > 16
> > > 1
> > > 3
> > > 16way
> > > 3
> > > development
> > > 0.4799
> > > 10000
> > > TG-DBS080004N
> > > /share/home/01503/davidkel/swiftwork
> > >
> > > Thanks,
> > > David
> > >
> > > --
> > > Ketan
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>

From davidk at ci.uchicago.edu Thu Mar 8 10:58:50 2012
From: davidk at ci.uchicago.edu (David Kelly)
Date: Thu, 8 Mar 2012 10:58:50 -0600 (CST)
Subject: [Swift-devel] sites.xml for ranger sge coasters
In-Reply-To: <594057048.75488.1331220103100.JavaMail.root@zimbra.anl.gov>
Message-ID: <850800957.6161.1331225930199.JavaMail.root@zimbra-mb2.anl.gov>

Mike,

Sure, here is a quick description. When SGE is used with coasters, the
submit script should attempt to start 1 worker.pl on each node. A
single worker.pl can handle multiple jobs. Instead, it was starting
one worker.pl for every core on every node. If you set jobspernode to
16 and had 16 cores per node, you could see up to 256 jobs per node.

I think this only affected SGE+Coasters, but we should add some tests
to the suite to verify.

David

----- Original Message -----
> From: "Michael Wilde"
> To: "David Kelly"
> Cc: "Ketan Maheshwari", "swift-devel"
> Sent: Thursday, March 8, 2012 9:21:43 AM
> Subject: Re: sites.xml for ranger sge coasters
> David, thanks for addressing this problem. Does it affect any of the
> other local providers: pbs, condor, (sge), cobalt?
>
> (I need to do some cobalt runs on Eureka for a user today, so I hope
> that provider is OK).
> > You should describe the issue and fix on swift-devel. > > We should start a convention where we can document known issues for > releases, so that users dont have to discover these bugs on their own. > Can you make an action item to propose and start such a place > (probably crosslinked to both Downloads and Documentation). Not urgent > for today, but next week would be good. > > Thanks, > > - Mike > > > > ----- Original Message ----- > > From: "David Kelly" > > To: "Ketan Maheshwari" > > Cc: "Michael Wilde" > > Sent: Thursday, March 8, 2012 8:45:00 AM > > Subject: Re: sites.xml for ranger sge coasters > > 0.93 is frozen, but I committed the same change to 0.93.1 this > > morning. > > > > ----- Original Message ----- > > > From: "Ketan Maheshwari" > > > To: "David Kelly" > > > Cc: "Michael Wilde" > > > Sent: Wednesday, March 7, 2012 6:39:09 PM > > > Subject: Re: sites.xml for ranger sge coasters > > > is it committed in 0.93 too? > > > > > > > > > On Wed, Mar 7, 2012 at 6:10 PM, David Kelly < > > > davidk at ci.uchicago.edu > > > > > > > wrote: > > > > > > > > > I submitted a fix to trunk for the SGE provider. The submit script > > > was > > > wrong - it started one worker per core, rather than one worker per > > > host. (Oddly it's been like that for years without anybody > > > noticing). > > > I ran a few sleep/hostname tests and it seems to be working. Can > > > you > > > please give it a try? 
> > >
> > > Below is the sites.xml I used for my test:
> > >
> > > 5
> > > 600
> > > 16
> > > 1
> > > 3
> > > 16way
> > > 3
> > > development
> > > 0.4799
> > > 10000
> > > TG-DBS080004N
> > > /share/home/01503/davidkel/swiftwork
> > >
> > > Thanks,
> > > David
> > >
> > > --
> > > Ketan
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

From wilde at mcs.anl.gov Thu Mar 8 11:07:57 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 8 Mar 2012 11:07:57 -0600 (CST)
Subject: [Swift-devel] sites.xml for ranger sge coasters
In-Reply-To: <850800957.6161.1331225930199.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <563677885.76269.1331226477018.JavaMail.root@zimbra.anl.gov>

> ...When SGE is used with coasters, the
> submit script should attempt to start 1 worker.pl on each node. A
> single worker.pl can handle multiple jobs. Instead, it was starting
> one worker.pl for every core on every node. If you set jobspernode to
> 16 and had 16 cores per node, you could see up to 256 jobs per node.
>
> I think this only affected SGE+Coasters, but we should add some tests
> to the suite to verify.

Code inspection of each local provider type might also help to verify
that the generated submit files are all correct now.

- Mike

From hockyg at uchicago.edu Fri Mar 9 15:59:29 2012
From: hockyg at uchicago.edu (Glen Hocky)
Date: Fri, 9 Mar 2012 16:59:29 -0500
Subject: [Swift-devel] sites.xml for ranger sge coasters
In-Reply-To: <594057048.75488.1331220103100.JavaMail.root@zimbra.anl.gov>
References: <561683211.5757.1331217900880.JavaMail.root@zimbra-mb2.anl.gov>
 <594057048.75488.1331220103100.JavaMail.root@zimbra.anl.gov>
Message-ID:

David, Mike

I'm now in a position to verify if this is working correctly or not,
again.

I wanted my new swift LAMMPS scripts to run 1 task per node using 16 cores.
This sites file seems to do that correctly, i.e., it appears w/ David's
change that only one coaster is started per node (and one job run per
coaster). In principle I should be able to test packing different
number of jobs in other ways

-Glen

00:29:00
3600
1
16
200
1
16way
1
development
1.99
10000
TG-CHE110004
/scratch/01021/hockyg/glass-lammps-runs
/share/home/01021/hockyg/reichman/glassy_dynamics/code/swift_lammps/run/test/swiftwork

On Thu, Mar 8, 2012 at 10:21 AM, Michael Wilde wrote:

> David, thanks for addressing this problem. Does it affect any of the
> other local providers: pbs, condor, (sge), cobalt?
>
> (I need to do some cobalt runs on Eureka for a user today, so I hope
> that provider is OK).
>
> You should describe the issue and fix on swift-devel.
>
> We should start a convention where we can document known issues for
> releases, so that users dont have to discover these bugs on their own.
> Can you make an action item to propose and start such a place
> (probably crosslinked to both Downloads and Documentation). Not urgent
> for today, but next week would be good.
>
> Thanks,
>
> - Mike
>
> ----- Original Message -----
> > From: "David Kelly"
> > To: "Ketan Maheshwari"
> > Cc: "Michael Wilde"
> > Sent: Thursday, March 8, 2012 8:45:00 AM
> > Subject: Re: sites.xml for ranger sge coasters
> > 0.93 is frozen, but I committed the same change to 0.93.1 this
> > morning.
> >
> > ----- Original Message -----
> > > From: "Ketan Maheshwari"
> > > To: "David Kelly"
> > > Cc: "Michael Wilde"
> > > Sent: Wednesday, March 7, 2012 6:39:09 PM
> > > Subject: Re: sites.xml for ranger sge coasters
> > > is it committed in 0.93 too?
> > >
> > > On Wed, Mar 7, 2012 at 6:10 PM, David Kelly
> > > < davidk at ci.uchicago.edu > wrote:
> > >
> > > I submitted a fix to trunk for the SGE provider. The submit script
> > > was wrong - it started one worker per core, rather than one worker
> > > per host.
(Oddly it's been like that for years without anybody > > > noticing). > > > I ran a few sleep/hostname tests and it seems to be working. Can you > > > please give it a try? > > > > > > Below is the sites.xml I used for my test: > > > > > > > > > > > > > > > > > > > > > 5 > > > 600 > > > 16 > > > 1 > > > 3 > > > 16way > > > 3 > > > development > > > 0.4799 > > > 10000 > > > TG-DBS080004N > > > /share/home/01503/davidkel/swiftwork > > > > > > > > > > > > Thanks, > > > David > > > > > > > > > > > > > > > -- > > > Ketan > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ketancmaheshwari at gmail.com Fri Mar 9 17:20:31 2012 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Fri, 9 Mar 2012 17:20:31 -0600 Subject: [Swift-devel] sites.xml for ranger sge coasters In-Reply-To: References: <561683211.5757.1331217900880.JavaMail.root@zimbra-mb2.anl.gov> <594057048.75488.1331220103100.JavaMail.root@zimbra.anl.gov> Message-ID: Glen, Could you tell me if you were able to get through the ranger queue today or yesterday? I am stuck sitting on the queue since one and a half day now. Thanks, Ketan On Fri, Mar 9, 2012 at 3:59 PM, Glen Hocky wrote: > David, Mike > I'm now in a position to verify if this is working correctly or not, > again. > > I wanted my new swift LAMMPS scripts to run 1 task per node using 16 > cores. This sites file seems to do that correctly, i.e., it appears w/ > David's change that only one coaster is started per node (and one job run > per coaster). 
In principle I should be able to test packing different > number of jobs in other ways > > -Glen > > > > 00:29:00 > 3600 > 1 > 16 > 200 > 1 > 16way > 1 > development > 1.99 > 10000 > TG-CHE110004 > /scratch/01021/hockyg/glass-lammps-runs > > /share/home/01021/hockyg/reichman/glassy_dynamics/code/swift_lammps/run/test/swiftwork > > > > > On Thu, Mar 8, 2012 at 10:21 AM, Michael Wilde wrote: > >> David, thanks for addressing this problem. Does it affect any of the >> other local providers: pbs, condor, (sge), cobalt? >> >> (I need to do some cobalt runs on Eureka for a user today, so I hope that >> provider is OK). >> >> You should describe the issue and fix on swift-devel. >> >> We should start a convention where we can document known issues for >> releases, so that users dont have to discover these bugs on their own. Can >> you make an action item to propose and start such a place (probably >> crosslinked to both Downloads and Documentation). Not urgent for today, but >> next week would be good. >> >> Thanks, >> >> - Mike >> >> >> >> ----- Original Message ----- >> > From: "David Kelly" >> > To: "Ketan Maheshwari" >> > Cc: "Michael Wilde" >> > Sent: Thursday, March 8, 2012 8:45:00 AM >> > Subject: Re: sites.xml for ranger sge coasters >> > 0.93 is frozen, but I committed the same change to 0.93.1 this >> > morning. >> > >> > ----- Original Message ----- >> > > From: "Ketan Maheshwari" >> > > To: "David Kelly" >> > > Cc: "Michael Wilde" >> > > Sent: Wednesday, March 7, 2012 6:39:09 PM >> > > Subject: Re: sites.xml for ranger sge coasters >> > > is it committed in 0.93 too? >> > > >> > > >> > > On Wed, Mar 7, 2012 at 6:10 PM, David Kelly < davidk at ci.uchicago.edu >> > > > >> > > wrote: >> > > >> > > >> > > I submitted a fix to trunk for the SGE provider. The submit script >> > > was >> > > wrong - it started one worker per core, rather than one worker per >> > > host. (Oddly it's been like that for years without anybody >> > > noticing). 
>> > > I ran a few sleep/hostname tests and it seems to be working. Can you >> > > please give it a try? >> > > >> > > Below is the sites.xml I used for my test: >> > > >> > > >> > > >> > > >> > > >> > > >> > > 5 >> > > 600 >> > > 16 >> > > 1 >> > > 3 >> > > 16way >> > > 3 >> > > development >> > > 0.4799 >> > > 10000 >> > > TG-DBS080004N >> > > /share/home/01503/davidkel/swiftwork >> > > >> > > >> > > >> > > Thanks, >> > > David >> > > >> > > >> > > >> > > >> > > -- >> > > Ketan >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonmon at mcs.anl.gov Wed Mar 14 11:01:20 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 14 Mar 2012 11:01:20 -0500 Subject: [Swift-devel] intermediate files and CDM Message-ID: So the test I was working on for the mosaswift test is located at /home/jonmon/Workspace/Swift/mosaswift-test The idea is the starting data is mapped to the cwd using CDM direct, the intermediate is mapped to /tmp/mosa using CDM direct which is the output of a cat job, and the final data is mapped back to the cwd using CDM direct which is the output from my appendHello.sh script. appendHello.sh does another cat and appends "Hello from " to the end of the file. 
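A minimal sketch of what a script like appendHello.sh might look like, based only on the description above. The real script is not shown in this thread. The function name `append_hello` and the use of `uname -n` to fill in whatever follows "Hello from" are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for appendHello.sh (the real script is not in this thread).
# Usage: append_hello INPUT OUTPUT
append_hello() {
    in_file=$1
    out_file=$2
    # Copy the intermediate data into the final output file.
    cat "$in_file" > "$out_file"
    # Append the marker line. Using the node name here is an assumption.
    echo "Hello from $(uname -n)" >> "$out_file"
}
```

Called as `append_hello intermediate.txt final.txt` (both file names are placeholders), this should leave the input content followed by one extra "Hello from" line in the output file.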
From jonmon at mcs.anl.gov Wed Mar 14 14:22:14 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Wed, 14 Mar 2012 14:22:14 -0500
Subject: [Swift-devel] Mosaswift update
In-Reply-To: <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com>
References: <37E89417-7C36-4EA7-8D0B-D304F185D4D2@mcs.anl.gov> <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com>
Message-ID: <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov>

Here is an update on the progress of the mosaswift test scripts:

I have a working CDM script and catsn (well, semi-working). The issue was that the rule for CDM direct was matching "data_files" and using an absolute path with "data_files" at the end. This caused the path to look like /path/to/data_files/data_files, since CDM takes what it matched and appends it to the absolute path that was provided. This will be documented better in the CDM section of the user guide.

The other error that was brought to light was that the worker nodes do not share /tmp, so they couldn't find error files when checking to see if their jobs completed. This was resolved by setting the number of workers from 1 to 2. Once mosastore is set up on the worker nodes this number can vary, since all the workers will share /tmp/mosa (at least that is my assumption).

I say semi-working because I intended the script to build up output during the stages of the swift run, and the final output would show that it was stored temporarily in /tmp. This is not happening, as the final output file contains only the last line (a Hello from line). I believe this is due to how CDM uses symlinks for the files that are matched, so a symlink is being overwritten instead of the file it points to. I ran into this problem in the past (this is documented in the user guide) and just need to verify that this is indeed the issue.

I have also fixed the parameter issue, so we now get a somewhat better error message about what went wrong.

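The path doubling described above can be illustrated with a small shell sketch. This only models the behaviour as stated in the message (the matched path is appended to the rule's absolute directory). It is not CDM's actual code, and the function name `cdm_direct_target` and the file name `input.txt` are hypothetical.

```shell
#!/bin/sh
# Model of the behaviour described in the message, not CDM's implementation:
# the path matched by the rule is appended to the rule's absolute directory.
cdm_direct_target() {
    base_dir=$1   # absolute directory given in the rule
    matched=$2    # relative path that the rule's pattern matched
    # Strip one trailing slash from the directory, then join the two parts.
    printf '%s/%s\n' "${base_dir%/}" "$matched"
}

# The rule directory already ends in data_files, so the component is doubled.
cdm_direct_target /path/to/data_files data_files/input.txt
# prints /path/to/data_files/data_files/input.txt

# Leaving the trailing data_files off the rule directory avoids the doubling.
cdm_direct_target /path/to data_files/input.txt
# prints /path/to/data_files/input.txt
```

Under this reading, the fix is to give the rule the parent directory rather than the full directory that the pattern already names.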
On Mar 13, 2012, at 1:02 PM, Emalayan Vairavanathan wrote: > Hi Jon, > > Great news and thank you for the update. I think we are getting very closer. Please let me know if I can help you. > > I am cc-ing MosaStore group to inform about the progress. > > Thank you > Emalayan > > From: Jonathan Monette > To: "svemalayan at yahoo.com" > Sent: Tuesday, 13 March 2012 10:03 AM > Subject: Mosaswift update > > Emalayan, > I may have figured out how to use the intermediate file set up that moss needs. We will need to use something called provider staging which is a feature Swift has. Provider staging copies data directly to the compute node. I will experiment with this and see if this is the missing piece we need. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wozniak at mcs.anl.gov Wed Mar 14 14:43:02 2012 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Wed, 14 Mar 2012 14:43:02 -0500 (CDT) Subject: [Swift-devel] Mosaswift update In-Reply-To: <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov> References: <37E89417-7C36-4EA7-8D0B-D304F185D4D2@mcs.anl.gov> <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com> <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov> Message-ID: On Wed, 14 Mar 2012, Jonathan Monette wrote: > I say semi working because I intended the script to build up output > during the stages on the swift run and the final output would show that > it was stored temporarily in /tmp. This is not happening as the final > output for the file contains only the last line(a Hello from line). > I believe this is due to how CDM using symlinks for the files that is > matched, so a symlink is being overwritten instead of at the file it > points to. I ran into this problem in the past(but this is documented > in the user guide) and just need verify that this is indeed the issue. In the past, the application unlinked the link. It looks like you are appending to the link. 
infile=`readlink $1` outfile=`readline $2` cat $infile >> $outfile Is there a case in which the readlink affects the behavior of this sequence? Does readline work correctly on the BG/P compute node? Justin -- Justin M Wozniak From jonmon at mcs.anl.gov Wed Mar 14 14:45:54 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 14 Mar 2012 14:45:54 -0500 Subject: [Swift-devel] Mosaswift update In-Reply-To: References: <37E89417-7C36-4EA7-8D0B-D304F185D4D2@mcs.anl.gov> <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com> <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov> Message-ID: I am not sure. I thought readlink returned the absolute path of the link which is what I wanted to use. I am not sure how readlink behaves on the compute nodes. This was my first attempt at getting around the link but I do not think this is working. On Mar 14, 2012, at 2:43 PM, Justin M Wozniak wrote: > On Wed, 14 Mar 2012, Jonathan Monette wrote: > >> I say semi working because I intended the script to build up output during the stages on the swift run and the final output would show that it was stored temporarily in /tmp. This is not happening as the final output for the file contains only the last line(a Hello from line). I believe this is due to how CDM using symlinks for the files that is matched, so a symlink is being overwritten instead of at the file it points to. I ran into this problem in the past(but this is documented in the user guide) and just need verify that this is indeed the issue. > > In the past, the application unlinked the link. It looks like you are appending to the link. > > infile=`readlink $1` > outfile=`readline $2` > cat $infile >> $outfile > > Is there a case in which the readlink affects the behavior of this sequence? > > Does readline work correctly on the BG/P compute node? 
>
> Justin
>
> --
> Justin M Wozniak

From jonmon at mcs.anl.gov Wed Mar 14 14:47:41 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Wed, 14 Mar 2012 14:47:41 -0500
Subject: [Swift-devel] Worker node ip address
Message-ID: <0A0D95B0-F26C-4D4A-8592-C0C741552645@mcs.anl.gov>

Our current coaster hosts scripts rely on the worker logs to report the IP addresses. With the kernel profile for mosa, hostname does not report the IP addresses, so the workers do not know them. In a previous email thread ([Swift-devel] [ZeptoOS] hostname returns none in Surveyor) we discussed how to get the worker node IP addresses using personality.sh and some netmasking (attributed to Zhao, thanks!).

Emalayan, you specified that for now you can use torus rank 0 for the mosa manager and have the mosa workers connect back to torus rank 0. In a previous email to me you said that this approach will work for the experiments you are planning for SC, but that you would need a better way for other mosa deployments.

So I guess the question is: is the coaster hosts issue a blocker for you, Emalayan? I want to see whether you are waiting on this or not. Could you also verify that you are using the torus rank 0 method?

From wozniak at mcs.anl.gov Wed Mar 14 15:02:28 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Wed, 14 Mar 2012 15:02:28 -0500 (CDT)
Subject: [Swift-devel] Mosaswift update
In-Reply-To:
References: <37E89417-7C36-4EA7-8D0B-D304F185D4D2@mcs.anl.gov> <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com> <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov>
Message-ID:

Ok, then it's a typo on readlinE .

On Wed, 14 Mar 2012, Jonathan Monette wrote:

> I am not sure. I thought readlink returned the absolute path of the
> link which is what I wanted to use. I am not sure how readlink behaves
> on the compute nodes. This was my first attempt at getting around the
> link but I do not think this is working.
> > On Mar 14, 2012, at 2:43 PM, Justin M Wozniak wrote: > >> On Wed, 14 Mar 2012, Jonathan Monette wrote: >> >>> I say semi working because I intended the script to build up output during the stages on the swift run and the final output would show that it was stored temporarily in /tmp. This is not happening as the final output for the file contains only the last line(a Hello from line). I believe this is due to how CDM using symlinks for the files that is matched, so a symlink is being overwritten instead of at the file it points to. I ran into this problem in the past(but this is documented in the user guide) and just need verify that this is indeed the issue. >> >> In the past, the application unlinked the link. It looks like you are appending to the link. >> >> infile=`readlink $1` >> outfile=`readline $2` >> cat $infile >> $outfile >> >> Is there a case in which the readlink affects the behavior of this sequence? >> >> Does readline work correctly on the BG/P compute node? >> >> Justin >> >> -- >> Justin M Wozniak > > -- Justin M Wozniak From jonmon at mcs.anl.gov Wed Mar 14 15:05:47 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 14 Mar 2012 15:05:47 -0500 Subject: [Swift-devel] Mosaswift update In-Reply-To: References: <37E89417-7C36-4EA7-8D0B-D304F185D4D2@mcs.anl.gov> <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com> <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov> Message-ID: <66CC1246-ECA8-40FD-AEBC-E04B8901696A@mcs.anl.gov> Oh, thanks, didn't catch that. Trying another run right now but I think someone is using all of surveyor right now, but manual coaster service job is stuck in the queue state. On Mar 14, 2012, at 3:02 PM, Justin M Wozniak wrote: > > Ok, then it's a typo on readlinE . > > On Wed, 14 Mar 2012, Jonathan Monette wrote: > >> I am not sure. I thought readlink returned the absolute path of the link which is what I wanted to use. I am not sure how readlink behaves on the compute nodes. 
This was my first attempt at getting around the link but I do not think this is working. >> >> On Mar 14, 2012, at 2:43 PM, Justin M Wozniak wrote: >> >>> On Wed, 14 Mar 2012, Jonathan Monette wrote: >>> >>>> I say semi working because I intended the script to build up output during the stages on the swift run and the final output would show that it was stored temporarily in /tmp. This is not happening as the final output for the file contains only the last line(a Hello from line). I believe this is due to how CDM using symlinks for the files that is matched, so a symlink is being overwritten instead of at the file it points to. I ran into this problem in the past(but this is documented in the user guide) and just need verify that this is indeed the issue. >>> >>> In the past, the application unlinked the link. It looks like you are appending to the link. >>> >>> infile=`readlink $1` >>> outfile=`readline $2` >>> cat $infile >> $outfile >>> >>> Is there a case in which the readlink affects the behavior of this sequence? >>> >>> Does readline work correctly on the BG/P compute node? >>> >>> Justin >>> >>> -- >>> Justin M Wozniak >> >> > > -- > Justin M Wozniak From jonmon at mcs.anl.gov Wed Mar 14 16:00:13 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 14 Mar 2012 16:00:13 -0500 Subject: [Swift-devel] Mosaswift update In-Reply-To: <66CC1246-ECA8-40FD-AEBC-E04B8901696A@mcs.anl.gov> References: <37E89417-7C36-4EA7-8D0B-D304F185D4D2@mcs.anl.gov> <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com> <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov> <66CC1246-ECA8-40FD-AEBC-E04B8901696A@mcs.anl.gov> Message-ID: So that actually fixed the issue. So we have a working example of using CDM and intermediate file storage that Emalayan can use for testing his mosa store set up. On Mar 14, 2012, at 3:05 PM, Jonathan Monette wrote: > Oh, thanks, didn't catch that. 
Trying another run right now but I think someone is using all of surveyor right now, but manual coaster service job is stuck in the queue state. > > > On Mar 14, 2012, at 3:02 PM, Justin M Wozniak wrote: > >> >> Ok, then it's a typo on readlinE . >> >> On Wed, 14 Mar 2012, Jonathan Monette wrote: >> >>> I am not sure. I thought readlink returned the absolute path of the link which is what I wanted to use. I am not sure how readlink behaves on the compute nodes. This was my first attempt at getting around the link but I do not think this is working. >>> >>> On Mar 14, 2012, at 2:43 PM, Justin M Wozniak wrote: >>> >>>> On Wed, 14 Mar 2012, Jonathan Monette wrote: >>>> >>>>> I say semi working because I intended the script to build up output during the stages on the swift run and the final output would show that it was stored temporarily in /tmp. This is not happening as the final output for the file contains only the last line(a Hello from line). I believe this is due to how CDM using symlinks for the files that is matched, so a symlink is being overwritten instead of at the file it points to. I ran into this problem in the past(but this is documented in the user guide) and just need verify that this is indeed the issue. >>>> >>>> In the past, the application unlinked the link. It looks like you are appending to the link. >>>> >>>> infile=`readlink $1` >>>> outfile=`readline $2` >>>> cat $infile >> $outfile >>>> >>>> Is there a case in which the readlink affects the behavior of this sequence? >>>> >>>> Does readline work correctly on the BG/P compute node? 
>>>> >>>> Justin >>>> >>>> -- >>>> Justin M Wozniak >>> >>> >> >> -- >> Justin M Wozniak > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From svemalayan at yahoo.com Wed Mar 14 16:03:05 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Wed, 14 Mar 2012 14:03:05 -0700 (PDT) Subject: [Swift-devel] Mosaswift update In-Reply-To: References: <37E89417-7C36-4EA7-8D0B-D304F185D4D2@mcs.anl.gov> <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com> <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov> <66CC1246-ECA8-40FD-AEBC-E04B8901696A@mcs.anl.gov> Message-ID: <1331758985.63491.YahooMailNeo@web39507.mail.mud.yahoo.com> Hi Jon, Great, Thank you very much. Could you please send me the setup ? Thank you Emalayan ________________________________ From: Jonathan Monette To: Justin M Wozniak Cc: "swift-devel at ci.uchicago.edu Devel" ; MosaStore Sent: Wednesday, 14 March 2012 2:00 PM Subject: Re: [Swift-devel] Mosaswift update So that actually fixed the issue.? So we have a working example of using CDM and intermediate file storage that Emalayan can use for testing his mosa store set up. On Mar 14, 2012, at 3:05 PM, Jonathan Monette wrote: > Oh, thanks, didn't catch that.? Trying another run right now but I think someone is using all of surveyor right now, but manual coaster service job is stuck in the queue state. > > > On Mar 14, 2012, at 3:02 PM, Justin M Wozniak wrote: > >> >> Ok, then it's a typo on readlinE . >> >> On Wed, 14 Mar 2012, Jonathan Monette wrote: >> >>> I am not sure.? I thought readlink returned the absolute path of the link which is what I wanted to use.? I am not sure how readlink behaves on the compute nodes.? This was my first attempt at getting around the link but I do not think this is working. 
>>> >>> On Mar 14, 2012, at 2:43 PM, Justin M Wozniak wrote: >>> >>>> On Wed, 14 Mar 2012, Jonathan Monette wrote: >>>> >>>>> I say semi working because I intended the script to build up output during the stages on the swift run and the final output would show that it was stored temporarily in /tmp.? This is not happening as the final output for the file contains only the last line(a Hello from line). I believe this is due to how CDM using symlinks for the files that is matched, so a symlink is being overwritten instead of at the file it points to.? I ran into this problem in the past(but this is documented in the user guide) and just need verify that this is indeed the issue. >>>> >>>> In the past, the application unlinked the link.? It looks like you are appending to the link. >>>> >>>> infile=`readlink $1` >>>> outfile=`readline $2` >>>> cat $infile >> $outfile >>>> >>>> Is there a case in which the readlink affects the behavior of this sequence? >>>> >>>> Does readline work correctly on the BG/P compute node? >>>> >>>> ??? Justin >>>> >>>> -- >>>> Justin M Wozniak >>> >>> >> >> -- >> Justin M Wozniak > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- You received this message because you are subscribed to the Google Groups "MosaStore" group. To post to this group, send email to mosastore at googlegroups.com. To unsubscribe from this group, send email to mosastore+unsubscribe at googlegroups.com. For more options, visit this group at http://groups.google.com/group/mosastore?hl=en. -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From jonmon at mcs.anl.gov Wed Mar 14 16:29:06 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Wed, 14 Mar 2012 16:29:06 -0500
Subject: [Swift-devel] Mosaswift update
In-Reply-To: <1331758985.63491.YahooMailNeo@web39507.mail.mud.yahoo.com>
References: <37E89417-7C36-4EA7-8D0B-D304F185D4D2@mcs.anl.gov> <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com> <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov> <66CC1246-ECA8-40FD-AEBC-E04B8901696A@mcs.anl.gov> <1331758985.63491.YahooMailNeo@web39507.mail.mud.yahoo.com>
Message-ID: <1242CCDF-22F7-44B6-B196-8970E58271F7@mcs.anl.gov>

The working example is located in /home/jonmon/Workspace/Swift/mosaswift-test

There is a README there that provides a little bit of information. Quick version to run it:
1) start your workers
2) ./setup.sh
3) ./run.sh

The directory final will have 5 files if it completed without error. The content of the files will look like:
Hello world
Hello from

On Mar 14, 2012, at 4:03 PM, Emalayan Vairavanathan wrote:

> Hi Jon,
>
> Great, Thank you very much. Could you please send me the setup ?
>
> Thank you
> Emalayan
>
> From: Jonathan Monette
> To: Justin M Wozniak
> Cc: "swift-devel at ci.uchicago.edu Devel" ; MosaStore
> Sent: Wednesday, 14 March 2012 2:00 PM
> Subject: Re: [Swift-devel] Mosaswift update
>
> So that actually fixed the issue. So we have a working example of using CDM and intermediate file storage that Emalayan can use for testing his mosa store set up.
>
> On Mar 14, 2012, at 3:05 PM, Jonathan Monette wrote:
>
> > Oh, thanks, didn't catch that. Trying another run right now but I think someone is using all of surveyor right now, but manual coaster service job is stuck in the queue state.
> >
> >
> > On Mar 14, 2012, at 3:02 PM, Justin M Wozniak wrote:
> >
> >>
> >> Ok, then it's a typo on readlinE .
> >>
> >> On Wed, 14 Mar 2012, Jonathan Monette wrote:
> >>
> >>> I am not sure.
I thought readlink returned the absolute path of the link which is what I wanted to use. I am not sure how readlink behaves on the compute nodes. This was my first attempt at getting around the link but I do not think this is working. > >>> > >>> On Mar 14, 2012, at 2:43 PM, Justin M Wozniak wrote: > >>> > >>>> On Wed, 14 Mar 2012, Jonathan Monette wrote: > >>>> > >>>>> I say semi working because I intended the script to build up output during the stages on the swift run and the final output would show that it was stored temporarily in /tmp. This is not happening as the final output for the file contains only the last line(a Hello from line). I believe this is due to how CDM using symlinks for the files that is matched, so a symlink is being overwritten instead of at the file it points to. I ran into this problem in the past(but this is documented in the user guide) and just need verify that this is indeed the issue. > >>>> > >>>> In the past, the application unlinked the link. It looks like you are appending to the link. > >>>> > >>>> infile=`readlink $1` > >>>> outfile=`readline $2` > >>>> cat $infile >> $outfile > >>>> > >>>> Is there a case in which the readlink affects the behavior of this sequence? > >>>> > >>>> Does readline work correctly on the BG/P compute node? > >>>> > >>>> Justin > >>>> > >>>> -- > >>>> Justin M Wozniak > >>> > >>> > >> > >> -- > >> Justin M Wozniak > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > You received this message because you are subscribed to the Google Groups "MosaStore" group. > To post to this group, send email to mosastore at googlegroups.com. > To unsubscribe from this group, send email to mosastore+unsubscribe at googlegroups.com. > For more options, visit this group at http://groups.google.com/group/mosastore?hl=en. 
> > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From svemalayan at yahoo.com Wed Mar 14 17:14:07 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Wed, 14 Mar 2012 15:14:07 -0700 (PDT) Subject: [Swift-devel] Mosaswift update In-Reply-To: <1242CCDF-22F7-44B6-B196-8970E58271F7@mcs.anl.gov> References: <37E89417-7C36-4EA7-8D0B-D304F185D4D2@mcs.anl.gov> <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com> <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov> <66CC1246-ECA8-40FD-AEBC-E04B8901696A@mcs.anl.gov> <1331758985.63491.YahooMailNeo@web39507.mail.mud.yahoo.com> <1242CCDF-22F7-44B6-B196-8970E58271F7@mcs.anl.gov> Message-ID: <1331763247.46950.YahooMailNeo@web39505.mail.mud.yahoo.com> Thank you Jon. I got the setup and trying to run it. - Emalayan ________________________________ From: Jonathan Monette To: Emalayan Vairavanathan Cc: "mosastore at googlegroups.com" ; Justin M Wozniak ; "swift-devel at ci.uchicago.edu Devel" Sent: Wednesday, 14 March 2012 2:29 PM Subject: Re: [Swift-devel] Mosaswift update The working example is located in /home/jonmon/Workspace/Swift/mosaswift-test There is a README there that provides a little bit of information. Quick version to run it: 1) start your workers 2) ./setup.sh 3) ./run.sh The directory final will have 5 files if it completed without error. ?The content of the files will look like: Hello world Hello from On Mar 14, 2012, at 4:03 PM, Emalayan Vairavanathan wrote: Hi Jon, > > >Great, Thank you very much. Could you please send me the setup ? 
> > > >Thank you >Emalayan > > > > >________________________________ > From: Jonathan Monette >To: Justin M Wozniak >Cc: "swift-devel at ci.uchicago.edu Devel" ; MosaStore >Sent: Wednesday, 14 March 2012 2:00 PM >Subject: Re: [Swift-devel] Mosaswift update > >So that actually fixed the issue.? So we have a working example of using CDM and intermediate file storage that Emalayan can use for testing his mosa store set up. > >On Mar 14, 2012, at 3:05 PM, Jonathan Monette wrote: > >> Oh, thanks, didn't catch that.? Trying another run right now but I think someone is using all of surveyor right now, but manual coaster service job is stuck in the queue state. >> >> >> On Mar 14, 2012, at 3:02 PM, Justin M Wozniak wrote: >> >>> >>> Ok, then it's a typo on readlinE . >>> >>> On Wed, 14 Mar 2012, Jonathan Monette wrote: >>> >>>> I am not sure.? I thought readlink returned the absolute path of the link which is what I wanted to use.? I am not sure how readlink behaves on the compute nodes.? This was my first attempt at getting around the link but I do not think this is working. >>>> >>>> On Mar 14, 2012, at 2:43 PM, Justin M Wozniak wrote: >>>> >>>>> On Wed, 14 Mar 2012, Jonathan Monette wrote: >>>>> >>>>>> I say semi working because I intended the script to build up output during the stages on the swift run and the final output would show that it was stored temporarily in /tmp.? This is not happening as the final output for the file contains only the last line(a Hello from line). I believe this is due to how CDM using symlinks for the files that is matched, so a symlink is being overwritten instead of at the file it points to.? I ran into this problem in the past(but this is documented in the user guide) and just need verify that this is indeed the issue. >>>>> >>>>> In the past, the application unlinked the link.? It looks like you are appending to the link. 
>>>>> >>>>> infile=`readlink $1` >>>>> outfile=`readline $2` >>>>> cat $infile >> $outfile >>>>> >>>>> Is there a case in which the readlink affects the behavior of this sequence? >>>>> >>>>> Does readline work correctly on the BG/P compute node? >>>>> >>>>> ??? Justin >>>>> >>>>> -- >>>>> Justin M Wozniak >>>> >>>> >>> >>> -- >>> Justin M Wozniak >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >-- >You received this message because you are subscribed to the Google Groups "MosaStore" group. >To post to this group, send email to mosastore at googlegroups.com. >To unsubscribe from this group, send email to mosastore+unsubscribe at googlegroups.com. >For more options, visit this group at http://groups.google.com/group/mosastore?hl=en. > > > >_______________________________________________ >Swift-devel mailing list >Swift-devel at ci.uchicago.edu >https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From svemalayan at yahoo.com Thu Mar 15 03:30:17 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Thu, 15 Mar 2012 01:30:17 -0700 (PDT) Subject: [Swift-devel] Mosaswift update In-Reply-To: <1242CCDF-22F7-44B6-B196-8970E58271F7@mcs.anl.gov> References: <37E89417-7C36-4EA7-8D0B-D304F185D4D2@mcs.anl.gov> <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com> <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov> <66CC1246-ECA8-40FD-AEBC-E04B8901696A@mcs.anl.gov> <1331758985.63491.YahooMailNeo@web39507.mail.mud.yahoo.com> <1242CCDF-22F7-44B6-B196-8970E58271F7@mcs.anl.gov> Message-ID: <1331800217.12924.YahooMailNeo@web39502.mail.mud.yahoo.com> Hi Jon and All, MosaStore+Swift integration is done successfully. This is really a great news. 
I successfully ran the setup provided by Jon with MosaStore+Swift on Surveyor (on 1, 3, 5 and 64 nodes, with the number of jobs varying from 1 to 1024). Jon and Mike, thank you again for your time in helping me get this far.

A quick question: the setup creates the Swift working directory in GPFS but stores the intermediate files in MosaStore. Where should the Swift working directory actually live (GPFS or Mosa)? And does this matter at all if I am going to use Mosa as intermediate storage?

(As a next step I will script our pipeline benchmark on Swift and run it at large scale to get some initial performance numbers. In parallel, I will work on getting Montage and ModFTDock running on BG/P with swift-CDM and MosaStore.)

Regards
Emalayan

________________________________
From: Jonathan Monette
To: Emalayan Vairavanathan
Cc: "mosastore at googlegroups.com" ; Justin M Wozniak ; "swift-devel at ci.uchicago.edu Devel"
Sent: Wednesday, 14 March 2012 2:29 PM
Subject: Re: [Swift-devel] Mosaswift update

The working example is located in /home/jonmon/Workspace/Swift/mosaswift-test

There is a README there that provides a little bit of information. Quick version to run it:
1) start your workers
2) ./setup.sh
3) ./run.sh

The directory final will have 5 files if it completed without error. The content of the files will look like:
Hello world
Hello from

On Mar 14, 2012, at 4:03 PM, Emalayan Vairavanathan wrote:

Hi Jon,
>
>Great, Thank you very much. Could you please send me the setup ?
>
>Thank you
>Emalayan
>
>________________________________
> From: Jonathan Monette
>To: Justin M Wozniak
>Cc: "swift-devel at ci.uchicago.edu Devel" ; MosaStore
>Sent: Wednesday, 14 March 2012 2:00 PM
>Subject: Re: [Swift-devel] Mosaswift update
>
>So that actually fixed the issue. So we have a working example of using CDM and intermediate file storage that Emalayan can use for testing his mosa store set up.
>
>On Mar 14, 2012, at 3:05 PM, Jonathan Monette wrote:
>
>> Oh, thanks, didn't catch that. Trying another run right now but I think someone is using all of surveyor right now, but manual coaster service job is stuck in the queue state.
>>
>> On Mar 14, 2012, at 3:02 PM, Justin M Wozniak wrote:
>>
>>> Ok, then it's a typo on readlinE .
>>>
>>> On Wed, 14 Mar 2012, Jonathan Monette wrote:
>>>
>>>> I am not sure. I thought readlink returned the absolute path of the link which is what I wanted to use. I am not sure how readlink behaves on the compute nodes. This was my first attempt at getting around the link but I do not think this is working.
>>>>
>>>> On Mar 14, 2012, at 2:43 PM, Justin M Wozniak wrote:
>>>>
>>>>> On Wed, 14 Mar 2012, Jonathan Monette wrote:
>>>>>
>>>>>> I say semi working because I intended the script to build up output during the stages on the swift run and the final output would show that it was stored temporarily in /tmp. This is not happening as the final output for the file contains only the last line(a Hello from line). I believe this is due to how CDM using symlinks for the files that is matched, so a symlink is being overwritten instead of at the file it points to. I ran into this problem in the past(but this is documented in the user guide) and just need verify that this is indeed the issue.
>>>>>
>>>>> In the past, the application unlinked the link. It looks like you are appending to the link.
>>>>>
>>>>> infile=`readlink $1`
>>>>> outfile=`readline $2`
>>>>> cat $infile >> $outfile
>>>>>
>>>>> Is there a case in which the readlink affects the behavior of this sequence?
>>>>>
>>>>> Does readline work correctly on the BG/P compute node?
>>>>>
>>>>> Justin
>>>>>
>>>>> --
>>>>> Justin M Wozniak

-------------- next part --------------
An HTML attachment was scrubbed...
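For reference, a minimal sketch of the readlink approach discussed in the thread above. It assumes the CDM-managed file is a symlink whose stored target is an absolute path. The temporary paths and the resolve helper are hypothetical and are not taken from the mosaswift-test scripts.

```shell
#!/bin/sh
# Hypothetical demo: append through a resolved symlink, then show what
# happens when an application replaces the link itself.
set -e
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

echo "Hello world" > "$work/real.txt"
ln -s "$work/real.txt" "$work/link.txt"

# readlink prints the stored target of a symlink. For a regular file it
# prints nothing and exits non-zero, so fall back to the path itself.
resolve() {
    target=$(readlink "$1") || target=$1
    printf '%s\n' "$target"
}

# Append to the file the link points at.
outfile=$(resolve "$work/link.txt")
echo "Hello from stage 1" >> "$outfile"

# Replacing the link (unlink, then create a regular file of the same name)
# leaves the real file untouched, so this line never reaches it.
rm "$work/link.txt"
echo "Hello from stage 2" > "$work/link.txt"

cat "$work/real.txt"
```

The final cat shows only the first two lines; the "stage 2" line ends up in the new regular file that replaced the link. If the link target were relative, the resolved path would have to be interpreted relative to the link's directory, which this sketch does not handle.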
URL:
From jonmon at mcs.anl.gov Thu Mar 15 07:55:30 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 15 Mar 2012 07:55:30 -0500
Subject: [Swift-devel] Mosaswift update
In-Reply-To: <1331800217.12924.YahooMailNeo@web39502.mail.mud.yahoo.com>
References: <37E89417-7C36-4EA7-8D0B-D304F185D4D2@mcs.anl.gov> <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com> <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov> <66CC1246-ECA8-40FD-AEBC-E04B8901696A@mcs.anl.gov> <1331758985.63491.YahooMailNeo@web39507.mail.mud.yahoo.com> <1242CCDF-22F7-44B6-B196-8970E58271F7@mcs.anl.gov> <1331800217.12924.YahooMailNeo@web39502.mail.mud.yahoo.com>
Message-ID:

I do not think it will matter since all the intermediate files will be mapped onto mosa but you can change the paths in those generated files to point to whatever you want. Then you can experiment with having the starting and final files in mosa if you want.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From svemalayan at yahoo.com Thu Mar 15 15:32:12 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 15 Mar 2012 13:32:12 -0700 (PDT)
Subject: [Swift-devel] Worker node ip address
In-Reply-To: <0A0D95B0-F26C-4D4A-8592-C0C741552645@mcs.anl.gov>
References: <0A0D95B0-F26C-4D4A-8592-C0C741552645@mcs.anl.gov>
Message-ID: <1331843532.88154.YahooMailNeo@web39504.mail.mud.yahoo.com>

Hi Jon,

MosaStore can calculate the list of IPs in a partition (if partition is the multiple of pset) from the information available to ZeptoOS. So Coaster's does not need to provide this information and I am not blocked by this issue.

Thank you
Emalayan

________________________________
From: Jonathan Monette
To: "swift-devel at ci.uchicago.edu Devel"
Cc: Michael Wilde ; emalayan at ece.ubc.ca; Justin Wozniak ; MosaStore
Sent: Wednesday, 14 March 2012 12:47 PM
Subject: Worker node ip address

Our current coaster hosts scripts rely on the worker logs and to report the ip addresses. In the case of the kernel profile for mosa, hostname does not report the ip addresses so the workers do not know them. In a previous email thread( [Swift-devel] [ZeptoOS] hostname returns none in Surveyor) we discussed about how to get the worker node ip addresses using personality.sh and some netmasking(attributed to Zhao, thanks!).
Emalayan you specified that for now you can use the approach of using the torus rank 0 for the mosa manager and the mosa workers will just connect back to torus rank 0. In a previous email to me you specified that this approach will work for the experiments you are planning for SC but you would need a better way for other mosa deployments.

So i guess the question is is the coaster hosts issue a blocker for you Emalayan but want to see if you are waiting for this or not?

Could also verify that you are using the method of using the torus rank 0.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From svemalayan at yahoo.com Thu Mar 15 15:34:06 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 15 Mar 2012 13:34:06 -0700 (PDT)
Subject: [Swift-devel] Mosaswift update
In-Reply-To:
References: <37E89417-7C36-4EA7-8D0B-D304F185D4D2@mcs.anl.gov> <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com> <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov> <66CC1246-ECA8-40FD-AEBC-E04B8901696A@mcs.anl.gov> <1331758985.63491.YahooMailNeo@web39507.mail.mud.yahoo.com> <1242CCDF-22F7-44B6-B196-8970E58271F7@mcs.anl.gov> <1331800217.12924.YahooMailNeo@web39502.mail.mud.yahoo.com>
Message-ID: <1331843646.95462.YahooMailNeo@web39502.mail.mud.yahoo.com>

Thank you for the clarification Jon.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From jonmon at mcs.anl.gov Thu Mar 15 16:06:17 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 15 Mar 2012 16:06:17 -0500
Subject: [Swift-devel] Worker node ip address
In-Reply-To: <1331843532.88154.YahooMailNeo@web39504.mail.mud.yahoo.com>
References: <0A0D95B0-F26C-4D4A-8592-C0C741552645@mcs.anl.gov> <1331843532.88154.YahooMailNeo@web39504.mail.mud.yahoo.com>
Message-ID:

Ok.
Thanks Emalayan.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From matei.ripeanu at gmail.com Mon Mar 19 10:31:13 2012
From: matei.ripeanu at gmail.com (Matei Ripeanu)
Date: Mon, 19 Mar 2012 17:31:13 +0200
Subject: [Swift-devel] Mosaswift update
In-Reply-To: <1331800217.12924.YahooMailNeo@web39502.mail.mud.yahoo.com>
References: <37E89417-7C36-4EA7-8D0B-D304F185D4D2@mcs.anl.gov> <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com> <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov> <66CC1246-ECA8-40FD-AEBC-E04B8901696A@mcs.anl.gov> <1331758985.63491.YahooMailNeo@web39507.mail.mud.yahoo.com> <1242CCDF-22F7-44B6-B196-8970E58271F7@mcs.anl.gov> <1331800217.12924.YahooMailNeo@web39502.mail.mud.yahoo.com>
Message-ID: <006501cd05e5$52aad730$f8008590$@gmail.com>

Jon,

I wanted to thank you for your huge effort making this work. For us this is big progress - we can start thinking about the experiments we'd like to have for the SC paper.

-Matei

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From jonmon at mcs.anl.gov Mon Mar 19 17:47:39 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 19 Mar 2012 17:47:39 -0500
Subject: [Swift-devel] Mosaswift update
In-Reply-To: <006501cd05e5$52aad730$f8008590$@gmail.com>
References: <37E89417-7C36-4EA7-8D0B-D304F185D4D2@mcs.anl.gov> <1331661767.99575.YahooMailNeo@web39501.mail.mud.yahoo.com> <52F37C63-10EC-44D3-9409-821E32D06C8F@mcs.anl.gov> <66CC1246-ECA8-40FD-AEBC-E04B8901696A@mcs.anl.gov> <1331758985.63491.YahooMailNeo@web39507.mail.mud.yahoo.com> <1242CCDF-22F7-44B6-B196-8970E58271F7@mcs.anl.gov> <1331800217.12924.YahooMailNeo@web39502.mail.mud.yahoo.com> <006501cd05e5$52aad730$f8008590$@gmail.com>
Message-ID:

No problem. If there is any problems on the Swift side please do not hesitate in contacting me. I will help in any way I can.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From svemalayan at yahoo.com Mon Mar 19 20:36:19 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Mon, 19 Mar 2012 18:36:19 -0700 (PDT) Subject: [Swift-devel] Issues with Montage & Swift-CDM Message-ID: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> Hi Jon and All, I tried to run the montage-swift workflow on my local machine with CDM rules enabled. The workflow is failing (only the first stage completed successfully). It seems to me that some stages of the workflow expect the input to be located in a specific location. I am using swift-0.93. I have attached the log files, site files, configuration files and swift stderr/out with this mail. It would be great if you could look at this issue and help me. (Note: Due to a similar issue, the ModFTDock-swift workflow also fails on my local machine when CDM rules are enabled.) Thank you Emalayan -------------- next part -------------- A non-text attachment was scrubbed... Name: logs.zip Type: application/zip Size: 11019 bytes Desc: not available URL: From wozniak at mcs.anl.gov Wed Mar 21 15:56:32 2012 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Wed, 21 Mar 2012 15:56:32 -0500 (CDT) Subject: [Swift-devel] Issues with Montage & Swift-CDM In-Reply-To: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> Message-ID: Hi Emalayan I just took a look at this log. Does the montage-swift workflow work for you without CDM? Is there a machine on which we can all try to run this locally? (PADS login node?) Justin On Mon, 19 Mar 2012, Emalayan Vairavanathan wrote: > Hi Jon and All, > > I tried to run the montage-swift workflow on my local machine with CDM > rules enabled. The workflow is failing (only first stage was completed > successfully).
For me it seems that some stages of the workflow expect > the input to be located in a specific location. > > I am using swift-0.93. I have attached the log files, site files, > configuration files and swift stderr/out with this mail. It would be > great if you can look at this issue and help me. > > > (Note: Due to similar issue ModFTDock-swift workflow also fails on my local machine when CDM rules enabled) > > Thank you > Emalayan -- Justin M Wozniak From jonmon at mcs.anl.gov Wed Mar 21 18:30:14 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 21 Mar 2012 18:30:14 -0500 Subject: [Swift-devel] using a reservation Message-ID: <746BC72D-3CAE-4AD2-A43D-C8E6F565E511@mcs.anl.gov> Hello, I am trying to use a reservation I have on Beagle. Here is my sites file: 0.5 10000 _WORK_/local KEEP CI-MCB000119 1 DEBUG _WORK_/beagleRes/workers 100 100 pbs.aprun;pbs.mpp;depth=24 86400 00:04:00 1 20 20 advres=18833.687 12.00 10000 _WORK_/beagleRes I have tried both pbs.properties and pbs.resources as a sites entry; I got this information from https://sites.google.com/site/swiftdevel/sites/pbs. However, here is the PBS script that was generated: #CoG This script generated by CoG #CoG by class: class org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor #CoG on date: 2012/03/21 23:22:35 #PBS -S /bin/bash #PBS -N Block-0321-2211 #PBS -m n #PBS -A CI-MCB000119 #PBS -l mppwidth=20,mppnppn=1,mppdepth=24 #PBS -l walltime=17:00:00 #PBS -o /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stdout #PBS -e /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stderr WORKER_LOGGING_LEVEL=DEBUG #PBS -v WORKER_LOGGING_LEVEL cd / && aprun -n 20 -N 1 -cc none -d 24 -F exclusive /bin/sh -c '/usr/bin/perl /home/jonmon/.globus/coasters/cscript9177561070598799820.pl http://10.128.2.243:40904,http://127.0.0.2:40904,http://192.5.86.103:40904 0321-221135-000000 /lustre/beagle/jonmon/Swift/SciColSim/run163/swiftwork/beagleRes/workers' /bin/echo $?
>/home/jonmon/.globus/scripts/PBS1332885235909759395.submit.exitcode I have asked for a coaster block of 24 hours (my reservation is 96 hours), but the script shows a walltime of 17 hours. Furthermore, the line #PBS -l advres= is missing, so I am not using my reservation; I just get added to the batch queue and sit there. Does anyone remember how to specify a reservation in the sites file for PBS? From svemalayan at yahoo.com Wed Mar 21 18:55:37 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Wed, 21 Mar 2012 16:55:37 -0700 (PDT) Subject: [Swift-devel] Issues with Montage & Swift-CDM In-Reply-To: References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> Message-ID: <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> Hi Justin, Thank you for looking at the logs. Please find my answers below and let me know if I can help you in the debugging process. Does the montage-swift workflow work for you without CDM? >> Yes, it works successfully without CDM on our cluster. (I tried with swift-0.93 and coasters) Is there a machine on which we can all try to run this locally? (PADS login node?) >> I am not aware whether we can do it on PADS, and I think I do not have access to PADS. I tried montage-swift on our cluster with coasters. But I just set up everything on Surveyor and it works locally on the head node. You can find the setup here: /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts Please let me know if you need more information to recreate the problem. Thank you Emalayan ________________________________ From: Justin M Wozniak To: Emalayan Vairavanathan Cc: "swift-devel at ci.uchicago.edu" ; MosaStore ; matei Sent: Wednesday, 21 March 2012 1:56 PM Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM Hi Emalayan I just took a look at this log. Does the montage-swift workflow work for you without CDM?
Is there a machine on which we can all try to run this locally? (PADS login node?) Justin On Mon, 19 Mar 2012, Emalayan Vairavanathan wrote: > Hi Jon and All, > > I tried to run the montage-swift workflow on my local machine with CDM rules enabled. The workflow is failing (only first stage was completed successfully). For me it seems that some stages of the workflow expect the input to be located in a specific location. > > I am using swift-0.93. I have attached the log files, site files, configuration files and swift stderr/out with this mail. It would be great if you can look at this issue and help me. > > > (Note: Due to similar issue ModFTDock-swift workflow also fails on my local machine when CDM rules enabled) > > Thank you > Emalayan -- Justin M Wozniak From jonmon at mcs.anl.gov Wed Mar 21 22:19:54 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 21 Mar 2012 22:19:54 -0500 Subject: [Swift-devel] Issues with Montage & Swift-CDM In-Reply-To: <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> Message-ID: <160579E8-746B-475C-8C01-835B0E345C90@mcs.anl.gov> I can set up SwiftMontage on PADS (it may be there but outdated). Running on Surveyor had some problems in the past (the compute nodes couldn't find the Montage binaries). I will get back to debugging that later; I think I have a fix. On Mar 21, 2012, at 6:55 PM, Emalayan Vairavanathan wrote: > Hi Justin, > > Thank you for looking at the logs. Please find my answers below and let me know if I can help you in the debugging process. > > Does the montage-swift workflow work for you without CDM? > >> Yes, it works successfully with out CDM on our cluster. (I tried with swift-0.93 and coasters) > > Is there a machine on which we can all try to run this locally?
(PADS login node?) > >> I am not aware whether we can do it on PADS and I think i do not have access to PADS. I tired montage-swift on our cluster with coasters. > > But I just setup everything on Surveyor and it works locally on the head node. You can find the setup here. > /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts > > Please let me know if you need more information to recreate the problem. > > Thank you > Emalayan > > From: Justin M Wozniak > To: Emalayan Vairavanathan > Cc: "swift-devel at ci.uchicago.edu" ; MosaStore ; matei > Sent: Wednesday, 21 March 2012 1:56 PM > Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM > > Hi Emalayan > I just took a look at this log. Does the montage-swift workflow work for you without CDM? > Is there a machine on which we can all try to run this locally? (PADS login node?) > Justin > > On Mon, 19 Mar 2012, Emalayan Vairavanathan wrote: > > > Hi Jon and All, > > > > I tried to run the montage-swift workflow on my local machine with CDM rules enabled. The workflow is failing (only first stage was completed successfully). For me it seems that some stages of the workflow expect the input to be located in a specific location. > > > > I am using swift-0.93. I have attached the log files, site files, configuration files and swift stderr/out with this mail. It would be great if you can look at this issue and help me. > > > > > > (Note: Due to similar issue ModFTDock-swift workflow also fails on my local machine when CDM rules enabled) > > > > Thank you > > Emalayan > > -- Justin M Wozniak > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wozniak at mcs.anl.gov Wed Mar 21 22:53:59 2012 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Wed, 21 Mar 2012 22:53:59 -0500 (Central Daylight Time) Subject: [Swift-devel] Issues with Montage & Swift-CDM In-Reply-To: <160579E8-746B-475C-8C01-835B0E345C90@mcs.anl.gov> References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <160579E8-746B-475C-8C01-835B0E345C90@mcs.anl.gov> Message-ID: Let me try surveyor login first to see if I can reproduce it. On Wed, 21 Mar 2012, Jonathan Monette wrote: > I can set up SwiftMontage on PADS(it may be there but outdated). > Running using surveyor had some problems in the past(the compute nodes > couldn't find the Montage binaries). I will get back to debugging that > later, I think I have a fix. > > On Mar 21, 2012, at 6:55 PM, Emalayan Vairavanathan wrote: > >> Hi Justin, >> >> Thank you for looking at the logs. Please find my answers below and let me know if I can help you in the debugging process. >> >> Does the montage-swift workflow work for you without CDM? >>>> Yes, it works successfully with out CDM on our cluster. (I tried with swift-0.93 and coasters) >> >> Is there a machine on which we can all try to run this locally? (PADS login node?) >>>> I am not aware whether we can do it on PADS and I think i do not have access to PADS. I tired montage-swift on our cluster with coasters. >> >> But I just setup everything on Surveyor and it works locally on the head node. You can find the setup here. >> /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts >> >> Please let me know if you need more information to recreate the problem. >> >> Thank you >> Emalayan >> >> From: Justin M Wozniak >> To: Emalayan Vairavanathan >> Cc: "swift-devel at ci.uchicago.edu" ; MosaStore ; matei >> Sent: Wednesday, 21 March 2012 1:56 PM >> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >> >> Hi Emalayan >> I just took a look at this log. 
Does the montage-swift workflow work for you without CDM? >> Is there a machine on which we can all try to run this locally? (PADS login node?) >> Justin >> >> On Mon, 19 Mar 2012, Emalayan Vairavanathan wrote: >> >>> Hi Jon and All, >>> >>> I tried to run the montage-swift workflow on my local machine with CDM rules enabled. The workflow is failing (only first stage was completed successfully). For me it seems that some stages of the workflow expect the input to be located in a specific location. >>> >>> I am using swift-0.93. I have attached the log files, site files, configuration files and swift stderr/out with this mail. It would be great if you can look at this issue and help me. >>> >>> >>> (Note: Due to similar issue ModFTDock-swift workflow also fails on my local machine when CDM rules enabled) >>> >>> Thank you >>> Emalayan >> >> -- Justin M Wozniak >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- Justin M Wozniak From wilde at mcs.anl.gov Wed Mar 21 22:55:49 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 21 Mar 2012 22:55:49 -0500 (CDT) Subject: [Swift-devel] using a reservation In-Reply-To: <746BC72D-3CAE-4AD2-A43D-C8E6F565E511@mcs.anl.gov> Message-ID: <2115817913.106554.1332388549352.JavaMail.root@zimbra.anl.gov> Jon, Regarding the walltime, your sites file mis-spells maxwalltime; hence the jobs emitted by your script probably dont sum to anything beyond 17:00h at the 10m default time. I dont see why the res isnt making it through to the PBS script. - Mike ----- Original Message ----- > From: "Jonathan Monette" > To: "swift-devel at ci.uchicago.edu Devel" > Sent: Wednesday, March 21, 2012 6:30:14 PM > Subject: [Swift-devel] using a reservation > Hello, > I am trying to use a reservation I have on Beagle. 
Here is my sites > file: > > > > > > > 0.5 > 10000 > > _WORK_/local > > > > > > > > > > KEEP > > > CI-MCB000119 > 1 > DEBUG > key="workerLoggingDirectory">_WORK_/beagleRes/workers > 100 > 100 > key="providerAttributes">pbs.aprun;pbs.mpp;depth=24 > 86400 > 00:04:00 > 1 > 20 > 20 > > > key="pbs.properties">advres=18833.687 > > > 12.00 > 10000 > > > > _WORK_/beagleRes > > > > > > > > > I have tried both pbs.properties and pbs.resources as a sites entry, I > got this information from > https://sites.google.com/site/swiftdevel/sites/pbs > However here is the pbs script that has been generated: > > > > #CoG This script generated by CoG > #CoG by class: class > org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor > #CoG on date: 2012/03/21 23:22:35 > > > #PBS -S /bin/bash > #PBS -N Block-0321-2211 > #PBS -m n > #PBS -A CI-MCB000119 > #PBS -l mppwidth=20,mppnppn=1,mppdepth=24 > #PBS -l walltime=17:00:00 > #PBS -o > /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stdout > #PBS -e > /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stderr > WORKER_LOGGING_LEVEL=DEBUG > #PBS -v WORKER_LOGGING_LEVEL > cd / && aprun -n 20 -N 1 -cc none -d 24 -F exclusive /bin/sh -c > '/usr/bin/perl > /home/jonmon/.globus/coasters/cscript9177561070598799820.pl > http://10.128.2.243:40904,http://127.0.0.2:40904,http://192.5.86.103:40904 > 0321-221135-000000 > /lustre/beagle/jonmon/Swift/SciColSim/run163/swiftwork/beagleRes/workers' > /bin/echo $? > >/home/jonmon/.globus/scripts/PBS1332885235909759395.submit.exitcode > > > I have asked for a coaster block of 24 hours(my reservation is 96 > hours) but it shows a wall time of 17 hours. Furthermore, the line > #PBS -l advres= is missing so I am not using my reservation, I > just get added to the batch queue and sit there. Does any remember how > to specify a reservation in the sites file for PBS? 
> _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wozniak at mcs.anl.gov Wed Mar 21 22:57:57 2012 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Wed, 21 Mar 2012 22:57:57 -0500 (Central Daylight Time) Subject: [Swift-devel] using a reservation In-Reply-To: <2115817913.106554.1332388549352.JavaMail.root@zimbra.anl.gov> References: <2115817913.106554.1332388549352.JavaMail.root@zimbra.anl.gov> Message-ID: I'll take a look at this tomorrow. On Wed, 21 Mar 2012, Michael Wilde wrote: > Jon, > > Regarding the walltime, your sites file mis-spells maxwalltime; hence the jobs emitted by your script probably dont sum to anything beyond 17:00h at the 10m default time. > > I dont see why the res isnt making it through to the PBS script. > > - Mike > > > > ----- Original Message ----- >> From: "Jonathan Monette" >> To: "swift-devel at ci.uchicago.edu Devel" >> Sent: Wednesday, March 21, 2012 6:30:14 PM >> Subject: [Swift-devel] using a reservation >> Hello, >> I am trying to use a reservation I have on Beagle. 
Here is my sites >> file: >> >> >> >> >> >> >> 0.5 >> 10000 >> >> _WORK_/local >> >> >> >> >> >> >> >> >> >> KEEP >> >> >> CI-MCB000119 >> 1 >> DEBUG >> > key="workerLoggingDirectory">_WORK_/beagleRes/workers >> 100 >> 100 >> > key="providerAttributes">pbs.aprun;pbs.mpp;depth=24 >> 86400 >> 00:04:00 >> 1 >> 20 >> 20 >> >> >> > key="pbs.properties">advres=18833.687 >> >> >> 12.00 >> 10000 >> >> >> >> _WORK_/beagleRes >> >> >> >> >> >> >> >> >> I have tried both pbs.properties and pbs.resources as a sites entry, I >> got this information from >> https://sites.google.com/site/swiftdevel/sites/pbs >> However here is the pbs script that has been generated: >> >> >> >> #CoG This script generated by CoG >> #CoG by class: class >> org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor >> #CoG on date: 2012/03/21 23:22:35 >> >> >> #PBS -S /bin/bash >> #PBS -N Block-0321-2211 >> #PBS -m n >> #PBS -A CI-MCB000119 >> #PBS -l mppwidth=20,mppnppn=1,mppdepth=24 >> #PBS -l walltime=17:00:00 >> #PBS -o >> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stdout >> #PBS -e >> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stderr >> WORKER_LOGGING_LEVEL=DEBUG >> #PBS -v WORKER_LOGGING_LEVEL >> cd / && aprun -n 20 -N 1 -cc none -d 24 -F exclusive /bin/sh -c >> '/usr/bin/perl >> /home/jonmon/.globus/coasters/cscript9177561070598799820.pl >> http://10.128.2.243:40904,http://127.0.0.2:40904,http://192.5.86.103:40904 >> 0321-221135-000000 >> /lustre/beagle/jonmon/Swift/SciColSim/run163/swiftwork/beagleRes/workers' >> /bin/echo $? >>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.exitcode >> >> >> I have asked for a coaster block of 24 hours(my reservation is 96 >> hours) but it shows a wall time of 17 hours. Furthermore, the line >> #PBS -l advres= is missing so I am not using my reservation, I >> just get added to the batch queue and sit there. Does any remember how >> to specify a reservation in the sites file for PBS? 
>> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- Justin M Wozniak From jonmon at mcs.anl.gov Thu Mar 22 00:37:08 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 22 Mar 2012 00:37:08 -0500 Subject: [Swift-devel] using a reservation In-Reply-To: References: <2115817913.106554.1332388549352.JavaMail.root@zimbra.anl.gov> Message-ID: So I have been looking at this. I tried adding my own reservation key to PBSExecutor but that does not seem to work. So my question is, does this not work because the JobSpecification object does not know to look for this attribute? If so, could this be the reason why I cannot seem to get the reservation to the PBS script using pbs.properties or pbs.resources(I also tried pbs.resource_list as that is what the code looks for). Where does the JobSpecification get built? Where is the xml sites file parsed? I cannot seem to find this code. On Mar 21, 2012, at 10:57 PM, Justin M Wozniak wrote: > > I'll take a look at this tomorrow. > > On Wed, 21 Mar 2012, Michael Wilde wrote: > >> Jon, >> >> Regarding the walltime, your sites file mis-spells maxwalltime; hence the jobs emitted by your script probably dont sum to anything beyond 17:00h at the 10m default time. >> >> I dont see why the res isnt making it through to the PBS script. >> >> - Mike >> >> >> >> ----- Original Message ----- >>> From: "Jonathan Monette" >>> To: "swift-devel at ci.uchicago.edu Devel" >>> Sent: Wednesday, March 21, 2012 6:30:14 PM >>> Subject: [Swift-devel] using a reservation >>> Hello, >>> I am trying to use a reservation I have on Beagle. 
Here is my sites >>> file: >>> >>> >>> >>> >>> >>> >>> 0.5 >>> 10000 >>> >>> _WORK_/local >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> KEEP >>> >>> >>> CI-MCB000119 >>> 1 >>> DEBUG >>> >> key="workerLoggingDirectory">_WORK_/beagleRes/workers >>> 100 >>> 100 >>> >> key="providerAttributes">pbs.aprun;pbs.mpp;depth=24 >>> 86400 >>> 00:04:00 >>> 1 >>> 20 >>> 20 >>> >>> >>> >> key="pbs.properties">advres=18833.687 >>> >>> >>> 12.00 >>> 10000 >>> >>> >>> >>> _WORK_/beagleRes >>> >>> >>> >>> >>> >>> >>> >>> >>> I have tried both pbs.properties and pbs.resources as a sites entry, I >>> got this information from >>> https://sites.google.com/site/swiftdevel/sites/pbs >>> However here is the pbs script that has been generated: >>> >>> >>> >>> #CoG This script generated by CoG >>> #CoG by class: class >>> org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor >>> #CoG on date: 2012/03/21 23:22:35 >>> >>> >>> #PBS -S /bin/bash >>> #PBS -N Block-0321-2211 >>> #PBS -m n >>> #PBS -A CI-MCB000119 >>> #PBS -l mppwidth=20,mppnppn=1,mppdepth=24 >>> #PBS -l walltime=17:00:00 >>> #PBS -o >>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stdout >>> #PBS -e >>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stderr >>> WORKER_LOGGING_LEVEL=DEBUG >>> #PBS -v WORKER_LOGGING_LEVEL >>> cd / && aprun -n 20 -N 1 -cc none -d 24 -F exclusive /bin/sh -c >>> '/usr/bin/perl >>> /home/jonmon/.globus/coasters/cscript9177561070598799820.pl >>> http://10.128.2.243:40904,http://127.0.0.2:40904,http://192.5.86.103:40904 >>> 0321-221135-000000 >>> /lustre/beagle/jonmon/Swift/SciColSim/run163/swiftwork/beagleRes/workers' >>> /bin/echo $? >>>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.exitcode >>> >>> >>> I have asked for a coaster block of 24 hours(my reservation is 96 >>> hours) but it shows a wall time of 17 hours. Furthermore, the line >>> #PBS -l advres= is missing so I am not using my reservation, I >>> just get added to the batch queue and sit there. 
Does any remember how >>> to specify a reservation in the sites file for PBS? >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> > > -- > Justin M Wozniak From ketancmaheshwari at gmail.com Thu Mar 22 07:56:38 2012 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Thu, 22 Mar 2012 07:56:38 -0500 Subject: [Swift-devel] using a reservation In-Reply-To: References: <2115817913.106554.1332388549352.JavaMail.root@zimbra.anl.gov> Message-ID: Jon, Here is a sites.xml that I used for a Beagle reservation a while ago for modftdock. This worked well on swift-r4252 cog-r3088. See if it helps at all comparing yours and this one: ========= CI-CCR000013 24 24 16000 100 100 pbs.aprun;pbs.mpp;pbs.resource_list=advres=modFTDock.47 25 2 2 13.00 10000 /lustre/beagle/ketan/labs/modftdock/bgl.reserved.run/swift.workdir ========= On Thu, Mar 22, 2012 at 12:37 AM, Jonathan Monette wrote: > So I have been looking at this. I tried adding my own reservation key to > PBSExecutor but that does not seem to work. So my question is, does this > not work because the JobSpecification object does not know to look for this > attribute? If so, could this be the reason why I cannot seem to get the > reservation to the PBS script using pbs.properties or pbs.resources(I also > tried pbs.resource_list as that is what the code looks for). Where does > the JobSpecification get built? Where is the xml sites file parsed? I > cannot seem to find this code. > > On Mar 21, 2012, at 10:57 PM, Justin M Wozniak wrote: > > > > > I'll take a look at this tomorrow. > > > > On Wed, 21 Mar 2012, Michael Wilde wrote: > > > >> Jon, > >> > >> Regarding the walltime, your sites file mis-spells maxwalltime; hence > the jobs emitted by your script probably dont sum to anything beyond 17:00h > at the 10m default time. 
> >> > >> I dont see why the res isnt making it through to the PBS script. > >> > >> - Mike > >> > >> > >> > >> ----- Original Message ----- > >>> From: "Jonathan Monette" > >>> To: "swift-devel at ci.uchicago.edu Devel" > >>> Sent: Wednesday, March 21, 2012 6:30:14 PM > >>> Subject: [Swift-devel] using a reservation > >>> Hello, > >>> I am trying to use a reservation I have on Beagle. Here is my sites > >>> file: > >>> > >>> > >>> > >>> > >>> > >>> > >>> 0.5 > >>> 10000 > >>> > >>> _WORK_/local > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> KEEP > >>> > >>> > >>> CI-MCB000119 > >>> 1 > >>> DEBUG > >>> >>> key="workerLoggingDirectory">_WORK_/beagleRes/workers > >>> 100 > >>> 100 > >>> >>> key="providerAttributes">pbs.aprun;pbs.mpp;depth=24 > >>> 86400 > >>> 00:04:00 > >>> 1 > >>> 20 > >>> 20 > >>> > >>> > >>> >>> key="pbs.properties">advres=18833.687 > >>> > >>> > >>> 12.00 > >>> 10000 > >>> > >>> > >>> > >>> _WORK_/beagleRes > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> I have tried both pbs.properties and pbs.resources as a sites entry, I > >>> got this information from > >>> https://sites.google.com/site/swiftdevel/sites/pbs > >>> However here is the pbs script that has been generated: > >>> > >>> > >>> > >>> #CoG This script generated by CoG > >>> #CoG by class: class > >>> org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor > >>> #CoG on date: 2012/03/21 23:22:35 > >>> > >>> > >>> #PBS -S /bin/bash > >>> #PBS -N Block-0321-2211 > >>> #PBS -m n > >>> #PBS -A CI-MCB000119 > >>> #PBS -l mppwidth=20,mppnppn=1,mppdepth=24 > >>> #PBS -l walltime=17:00:00 > >>> #PBS -o > >>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stdout > >>> #PBS -e > >>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stderr > >>> WORKER_LOGGING_LEVEL=DEBUG > >>> #PBS -v WORKER_LOGGING_LEVEL > >>> cd / && aprun -n 20 -N 1 -cc none -d 24 -F exclusive /bin/sh -c > >>> '/usr/bin/perl > >>> 
/home/jonmon/.globus/coasters/cscript9177561070598799820.pl > >>> http://10.128.2.243:40904,http://127.0.0.2:40904, > http://192.5.86.103:40904 > >>> 0321-221135-000000 > >>> > /lustre/beagle/jonmon/Swift/SciColSim/run163/swiftwork/beagleRes/workers' > >>> /bin/echo $? > >>>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.exitcode > >>> > >>> > >>> I have asked for a coaster block of 24 hours(my reservation is 96 > >>> hours) but it shows a wall time of 17 hours. Furthermore, the line > >>> #PBS -l advres= is missing so I am not using my reservation, I > >>> just get added to the batch queue and sit there. Does any remember how > >>> to specify a reservation in the sites file for PBS? > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >> > >> > > > > -- > > Justin M Wozniak > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Mar 22 08:03:29 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 22 Mar 2012 08:03:29 -0500 (CDT) Subject: [Swift-devel] using a reservation In-Reply-To: Message-ID: <838650131.106803.1332421409483.JavaMail.root@zimbra.anl.gov> Thanks, Ketan. We should match this against the 0.93 & trunk provider codes: pbs.aprun;pbs.mpp;pbs.resource_list=advres=modFTDock.47 I was wondering if in fact pbs.properties needed to be specified in the manner you show here (above). Im wondering if pbs.resource_list was changed in a revision after you ran this? 
- Mike ----- Original Message ----- > From: "Ketan Maheshwari" > To: "Jonathan Monette" > Cc: "swift-devel at ci.uchicago.edu Devel" > Sent: Thursday, March 22, 2012 7:56:38 AM > Subject: Re: [Swift-devel] using a reservation > Jon, > > > Here is a sites.xml that I used for a Beagle reservation a while ago > for modftdock. This worked well on swift-r4252 cog-r3088. See if it > helps at all comparing yours and this one: > > ========= > > > > > > > CI-CCR000013 > 24 > 24 > 16000 > 100 > 100 > > pbs.aprun;pbs.mpp;pbs.resource_list=advres=modFTDock.47 > > 25 > 2 > 2 > 13.00 > 10000 > > /lustre/beagle/ketan/labs/modftdock/bgl.reserved.run/swift.workdir > > > > > > > > > > > ========= > > > > > On Thu, Mar 22, 2012 at 12:37 AM, Jonathan Monette < > jonmon at mcs.anl.gov > wrote: > > > So I have been looking at this. I tried adding my own reservation key > to PBSExecutor but that does not seem to work. So my question is, does > this not work because the JobSpecification object does not know to > look for this attribute? If so, could this be the reason why I cannot > seem to get the reservation to the PBS script using pbs.properties or > pbs.resources(I also tried pbs.resource_list as that is what the code > looks for). Where does the JobSpecification get built? Where is the > xml sites file parsed? I cannot seem to find this code. > > > > > On Mar 21, 2012, at 10:57 PM, Justin M Wozniak wrote: > > > > > I'll take a look at this tomorrow. > > > > On Wed, 21 Mar 2012, Michael Wilde wrote: > > > >> Jon, > >> > >> Regarding the walltime, your sites file mis-spells maxwalltime; > >> hence the jobs emitted by your script probably dont sum to anything > >> beyond 17:00h at the 10m default time. > >> > >> I dont see why the res isnt making it through to the PBS script. 
> >> > >> - Mike > >> > >> > >> > >> ----- Original Message ----- > >>> From: "Jonathan Monette" < jonmon at mcs.anl.gov > > >>> To: " swift-devel at ci.uchicago.edu Devel" < > >>> swift-devel at ci.uchicago.edu > > >>> Sent: Wednesday, March 21, 2012 6:30:14 PM > >>> Subject: [Swift-devel] using a reservation > >>> Hello, > >>> I am trying to use a reservation I have on Beagle. Here is my > >>> sites > >>> file: > >>> > >>> > >>> > >>> > >>> > >>> > >>> 0.5 > >>> 10000 > >>> > >>> _WORK_/local > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> KEEP > >>> > >>> > >>> CI-MCB000119 > >>> 1 > >>> >>> key="workerLoggingLevel">DEBUG > >>> >>> key="workerLoggingDirectory">_WORK_/beagleRes/workers > >>> 100 > >>> 100 > >>> >>> key="providerAttributes">pbs.aprun;pbs.mpp;depth=24 > >>> 86400 > >>> 00:04:00 > >>> 1 > >>> 20 > >>> 20 > >>> > >>> > >>> >>> key="pbs.properties">advres=18833.687 > >>> > >>> > >>> 12.00 > >>> 10000 > >>> > >>> > >>> > >>> _WORK_/beagleRes > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> I have tried both pbs.properties and pbs.resources as a sites > >>> entry, I > >>> got this information from > >>> https://sites.google.com/site/swiftdevel/sites/pbs > >>> However here is the pbs script that has been generated: > >>> > >>> > >>> > >>> #CoG This script generated by CoG > >>> #CoG by class: class > >>> org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor > >>> #CoG on date: 2012/03/21 23:22:35 > >>> > >>> > >>> #PBS -S /bin/bash > >>> #PBS -N Block-0321-2211 > >>> #PBS -m n > >>> #PBS -A CI-MCB000119 > >>> #PBS -l mppwidth=20,mppnppn=1,mppdepth=24 > >>> #PBS -l walltime=17:00:00 > >>> #PBS -o > >>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stdout > >>> #PBS -e > >>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stderr > >>> WORKER_LOGGING_LEVEL=DEBUG > >>> #PBS -v WORKER_LOGGING_LEVEL > >>> cd / && aprun -n 20 -N 1 -cc none -d 24 -F exclusive /bin/sh -c > >>> '/usr/bin/perl > >>> 
/home/jonmon/.globus/coasters/ cscript9177561070598799820.pl > >>> http://10.128.2.243:40904 , http://127.0.0.2:40904 , > >>> http://192.5.86.103:40904 > >>> 0321-221135-000000 > >>> /lustre/beagle/jonmon/Swift/SciColSim/run163/swiftwork/beagleRes/workers' > >>> /bin/echo $? > >>>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.exitcode > >>> > >>> > >>> I have asked for a coaster block of 24 hours(my reservation is 96 > >>> hours) but it shows a wall time of 17 hours. Furthermore, the line > >>> #PBS -l advres= is missing so I am not using my > >>> reservation, I > >>> just get added to the batch queue and sit there. Does any remember > >>> how > >>> to specify a reservation in the sites file for PBS? > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >> > >> > > > > -- > > Justin M Wozniak > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > -- > Ketan > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Thu Mar 22 08:38:11 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 22 Mar 2012 08:38:11 -0500 Subject: [Swift-devel] using a reservation In-Reply-To: <838650131.106803.1332421409483.JavaMail.root@zimbra.anl.gov> References: <838650131.106803.1332421409483.JavaMail.root@zimbra.anl.gov> Message-ID: <3E3CA428-7117-4637-A72F-5BC0B62151C9@mcs.anl.gov> Using pbs.resource_list in the providerAttributes key works. The Google site that I was using does not specify this.
It seems to lead you to using advres=res_id instead. On Mar 22, 2012, at 8:03 AM, Michael Wilde wrote: > Thanks, Ketan. > > We should match this against the 0.93 & trunk provider codes: > > pbs.aprun;pbs.mpp;pbs.resource_list=advres=modFTDock.47 > > I was wondering if in fact pbs.properties needed to be specified in the manner you show here (above). Im wondering if pbs.resource_list was changed in a revision after you ran this? > > - Mike > > > ----- Original Message ----- >> From: "Ketan Maheshwari" >> To: "Jonathan Monette" >> Cc: "swift-devel at ci.uchicago.edu Devel" >> Sent: Thursday, March 22, 2012 7:56:38 AM >> Subject: Re: [Swift-devel] using a reservation >> Jon, >> >> >> Here is a sites.xml that I used for a Beagle reservation a while ago >> for modftdock. This worked well on swift-r4252 cog-r3088. See if it >> helps at all comparing yours and this one: >> >> ========= >> >> >> >> >> >> >> CI-CCR000013 >> 24 >> 24 >> 16000 >> 100 >> 100 >> >> pbs.aprun;pbs.mpp;pbs.resource_list=advres=modFTDock.47 >> >> 25 >> 2 >> 2 >> 13.00 >> 10000 >> >> /lustre/beagle/ketan/labs/modftdock/bgl.reserved.run/swift.workdir >> >> >> >> >> >> >> >> >> >> >> ========= >> >> >> >> >> On Thu, Mar 22, 2012 at 12:37 AM, Jonathan Monette < >> jonmon at mcs.anl.gov > wrote: >> >> >> So I have been looking at this. I tried adding my own reservation key >> to PBSExecutor but that does not seem to work. So my question is, does >> this not work because the JobSpecification object does not know to >> look for this attribute? If so, could this be the reason why I cannot >> seem to get the reservation to the PBS script using pbs.properties or >> pbs.resources(I also tried pbs.resource_list as that is what the code >> looks for). Where does the JobSpecification get built? Where is the >> xml sites file parsed? I cannot seem to find this code. >> >> >> >> >> On Mar 21, 2012, at 10:57 PM, Justin M Wozniak wrote: >> >>> >>> I'll take a look at this tomorrow. 
>>> >>> On Wed, 21 Mar 2012, Michael Wilde wrote: >>> >>>> Jon, >>>> >>>> Regarding the walltime, your sites file mis-spells maxwalltime; >>>> hence the jobs emitted by your script probably dont sum to anything >>>> beyond 17:00h at the 10m default time. >>>> >>>> I dont see why the res isnt making it through to the PBS script. >>>> >>>> - Mike >>>> >>>> >>>> >>>> ----- Original Message ----- >>>>> From: "Jonathan Monette" < jonmon at mcs.anl.gov > >>>>> To: " swift-devel at ci.uchicago.edu Devel" < >>>>> swift-devel at ci.uchicago.edu > >>>>> Sent: Wednesday, March 21, 2012 6:30:14 PM >>>>> Subject: [Swift-devel] using a reservation >>>>> Hello, >>>>> I am trying to use a reservation I have on Beagle. Here is my >>>>> sites >>>>> file: >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> 0.5 >>>>> 10000 >>>>> >>>>> _WORK_/local >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> KEEP >>>>> >>>>> >>>>> CI-MCB000119 >>>>> 1 >>>>> >>>> key="workerLoggingLevel">DEBUG >>>>> >>>> key="workerLoggingDirectory">_WORK_/beagleRes/workers >>>>> 100 >>>>> 100 >>>>> >>>> key="providerAttributes">pbs.aprun;pbs.mpp;depth=24 >>>>> 86400 >>>>> 00:04:00 >>>>> 1 >>>>> 20 >>>>> 20 >>>>> >>>>> >>>>> >>>> key="pbs.properties">advres=18833.687 >>>>> >>>>> >>>>> 12.00 >>>>> 10000 >>>>> >>>>> >>>>> >>>>> _WORK_/beagleRes >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> I have tried both pbs.properties and pbs.resources as a sites >>>>> entry, I >>>>> got this information from >>>>> https://sites.google.com/site/swiftdevel/sites/pbs >>>>> However here is the pbs script that has been generated: >>>>> >>>>> >>>>> >>>>> #CoG This script generated by CoG >>>>> #CoG by class: class >>>>> org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor >>>>> #CoG on date: 2012/03/21 23:22:35 >>>>> >>>>> >>>>> #PBS -S /bin/bash >>>>> #PBS -N Block-0321-2211 >>>>> #PBS -m n >>>>> #PBS -A CI-MCB000119 >>>>> #PBS -l mppwidth=20,mppnppn=1,mppdepth=24 >>>>> #PBS -l walltime=17:00:00 >>>>> #PBS -o >>>>> 
/home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stdout >>>>> #PBS -e >>>>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stderr >>>>> WORKER_LOGGING_LEVEL=DEBUG >>>>> #PBS -v WORKER_LOGGING_LEVEL >>>>> cd / && aprun -n 20 -N 1 -cc none -d 24 -F exclusive /bin/sh -c >>>>> '/usr/bin/perl >>>>> /home/jonmon/.globus/coasters/ cscript9177561070598799820.pl >>>>> http://10.128.2.243:40904 , http://127.0.0.2:40904 , >>>>> http://192.5.86.103:40904 >>>>> 0321-221135-000000 >>>>> /lustre/beagle/jonmon/Swift/SciColSim/run163/swiftwork/beagleRes/workers' >>>>> /bin/echo $? >>>>>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.exitcode >>>>> >>>>> >>>>> I have asked for a coaster block of 24 hours(my reservation is 96 >>>>> hours) but it shows a wall time of 17 hours. Furthermore, the line >>>>> #PBS -l advres= is missing so I am not using my >>>>> reservation, I >>>>> just get added to the batch queue and sit there. Does any remember >>>>> how >>>>> to specify a reservation in the sites file for PBS? 
>>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>> >>>> >>> >>> -- >>> Justin M Wozniak >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> >> >> >> -- >> Ketan >> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From wozniak at mcs.anl.gov Thu Mar 22 13:47:16 2012 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Thu, 22 Mar 2012 13:47:16 -0500 (CDT) Subject: [Swift-devel] using a reservation In-Reply-To: <3E3CA428-7117-4637-A72F-5BC0B62151C9@mcs.anl.gov> References: <838650131.106803.1332421409483.JavaMail.root@zimbra.anl.gov> <3E3CA428-7117-4637-A72F-5BC0B62151C9@mcs.anl.gov> Message-ID: The issue is that Coasters does not know about PBS-specific settings. The providerAttributes profile allows you to pack multiple scheduler-specific settings into an opaque string; Coasters unpacks and passes these through to the underlying scheduler. Cf. https://sites.google.com/site/swiftdevel/internals/providers/coasters-provider advres=res_id This should work if you use PBS directly. Justin On Thu, 22 Mar 2012, Jonathan Monette wrote: > Using pbs.resource_list in the in the providerAttributes key works. > The google site that I was using does not specify this. It seems to > lead you to using key="pbs.resource_list">advres=res_id instead. > > On Mar 22, 2012, at 8:03 AM, Michael Wilde wrote: > >> Thanks, Ketan. 
>> >> We should match this against the 0.93 & trunk provider codes: >> >> pbs.aprun;pbs.mpp;pbs.resource_list=advres=modFTDock.47 >> >> I was wondering if in fact pbs.properties needed to be specified in the >> manner you show here (above). Im wondering if pbs.resource_list was >> changed in a revision after you ran this? >> >> - Mike >> >> >> ----- Original Message ----- >>> From: "Ketan Maheshwari" >>> To: "Jonathan Monette" >>> Cc: "swift-devel at ci.uchicago.edu Devel" >>> Sent: Thursday, March 22, 2012 7:56:38 AM >>> Subject: Re: [Swift-devel] using a reservation >>> Jon, >>> >>> >>> Here is a sites.xml that I used for a Beagle reservation a while ago >>> for modftdock. This worked well on swift-r4252 cog-r3088. See if it >>> helps at all comparing yours and this one: >>> >>> ========= >>> >>> >>> >>> >>> >>> >>> CI-CCR000013 >>> 24 >>> 24 >>> 16000 >>> 100 >>> 100 >>> >>> pbs.aprun;pbs.mpp;pbs.resource_list=advres=modFTDock.47 >>> >>> 25 >>> 2 >>> 2 >>> 13.00 >>> 10000 >>> >>> /lustre/beagle/ketan/labs/modftdock/bgl.reserved.run/swift.workdir >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> ========= >>> >>> >>> >>> >>> On Thu, Mar 22, 2012 at 12:37 AM, Jonathan Monette < >>> jonmon at mcs.anl.gov > wrote: >>> >>> >>> So I have been looking at this. I tried adding my own reservation key >>> to PBSExecutor but that does not seem to work. So my question is, does >>> this not work because the JobSpecification object does not know to >>> look for this attribute? If so, could this be the reason why I cannot >>> seem to get the reservation to the PBS script using pbs.properties or >>> pbs.resources(I also tried pbs.resource_list as that is what the code >>> looks for). Where does the JobSpecification get built? Where is the >>> xml sites file parsed? I cannot seem to find this code. >>> >>> >>> >>> >>> On Mar 21, 2012, at 10:57 PM, Justin M Wozniak wrote: >>> >>>> >>>> I'll take a look at this tomorrow. 
>>>> >>>> On Wed, 21 Mar 2012, Michael Wilde wrote: >>>> >>>>> Jon, >>>>> >>>>> Regarding the walltime, your sites file mis-spells maxwalltime; >>>>> hence the jobs emitted by your script probably dont sum to anything >>>>> beyond 17:00h at the 10m default time. >>>>> >>>>> I dont see why the res isnt making it through to the PBS script. >>>>> >>>>> - Mike >>>>> >>>>> >>>>> >>>>> ----- Original Message ----- >>>>>> From: "Jonathan Monette" < jonmon at mcs.anl.gov > >>>>>> To: " swift-devel at ci.uchicago.edu Devel" < >>>>>> swift-devel at ci.uchicago.edu > >>>>>> Sent: Wednesday, March 21, 2012 6:30:14 PM >>>>>> Subject: [Swift-devel] using a reservation >>>>>> Hello, >>>>>> I am trying to use a reservation I have on Beagle. Here is my >>>>>> sites >>>>>> file: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> 0.5 >>>>>> 10000 >>>>>> >>>>>> _WORK_/local >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> KEEP >>>>>> >>>>>> >>>>>> CI-MCB000119 >>>>>> 1 >>>>>> >>>>> key="workerLoggingLevel">DEBUG >>>>>> >>>>> key="workerLoggingDirectory">_WORK_/beagleRes/workers >>>>>> 100 >>>>>> 100 >>>>>> >>>>> key="providerAttributes">pbs.aprun;pbs.mpp;depth=24 >>>>>> 86400 >>>>>> 00:04:00 >>>>>> 1 >>>>>> 20 >>>>>> 20 >>>>>> >>>>>> >>>>>> >>>>> key="pbs.properties">advres=18833.687 >>>>>> >>>>>> >>>>>> 12.00 >>>>>> 10000 >>>>>> >>>>>> >>>>>> >>>>>> _WORK_/beagleRes >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> I have tried both pbs.properties and pbs.resources as a sites >>>>>> entry, I >>>>>> got this information from >>>>>> https://sites.google.com/site/swiftdevel/sites/pbs >>>>>> However here is the pbs script that has been generated: >>>>>> >>>>>> >>>>>> >>>>>> #CoG This script generated by CoG >>>>>> #CoG by class: class >>>>>> org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor >>>>>> #CoG on date: 2012/03/21 23:22:35 >>>>>> >>>>>> >>>>>> #PBS -S /bin/bash >>>>>> #PBS -N Block-0321-2211 >>>>>> #PBS -m n >>>>>> #PBS -A CI-MCB000119 
>>>>>> #PBS -l mppwidth=20,mppnppn=1,mppdepth=24 >>>>>> #PBS -l walltime=17:00:00 >>>>>> #PBS -o >>>>>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stdout >>>>>> #PBS -e >>>>>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stderr >>>>>> WORKER_LOGGING_LEVEL=DEBUG >>>>>> #PBS -v WORKER_LOGGING_LEVEL >>>>>> cd / && aprun -n 20 -N 1 -cc none -d 24 -F exclusive /bin/sh -c >>>>>> '/usr/bin/perl >>>>>> /home/jonmon/.globus/coasters/ cscript9177561070598799820.pl >>>>>> http://10.128.2.243:40904 , http://127.0.0.2:40904 , >>>>>> http://192.5.86.103:40904 >>>>>> 0321-221135-000000 >>>>>> /lustre/beagle/jonmon/Swift/SciColSim/run163/swiftwork/beagleRes/workers' >>>>>> /bin/echo $? >>>>>>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.exitcode >>>>>> >>>>>> >>>>>> I have asked for a coaster block of 24 hours(my reservation is 96 >>>>>> hours) but it shows a wall time of 17 hours. Furthermore, the line >>>>>> #PBS -l advres= is missing so I am not using my >>>>>> reservation, I >>>>>> just get added to the batch queue and sit there. Does any remember >>>>>> how >>>>>> to specify a reservation in the sites file for PBS? 
>>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>> >>>>> >>>> >>>> -- >>>> Justin M Wozniak >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>> >>> >>> >>> >>> -- >>> Ketan >>> >>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Justin M Wozniak From wozniak at mcs.anl.gov Thu Mar 22 13:54:58 2012 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Thu, 22 Mar 2012 13:54:58 -0500 (CDT) Subject: [Swift-devel] Issues with Montage & Swift-CDM In-Reply-To: <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> Message-ID: On Wed, 21 Mar 2012, Emalayan Vairavanathan wrote: > ??? But I just setup everything on Surveyor and it works locally on the head node. You can find the setup here. > ??/home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts What is the entry point? Are we missing common.sh? 
-- Justin M Wozniak

From jonmon at mcs.anl.gov Thu Mar 22 13:54:55 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 22 Mar 2012 13:54:55 -0500
Subject: [Swift-devel] using a reservation
In-Reply-To:
References: <838650131.106803.1332421409483.JavaMail.root@zimbra.anl.gov> <3E3CA428-7117-4637-A72F-5BC0B62151C9@mcs.anl.gov>
Message-ID: <9152E697-A84D-4321-B6A4-36797E278B0C@mcs.anl.gov>

I came across this while reading the comments in the JobSpecification source. Using the providerAttributes key that Ketan used in his example worked. I will add information to the swiftdevel PBS page regarding this and point to the coaster page for more information. Thanks Justin.

On Mar 22, 2012, at 13:47, Justin M Wozniak wrote:

>
> The issue is that Coasters does not know about PBS-specific settings. The providerAttributes profile allows you to pack multiple scheduler-specific settings into an opaque string; Coasters unpacks and passes these through to the underlying scheduler.
>
> Cf. https://sites.google.com/site/swiftdevel/internals/providers/coasters-provider
>
> advres=res_id
>
> This should work if you use PBS directly.
>
> Justin
>
> On Thu, 22 Mar 2012, Jonathan Monette wrote:
>
>> Using pbs.resource_list in the in the providerAttributes key works. The google site that I was using does not specify this. It seems to lead you to using advres=res_id instead.
>>
>> On Mar 22, 2012, at 8:03 AM, Michael Wilde wrote:
>>
>>> Thanks, Ketan.
>>>
>>> We should match this against the 0.93 & trunk provider codes:
>>>
>>> pbs.aprun;pbs.mpp;pbs.resource_list=advres=modFTDock.47
>>>
>>> I was wondering if in fact pbs.properties needed to be specified in the manner you show here (above). Im wondering if pbs.resource_list was changed in a revision after you ran this?
>>> >>> - Mike >>> >>> >>> ----- Original Message ----- >>>> From: "Ketan Maheshwari" >>>> To: "Jonathan Monette" >>>> Cc: "swift-devel at ci.uchicago.edu Devel" >>>> Sent: Thursday, March 22, 2012 7:56:38 AM >>>> Subject: Re: [Swift-devel] using a reservation >>>> Jon, >>>> >>>> >>>> Here is a sites.xml that I used for a Beagle reservation a while ago >>>> for modftdock. This worked well on swift-r4252 cog-r3088. See if it >>>> helps at all comparing yours and this one: >>>> >>>> ========= >>>> >>>> >>>> >>>> >>>> >>>> >>>> CI-CCR000013 >>>> 24 >>>> 24 >>>> 16000 >>>> 100 >>>> 100 >>>> >>>> pbs.aprun;pbs.mpp;pbs.resource_list=advres=modFTDock.47 >>>> >>>> 25 >>>> 2 >>>> 2 >>>> 13.00 >>>> 10000 >>>> >>>> /lustre/beagle/ketan/labs/modftdock/bgl.reserved.run/swift.workdir >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> ========= >>>> >>>> >>>> >>>> >>>> On Thu, Mar 22, 2012 at 12:37 AM, Jonathan Monette < >>>> jonmon at mcs.anl.gov > wrote: >>>> >>>> >>>> So I have been looking at this. I tried adding my own reservation key >>>> to PBSExecutor but that does not seem to work. So my question is, does >>>> this not work because the JobSpecification object does not know to >>>> look for this attribute? If so, could this be the reason why I cannot >>>> seem to get the reservation to the PBS script using pbs.properties or >>>> pbs.resources(I also tried pbs.resource_list as that is what the code >>>> looks for). Where does the JobSpecification get built? Where is the >>>> xml sites file parsed? I cannot seem to find this code. >>>> >>>> >>>> >>>> >>>> On Mar 21, 2012, at 10:57 PM, Justin M Wozniak wrote: >>>> >>>>> >>>>> I'll take a look at this tomorrow. >>>>> >>>>> On Wed, 21 Mar 2012, Michael Wilde wrote: >>>>> >>>>>> Jon, >>>>>> >>>>>> Regarding the walltime, your sites file mis-spells maxwalltime; >>>>>> hence the jobs emitted by your script probably dont sum to anything >>>>>> beyond 17:00h at the 10m default time. 
>>>>>> >>>>>> I dont see why the res isnt making it through to the PBS script. >>>>>> >>>>>> - Mike >>>>>> >>>>>> >>>>>> >>>>>> ----- Original Message ----- >>>>>>> From: "Jonathan Monette" < jonmon at mcs.anl.gov > >>>>>>> To: " swift-devel at ci.uchicago.edu Devel" < >>>>>>> swift-devel at ci.uchicago.edu > >>>>>>> Sent: Wednesday, March 21, 2012 6:30:14 PM >>>>>>> Subject: [Swift-devel] using a reservation >>>>>>> Hello, >>>>>>> I am trying to use a reservation I have on Beagle. Here is my >>>>>>> sites >>>>>>> file: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> 0.5 >>>>>>> 10000 >>>>>>> >>>>>>> _WORK_/local >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> KEEP >>>>>>> >>>>>>> >>>>>>> CI-MCB000119 >>>>>>> 1 >>>>>>> >>>>>> key="workerLoggingLevel">DEBUG >>>>>>> >>>>>> key="workerLoggingDirectory">_WORK_/beagleRes/workers >>>>>>> 100 >>>>>>> 100 >>>>>>> >>>>>> key="providerAttributes">pbs.aprun;pbs.mpp;depth=24 >>>>>>> 86400 >>>>>>> 00:04:00 >>>>>>> 1 >>>>>>> 20 >>>>>>> 20 >>>>>>> >>>>>>> >>>>>>> >>>>>> key="pbs.properties">advres=18833.687 >>>>>>> >>>>>>> >>>>>>> 12.00 >>>>>>> 10000 >>>>>>> >>>>>>> >>>>>>> >>>>>>> _WORK_/beagleRes >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> I have tried both pbs.properties and pbs.resources as a sites >>>>>>> entry, I >>>>>>> got this information from >>>>>>> https://sites.google.com/site/swiftdevel/sites/pbs >>>>>>> However here is the pbs script that has been generated: >>>>>>> >>>>>>> >>>>>>> >>>>>>> #CoG This script generated by CoG >>>>>>> #CoG by class: class >>>>>>> org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor >>>>>>> #CoG on date: 2012/03/21 23:22:35 >>>>>>> >>>>>>> >>>>>>> #PBS -S /bin/bash >>>>>>> #PBS -N Block-0321-2211 >>>>>>> #PBS -m n >>>>>>> #PBS -A CI-MCB000119 >>>>>>> #PBS -l mppwidth=20,mppnppn=1,mppdepth=24 >>>>>>> #PBS -l walltime=17:00:00 >>>>>>> #PBS -o >>>>>>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stdout 
>>>>>>> #PBS -e >>>>>>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.stderr >>>>>>> WORKER_LOGGING_LEVEL=DEBUG >>>>>>> #PBS -v WORKER_LOGGING_LEVEL >>>>>>> cd / && aprun -n 20 -N 1 -cc none -d 24 -F exclusive /bin/sh -c >>>>>>> '/usr/bin/perl >>>>>>> /home/jonmon/.globus/coasters/ cscript9177561070598799820.pl >>>>>>> http://10.128.2.243:40904 , http://127.0.0.2:40904 , >>>>>>> http://192.5.86.103:40904 >>>>>>> 0321-221135-000000 >>>>>>> /lustre/beagle/jonmon/Swift/SciColSim/run163/swiftwork/beagleRes/workers' >>>>>>> /bin/echo $? >>>>>>>> /home/jonmon/.globus/scripts/PBS1332885235909759395.submit.exitcode >>>>>>> >>>>>>> >>>>>>> I have asked for a coaster block of 24 hours(my reservation is 96 >>>>>>> hours) but it shows a wall time of 17 hours. Furthermore, the line >>>>>>> #PBS -l advres= is missing so I am not using my >>>>>>> reservation, I >>>>>>> just get added to the batch queue and sit there. Does any remember >>>>>>> how >>>>>>> to specify a reservation in the sites file for PBS? 
>>>>>>> _______________________________________________ >>>>>>> Swift-devel mailing list >>>>>>> Swift-devel at ci.uchicago.edu >>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>>> >>>>>> >>>>> >>>>> -- >>>>> Justin M Wozniak >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>> >>>> >>>> >>>> >>>> -- >>>> Ketan >>>> >>>> >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>> >>> -- >>> Michael Wilde >>> Computation Institute, University of Chicago >>> Mathematics and Computer Science Division >>> Argonne National Laboratory >>> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> > > -- > Justin M Wozniak From svemalayan at yahoo.com Thu Mar 22 14:35:19 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Thu, 22 Mar 2012 12:35:19 -0700 (PDT) Subject: [Swift-devel] Issues with Montage & Swift-CDM In-Reply-To: References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> Message-ID: <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com> Hi Justin, Please use ./run_local.sh to run the montage without cdm locally on the headnode. The rest of the scripts (run-workers.sh, run.sh, run-swift.sh, main.sh) are written to run experiments in our cluster and wont work in Surveyor. Please let me know if you have questions. Thank you Emalayan???? 
________________________________
From: Justin M Wozniak
To: Emalayan Vairavanathan
Cc: "swift-devel at ci.uchicago.edu" ; MosaStore ; matei
Sent: Thursday, 22 March 2012 11:54 AM
Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM

On Wed, 21 Mar 2012, Emalayan Vairavanathan wrote:

> But I just setup everything on Surveyor and it works locally on the head node. You can find the setup here.
> /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts

What is the entry point?

Are we missing common.sh?

-- Justin M Wozniak
-------------- next part --------------
An HTML attachment was scrubbed...

From jonmon at mcs.anl.gov Thu Mar 22 14:47:26 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 22 Mar 2012 14:47:26 -0500
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To: <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com>
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID:

So here is an old cdm file I used when running this on PADS:

rule .*raw_dir/.* DIRECT /gpfs/pads/swift/jonmon/Swift/SwiftMontage/big
rule .*proj_dir/.* DIRECT /gpfs/pads/swift/jonmon/Swift/SwiftMontage/big/run.0012
rule .*diff_dir/.* DIRECT /gpfs/pads/swift/jonmon/Swift/SwiftMontage/big/run.0012
rule .*stat_dir/.* DIRECT /gpfs/pads/swift/jonmon/Swift/SwiftMontage/big/run.0012
rule .*rect_dir/.* DIRECT /gpfs/pads/swift/jonmon/Swift/SwiftMontage/big/run.0012

The one Emalayan is using is:

rule .*raw_dir.* DIRECT /home/emalayan/App/montage-swift/SwiftMontage/scripts
rule .*final.* DIRECT /home/emalayan/App/montage-swift/SwiftMontage/scripts
rule .*.* DIRECT /tmp/mosa

I explicitly used DIRECT on the different intermediate directories (proj_dir, diff_dir, stat_dir, and rect_dir). In yours, however, you match anything that is not raw_dir or final and put it into mosa. I would suggest trying something like mine, where you explicitly map those intermediate directories to /tmp/mosa.

On Mar 22, 2012, at 2:35 PM, Emalayan Vairavanathan wrote:

> Hi Justin,
>
> Please use ./run_local.sh to run the montage without cdm locally on the headnode.
>
> The rest of the scripts (run-workers.sh, run.sh, run-swift.sh, main.sh) are written to run experiments in our cluster and wont work in Surveyor.
>
> Please let me know if you have questions.
>
> Thank you
> Emalayan
> From: Justin M Wozniak
> To: Emalayan Vairavanathan
> Cc: "swift-devel at ci.uchicago.edu" ; MosaStore ; matei
> Sent: Thursday, 22 March 2012 11:54 AM
> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
>
> On Wed, 21 Mar 2012, Emalayan Vairavanathan wrote:
>
> > But I just setup everything on Surveyor and it works locally on the head node. You can find the setup here.
> > /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
>
> What is the entry point?
>
> Are we missing common.sh?
>
> -- Justin M Wozniak
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
-------------- next part --------------
An HTML attachment was scrubbed...
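The difference between the two rule files in the message above comes down to how a path is resolved against the rule list. The following Python sketch models that lookup for the three-rule file that ends in a catch-all. It is only an illustration, not Swift's CDM code: it assumes rules are tried top to bottom, that the first matching pattern wins, and that a pattern must match the whole path. None of this is confirmed in the thread. The `resolve` helper and the example file names are hypothetical.

```python
import re

# The three-rule file quoted in the thread: two explicit rules, then a catch-all.
RULES = [
    (r".*raw_dir.*", "DIRECT", "/home/emalayan/App/montage-swift/SwiftMontage/scripts"),
    (r".*final.*", "DIRECT", "/home/emalayan/App/montage-swift/SwiftMontage/scripts"),
    (r".*.*", "DIRECT", "/tmp/mosa"),
]


def resolve(path, rules):
    """Return (policy, target) for the first rule whose pattern matches path.

    ASSUMPTION: first match wins and the pattern must cover the whole path.
    Returns None when no rule matches.
    """
    for pattern, policy, target in rules:
        if re.fullmatch(pattern, path):
            return policy, target
    return None


if __name__ == "__main__":
    # Hypothetical file names, chosen only to exercise each rule.
    for p in ["raw_dir/img_001.fits", "proj_dir/proj_001.fits",
              "final/mosaic.fits", "header.hdr"]:
        print(p, "->", resolve(p, RULES))
```

Under these assumptions every path outside raw_dir and final falls through to the catch-all rule and is mapped to /tmp/mosa. That includes the intermediate directories, but also anything else the workflow touches, such as header.hdr, which is one reason to prefer explicit rules for the intermediate directories.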

From wozniak at mcs.anl.gov Thu Mar 22 15:42:40 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Thu, 22 Mar 2012 15:42:40 -0500 (CDT)
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To: <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com>
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID:

Ok, I can get it started but I get:

2012-03-22 20:24:06,048+0000 INFO SetFieldValue Set: projected_images[0]=null
2012-03-22 20:24:06,049+0000 INFO SetFieldValue Set: projected_images[6]=null
2012-03-22 20:24:06,050+0000 INFO SetFieldValue Set: projected_images[7]=null

resulting in:

File not found: /gpfs/home/wozniak/SwiftMontage/scripts/./proj_dir/null

This looks like a Swift bug. However, do you guys have an existing workaround?

Thanks

On Thu, 22 Mar 2012, Emalayan Vairavanathan wrote:

> Hi Justin,
>
> Please use ./run_local.sh to run the montage without cdm locally on the headnode.
>
>
> The rest of the scripts (run-workers.sh, run.sh, run-swift.sh, main.sh) are written to run experiments in our cluster and wont work in Surveyor.
>
> Please let me know if you have questions.
>
> Thank you
> Emalayan
>
> ________________________________
> From: Justin M Wozniak
> To: Emalayan Vairavanathan
> Cc: "swift-devel at ci.uchicago.edu" ; MosaStore ; matei
> Sent: Thursday, 22 March 2012 11:54 AM
> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
>
> On Wed, 21 Mar 2012, Emalayan Vairavanathan wrote:
>
>> But I just setup everything on Surveyor and it works locally on the head node. You can find the setup here.
>> /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
>
> What is the entry point?
>
> Are we missing common.sh?
>
> -- Justin M Wozniak

--
Justin M Wozniak

From jonmon at mcs.anl.gov Thu Mar 22 15:44:50 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 22 Mar 2012 15:44:50 -0500
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To:
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID:

Was that with CDM in your run? I am going to take a look too, to see why that is showing up.

On Mar 22, 2012, at 3:42 PM, Justin M Wozniak wrote:

>
> Ok, I can get it started but I get:
>
> 2012-03-22 20:24:06,048+0000 INFO SetFieldValue Set: projected_images[0]=null
> 2012-03-22 20:24:06,049+0000 INFO SetFieldValue Set: projected_images[6]=null
> 2012-03-22 20:24:06,050+0000 INFO SetFieldValue Set: projected_images[7]=null
>
> resulting in:
>
> File not found: /gpfs/home/wozniak/SwiftMontage/scripts/./proj_dir/null
>
> This looks like a Swift bug. However, do you guys have an existing workaround?
>
> Thanks
>
> On Thu, 22 Mar 2012, Emalayan Vairavanathan wrote:
>
>> Hi Justin,
>>
>> Please use ./run_local.sh to run the montage without cdm locally on the headnode.
>>
>>
>> The rest of the scripts (run-workers.sh, run.sh, run-swift.sh, main.sh) are written to run experiments in our cluster and wont work in Surveyor.
>> Please let me know if you have questions.
>>
>> Thank you
>> Emalayan
>> ________________________________
>> From: Justin M Wozniak
>> To: Emalayan Vairavanathan
>> Cc: "swift-devel at ci.uchicago.edu" ; MosaStore ; matei
>> Sent: Thursday, 22 March 2012 11:54 AM
>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
>> On Wed, 21 Mar 2012, Emalayan Vairavanathan wrote:
>>
>>> But I just setup everything on Surveyor and it works locally on the head node. You can find the setup here.
>>> /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
>>
>> What is the entry point?
>>
>> Are we missing common.sh?
>>
>> -- Justin M Wozniak
>
> --
> Justin M Wozniak
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From svemalayan at yahoo.com Thu Mar 22 15:47:18 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 22 Mar 2012 13:47:18 -0700 (PDT)
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To:
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID: <1332449238.3479.YahooMailNeo@web39508.mail.mud.yahoo.com>

Hi Jon,

I have already tried different combinations too. Some work and some do not. The cdm rules you provided (fs_1.data) work, but the other rule sets (fs_2.data, fs_3.data, fs_4.data and fs_5.data) did not.

To run Montage with Mosa+Swift we need to use the rules in fs_3.data (input in GPFS, intermediate results in Mosa, and output written back to GPFS). Or maybe we can modify the montage-swift script to stage data in and out. In that case we need rules like those in fs_5.data.

I think some stage of Montage expects its input to be in a specific location.

Please let me know if you need any help with debugging. Also, please correct me if I am wrong.

Thank you
Emalayan

fs_1.data

rule .*raw_dir.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts
rule .*header.hdr.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/
rule .*proj_dir/.* DIRECT /tmp/local1
rule .*diff_dir/.* DIRECT /tmp/local2
rule .*stat_dir/.* DIRECT /tmp/local3
rule .*rect_dir/.* DIRECT /tmp/local4

fs_2.data

rule .*raw_dir.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts
rule .*header.hdr.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/
rule .*proj_dir/.* DIRECT /tmp/local1
rule .*diff_dir/.* DIRECT /tmp/local2
rule .*stat_dir/.* DIRECT /tmp/local3
rule .*rect_dir/.* DIRECT /tmp/local4
rule .*final/.* DIRECT /tmp/local5

fs_3.data

rule .*raw_dir.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts
rule .*header.hdr.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/
rule .*final.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/
rule .*.* DIRECT /tmp/local

fs_4.data

rule .*raw_dir.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts
rule .*header.hdr.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/
rule .*.* DIRECT /tmp/local

fs_5.data (Here I copied all the input data to /tmp/local)

rule .*.* DIRECT /tmp/local

________________________________
From: Jonathan Monette
To: Emalayan Vairavanathan
Cc: Justin M Wozniak ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore
Sent: Thursday, 22 March 2012 12:47 PM
Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM

So here is an old cdm file I used when running this on PADS

rule .*raw_dir/.* DIRECT /gpfs/pads/swift/jonmon/Swift/SwiftMontage/big
rule .*proj_dir/.* DIRECT /gpfs/pads/swift/jonmon/Swift/SwiftMontage/big/run.0012
rule .*diff_dir/.* DIRECT /gpfs/pads/swift/jonmon/Swift/SwiftMontage/big/run.0012
rule .*stat_dir/.* DIRECT /gpfs/pads/swift/jonmon/Swift/SwiftMontage/big/run.0012
rule .*rect_dir/.* DIRECT /gpfs/pads/swift/jonmon/Swift/SwiftMontage/big/run.0012

The one Emalayan is using is

rule .*raw_dir.* DIRECT /home/emalayan/App/montage-swift/SwiftMontage/scripts
rule .*final.* DIRECT /home/emalayan/App/montage-swift/SwiftMontage/scripts
rule .*.* DIRECT /tmp/mosa

I explicitly used DIRECT on the different intermediate directories (proj_dir, diff_dir, stat_dir, and rect_dir). In yours, however, you match anything that is not raw_dir or final and put it into mosa. I would say try to do something like mine where you explicitly map those intermediate directories to /tmp/mosa

On Mar 22, 2012, at 2:35 PM, Emalayan Vairavanathan wrote:

>Hi Justin,
>
>Please use ./run_local.sh to run the montage without cdm locally on the headnode.
>
>The rest of the scripts (run-workers.sh, run.sh, run-swift.sh, main.sh) are written to run experiments in our cluster and wont work in Surveyor.
>
>Please let me know if you have questions.
>
>Thank you
>Emalayan
>
>________________________________
>From: Justin M Wozniak
>To: Emalayan Vairavanathan
>Cc: "swift-devel at ci.uchicago.edu" ; MosaStore ; matei
>Sent: Thursday, 22 March 2012 11:54 AM
>Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
>
>On Wed, 21 Mar 2012, Emalayan Vairavanathan wrote:
>
>> But I just setup everything on Surveyor and it works locally on the head node. You can find the setup here.
>> /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
>
>What is the entry point?
>
>Are we missing common.sh?
>
>-- Justin M Wozniak
>
>_______________________________________________
>Swift-devel mailing list
>Swift-devel at ci.uchicago.edu
>https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wozniak at mcs.anl.gov Thu Mar 22 15:49:24 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Thu, 22 Mar 2012 15:49:24 -0500 (CDT)
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To:
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID:

This run had no CDM file.

On Thu, 22 Mar 2012, Jonathan Monette wrote:

> Was that with CDM in your run? I am going to take a look too as to why that is showing up.
>
>
> On Mar 22, 2012, at 3:42 PM, Justin M Wozniak wrote:
>
>>
>> Ok, I can get it started but I get:
>>
>> 2012-03-22 20:24:06,048+0000 INFO SetFieldValue Set: projected_images[0]=null
>> 2012-03-22 20:24:06,049+0000 INFO SetFieldValue Set: projected_images[6]=null
>> 2012-03-22 20:24:06,050+0000 INFO SetFieldValue Set: projected_images[7]=null
>>
>> resulting in:
>>
>> File not found: /gpfs/home/wozniak/SwiftMontage/scripts/./proj_dir/null
>>
>> This looks like a Swift bug. However, do you guys have an existing workaround?
>>
>> Thanks
>>
>> On Thu, 22 Mar 2012, Emalayan Vairavanathan wrote:
>>
>>> Hi Justin,
>>>
>>> Please use ./run_local.sh to run the montage without cdm locally on the headnode.
>>>
>>>
>>> The rest of the scripts (run-workers.sh, run.sh, run-swift.sh, main.sh) are written to run experiments in our cluster and wont work in Surveyor.
>>> Please let me know if you have questions.
>>>
>>> Thank you
>>> Emalayan
>>> ________________________________
>>> From: Justin M Wozniak
>>> To: Emalayan Vairavanathan
>>> Cc: "swift-devel at ci.uchicago.edu" ; MosaStore ; matei
>>> Sent: Thursday, 22 March 2012 11:54 AM
>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
>>> On Wed, 21 Mar 2012, Emalayan Vairavanathan wrote:
>>>
>>>> But I just setup everything on Surveyor and it works locally on the head node. You can find the setup here.
>>>> /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
>>>
>>> What is the entry point?
>>>
>>> Are we missing common.sh?
>>>
>>> -- Justin M Wozniak
>>
>> --
>> Justin M Wozniak_______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
>

-- Justin M Wozniak

From svemalayan at yahoo.com Thu Mar 22 15:49:33 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 22 Mar 2012 13:49:33 -0700 (PDT)
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To: <1332449238.3479.YahooMailNeo@web39508.mail.mud.yahoo.com>
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com> <1332449238.3479.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID: <1332449373.71139.YahooMailNeo@web39504.mail.mud.yahoo.com>

All these combinations I tried on a local file system in our cluster with Coasters. If you need more detail about my setup please let me know.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From svemalayan at yahoo.com Thu Mar 22 16:02:21 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 22 Mar 2012 14:02:21 -0700 (PDT)
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To:
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID: <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com>

Hi Justin,

For me the setup in /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts works.

I tried on login2.surveyor with the swift binaries located in /home/wozniak/Public/swift/bin/swift

May be you are using a different swift version or may be a different login machine.
Thank you
Emalayan

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jonmon at mcs.anl.gov Thu Mar 22 16:20:10 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 22 Mar 2012 16:20:10 -0500
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To: <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com>
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com> <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com>
Message-ID:

So I was able to run it on Surveyor and it completed. I see those same SetFieldValue lines, but I don't think they are an issue. They appear just because I declare an array called projected_images and fill it up inside another function with files I used the regexp mapper on. However, I do see the Swift garbage collector kicking in and throwing exceptions:

2012-03-22 20:58:03,070+0000 INFO FileGarbageCollector Failed to clean file://localhost/_concurrent/back_struct-d10301cf-5be0-4918-b0c2-49be758cf53a-7-array//elt-2.-field/b
java.lang.RuntimeException: org.globus.cog.abstraction.impl.file.FileNotFoundException: _concurrent/back_struct-d10301cf-5be0-4918-b0c2-49be758cf53a-7-array//elt-2.-field/b not found.
	at org.griphyn.vdl.mapping.AbsFile.clean(AbsFile.java:191)
	at org.griphyn.vdl.mapping.file.FileGarbageCollector.run(FileGarbageCollector.java:115)

I am not sure whether that may be causing problems when Emalayan tries to run with CDM.

Emalayan, it does not look like those scripts expect files to be in a certain location; at least that is not what is intended.
In the main swiftscript, when you call the other functions you pass the name of the directory in which you want the intermediate files to be stored. The SwiftMontage_Batch functions then use the directories you passed to map input/output files. The only things that are expected to be in certain places are raw_dir and header.hdr; those have to be in the pwd. However, I will continue debugging to make sure the assumptions I made hold.

As to the different CDM setups, I do use the concurrent mapper (files that get dumped to _concurrent), where Swift decides on the names. I did this for a couple of files whose names I did not care about and that were small enough that I didn't care whether they were staged in/out or not. CDM may not like that. Perhaps Swift expects those _concurrent files in a certain place but you told CDM to put them someplace different. I am not sure; that is just a hypothesis. I can always change the scripts to not use the concurrent mapper and use a better mapper for the CDM rules if that turns out to be the case.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wozniak at mcs.anl.gov Thu Mar 22 16:38:03 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Thu, 22 Mar 2012 16:38:03 -0500 (CDT)
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To:
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com> <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com>
Message-ID:

Ok. I may be seeing a transient bug that sometimes produces:

rect_img:Image = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies

Neglecting this bug, I can get the whole thing to run w/o CDM.

With CDM rule:

rule .*raw_dir.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts/

mProjectBatch works, producing log lines:

CDM_POLICY: raw_dir/2mass-atlas-990214n-j1190021.fits -> DIRECT

and completing those tasks. I still have some more tests to run...

Justin

-- Justin M Wozniak

From svemalayan at yahoo.com Thu Mar 22 16:40:59 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 22 Mar 2012 14:40:59 -0700 (PDT)
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To:
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com> <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com>
Message-ID: <1332452459.67730.YahooMailNeo@web39503.mail.mud.yahoo.com>

Hi Jon,

Thank you for your suggestions. By the way, can you try to run Montage with the CDM rules below and see whether the problem is with the concurrent mapper and not because of a location dependency?

rule .*raw_dir.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts
rule .*header.hdr.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/
rule .*final.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/
rule .*.* DIRECT /tmp/local

In the meantime I will run Montage again and see whether the Swift garbage collector throws any errors.

Please let me know if you have a better idea for debugging the problem.

Thank you
Emalayan

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jonmon at mcs.anl.gov Thu Mar 22 16:45:11 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 22 Mar 2012 16:45:11 -0500
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To: <1332452459.67730.YahooMailNeo@web39503.mail.mud.yahoo.com>
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com> <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com> <1332452459.67730.YahooMailNeo@web39503.mail.mud.yahoo.com>
Message-ID: <3C236C2E-0899-4D9D-B1F7-8289B8D51034@mcs.anl.gov>

I'll run with those rules. I saw the garbage collection exception in the log file. The exception may be causing no harm, but it probably shouldn't be thrown at all, so this may help fix it in trunk. But it just occurred to me: which Swift are you using again? I am running with the copy I grabbed from Justin's directory. Are you running that as well, or 0.93?
If you are running 0.93, you will not see that exception, because that version does not have the Swift garbage collector, so we should not waste time on that exception.

On Mar 22, 2012, at 4:40 PM, Emalayan Vairavanathan wrote:

> Hi Jon,
>
> Thank you for you suggestions. By the way can you try to run motage with the CDM rules below and see whether the problem is with concurrent mapper and not because of location dependency ?
>
> rule .*raw_dir.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts
> rule .*header.hdr.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/
> rule .*final.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/
> rule .*.* DIRECT /tmp/local
>
> Meantime I will run the montage again and see whether swift garbage-collector throws some error.
>
> Please let me know if you have a better idea to debug the problem.
>
> Thank you
> Emalayan
>
> From: Jonathan Monette
> To: Emalayan Vairavanathan
> Cc: Justin M Wozniak ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore
> Sent: Thursday, 22 March 2012 2:20 PM
> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
>
> So I was able to run it on surveyor and it completed. I see those same SetFielfValue lines but I don't think that is an issue. That is just because I declare an array called projected_images and fill it up inside another function with files I used the regexp mapper on. However I do see the Swift garbage collector kicking in and throwing exceptions:
>
> 2012-03-22 20:58:03,070+0000 INFO FileGarbageCollector Failed to clean file://localhost/_concurrent/back_struct-d10301cf-5be0-4918-b0c2-49be758cf53a-7-array//elt-2.-field/b
> java.lang.RuntimeException: org.globus.cog.abstraction.impl.file.FileNotFoundException: _concurrent/back_struct-d10301cf-5be0-4918-b0c2-49be758cf53a-7-array//elt-2.-field/b not found.
> at org.griphyn.vdl.mapping.AbsFile.clean(AbsFile.java:191) > at org.griphyn.vdl.mapping.file.FileGarbageCollector.run(FileGarbageCollector.java:115) > > > I am not sure if that may be causing problems for when Emalayan tries to run with CDM. > > Emalayan, it does not look like those scripts expect files to be in a certain location, at least that is not what is intended. In the main swiftscript, when you call the other functions you pass the directory name you want the intermediate files to be stored in. Then in the SwiftMontage_Batch functions it uses those directories you passed to map input/output files. The only things that are expected to be in certain places are the raw_dir and header.hdr. Those have to be in the pwd. However I will continue debugging to make sure those assumptions I made are holding. > > As to the different cdm setups, I do use the concurrent mapper(files that get dumped to _concurrent) where swift decides on the names. I did this for a couple files that I did not care what they were named and they were small enough that I didn't care if they were staged in/out or not. CDM may not like that. Perhaps Swift expects those _concurrent files in a certain place but you told CDM to put them someplace different. I am not sure, that is just a hypothesis. I can always change the scripts to not use the concurrent mapper and use a better mapper for the CDM rules if that turns out to be the case. > > On Mar 22, 2012, at 4:02 PM, Emalayan Vairavanathan wrote: > >> Hi Justin, >> >> For me the setup in /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts works. >> >> I tried on login2.surveyor with the swift binaries located in /home/wozniak/Public/swift/bin/swift >> >> May be you are using a different swift version or may be a different login machine. 
>> >> Thank you >> Emalayan >> >> From: Justin M Wozniak >> To: Jonathan Monette >> Cc: Emalayan Vairavanathan ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore >> Sent: Thursday, 22 March 2012 1:49 PM >> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >> >> >> This run had no CDM file. >> >> On Thu, 22 Mar 2012, Jonathan Monette wrote: >> >> > Was that with CDM in your run? I am going to take a look too as to why that is showing up. >> > >> > >> > On Mar 22, 2012, at 3:42 PM, Justin M Wozniak wrote: >> > >> >> >> >> Ok, I can get it started but I get: >> >> >> >> 2012-03-22 20:24:06,048+0000 INFO SetFieldValue Set: projected_images[0]=null >> >> 2012-03-22 20:24:06,049+0000 INFO SetFieldValue Set: projected_images[6]=null >> >> 2012-03-22 20:24:06,050+0000 INFO SetFieldValue Set: projected_images[7]=null >> >> >> >> resulting in: >> >> >> >> File not found: /gpfs/home/wozniak/SwiftMontage/scripts/./proj_dir/null >> >> >> >> This looks like a Swift bug. However, do you guys have an existing workaround? >> >> >> >> Thanks >> >> >> >> On Thu, 22 Mar 2012, Emalayan Vairavanathan wrote: >> >> >> >>> Hi Justin, >> >>> >> >>> Please use ./run_local.sh to run the montage without cdm locally on the headnode. >> >>> >> >>> >> >>> The rest of the scripts (run-workers.sh, run.sh, run-swift.sh, main.sh) are written to run experiments in our cluster and wont work in Surveyor. >> >>> Please let me know if you have questions. >> >>> >> >>> Thank you >> >>> Emalayan >> >>> ________________________________ >> >>> From: Justin M Wozniak >> >>> To: Emalayan Vairavanathan Cc: "swift-devel at ci.uchicago.edu" ; MosaStore ; matei Sent: Thursday, 22 March 2012 11:54 AM >> >>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >> >>> On Wed, 21 Mar 2012, Emalayan Vairavanathan wrote: >> >>> >> >>>> But I just setup everything on Surveyor and it works locally on the head node. You can find the setup here. 
>> >>>> /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
>> >>>
>> >>> What is the entry point?
>> >>>
>> >>> Are we missing common.sh?
>> >>>
>> >>> -- Justin M Wozniak
>> >>
>> >> --
>> >> Justin M Wozniak
>> >> _______________________________________________
>> >> Swift-devel mailing list
>> >> Swift-devel at ci.uchicago.edu
>> >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>> >
>>
>> --
>> Justin M Wozniak

From svemalayan at yahoo.com Thu Mar 22 16:53:37 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 22 Mar 2012 14:53:37 -0700 (PDT)
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To: <3C236C2E-0899-4D9D-B1F7-8289B8D51034@mcs.anl.gov>
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com> <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com> <1332452459.67730.YahooMailNeo@web39503.mail.mud.yahoo.com> <3C236C2E-0899-4D9D-B1F7-8289B8D51034@mcs.anl.gov>
Message-ID: <1332453217.83638.YahooMailNeo@web39505.mail.mud.yahoo.com>

Thank you Jon. I am using swift-0.93 (this is on our cluster), so I do not need to check for garbage-collector errors.

I forgot to mention one point about concurrent mappers. The pipeline Swift benchmark I wrote uses the concurrent mapper too. Last week I ran the benchmark successfully regardless of the concurrent mapper's location (I tried various locations for the concurrent mapper's files). So I guess the location of the concurrent mapper's files won't be an issue.

Thank you
Emalayan

________________________________
From: Jonathan Monette
To: Emalayan Vairavanathan
Cc: Justin M Wozniak ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore
Sent: Thursday, 22 March 2012 2:45 PM
Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM

I'll run with those rules.
From wozniak at mcs.anl.gov Thu Mar 22 16:56:33 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Thu, 22 Mar 2012 16:56:33 -0500 (CDT)
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To: <1332453217.83638.YahooMailNeo@web39505.mail.mud.yahoo.com>
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com> <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com> <1332452459.67730.YahooMailNeo@web39503.mail.mud.yahoo.com> <3C236C2E-0899-4D9D-B1F7-8289B8D51034@mcs.anl.gov> <1332453217.83638.YahooMailNeo@web39505.mail.mud.yahoo.com>
Message-ID:

Ok, I can run the whole workflow with fs.data:

rule .*raw_dir.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
rule .*header.hdr.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
# rule .*.* DIRECT /tmp/local

I am now going to uncomment the last line...

Emalayan, can you try to run that and see what happens?

On Thu, 22 Mar 2012, Emalayan Vairavanathan wrote:

> Thank you Jon. I am using swift-0.93 (this is in our Cluster). So I do not need to check for garbage-collector errors.
>
> I forgot to mention one point about concurrent mappers. The pipeline swift benchmark I wrote uses concurrent mapper too. Last week I ran the benchmark successfully regardless of the concurrent mappers location (I tried various location for concurrent mappers). So I guess the location of the concurrent mappers wont be an issue.
>
> Thank you
> Emalayan

-- 
Justin M Wozniak

From svemalayan at yahoo.com Thu Mar 22 17:01:16 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 22 Mar 2012 15:01:16 -0700 (PDT)
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To:
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com> <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com> <1332452459.67730.YahooMailNeo@web39503.mail.mud.yahoo.com> <3C236C2E-0899-4D9D-B1F7-8289B8D51034@mcs.anl.gov> <1332453217.83638.YahooMailNeo@web39505.mail.mud.yahoo.com>
Message-ID: <1332453676.97766.YahooMailNeo@web39505.mail.mud.yahoo.com>

Hi Justin,

I already tried this too. It works for me as well, but it did not work when I uncommented the last line.

Thank you
Emalayan

________________________________
From: Justin M Wozniak
To: Emalayan Vairavanathan
Cc: Jonathan Monette ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore
Sent: Thursday, 22 March 2012 2:56 PM
Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM

Ok, I can run the whole workflow with fs.data:

rule .*raw_dir.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
rule .*header.hdr.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
# rule .*.* DIRECT /tmp/local

I am now going to uncomment the last line...

Emalayan, can you try to run that and see what happens?

On Thu, 22 Mar 2012, Emalayan Vairavanathan wrote:

> Thank you Jon. I am using swift-0.93 (this is in our Cluster). So I do not need to check for garbage-collector errors.
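The fs.data rules quoted in this exchange can be read as an ordered list of (regex, policy, argument) entries. The Python sketch below is not Swift's CDM code. It assumes that a file name must match the whole regex, that the first matching rule wins, and that unmatched files fall back to DEFAULT; the function names and sample file names are hypothetical. Under those assumptions it illustrates why uncommenting the catch-all `rule .*.* DIRECT /tmp/local` changes the policy for every file not claimed by an earlier rule, including files under `_concurrent`.

```python
import re


def parse_policy(text):
    """Parse 'rule <regex> <POLICY> [argument]' lines; skip blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        fields = line.split()
        # Ignore empty lines, commented-out rules, and non-rule directives.
        if not fields or fields[0].startswith("#") or fields[0] != "rule":
            continue
        regex, policy = fields[1], fields[2]
        argument = fields[3] if len(fields) > 3 else None
        rules.append((re.compile(regex), policy, argument))
    return rules


def lookup(rules, path):
    """Return (policy, argument) for the first rule whose regex matches all of path.

    ASSUMPTION: first match wins and unmatched paths fall back to DEFAULT.
    """
    for regex, policy, argument in rules:
        if regex.fullmatch(path):
            return policy, argument
    return "DEFAULT", None


SCRIPTS = "/home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts"
BASE = f"""
rule .*raw_dir.* DIRECT {SCRIPTS}
rule .*header.hdr.* DIRECT {SCRIPTS}
"""
CATCH_ALL = "rule .*.* DIRECT /tmp/local\n"

# Hypothetical file names, for illustration only.
SAMPLES = ["raw_dir/image_001.fits", "header.hdr", "_concurrent/back_struct-0/elt-2"]

if __name__ == "__main__":
    for label, text in (("without catch-all", BASE), ("with catch-all", BASE + CATCH_ALL)):
        rules = parse_policy(text)
        print(label)
        for path in SAMPLES:
            print("  ", path, "->", lookup(rules, path))
```

With the catch-all enabled, the `_concurrent` sample resolves to DIRECT under /tmp/local instead of DEFAULT, which is the situation Jonathan's hypothesis about the concurrent mapper describes.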
From wozniak at mcs.anl.gov Thu Mar 22 17:03:22 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Thu, 22 Mar 2012 17:03:22 -0500 (CDT)
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To:
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com> <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com> <1332452459.67730.YahooMailNeo@web39503.mail.mud.yahoo.com> <3C236C2E-0899-4D9D-B1F7-8289B8D51034@mcs.anl.gov> <1332453217.83638.YahooMailNeo@web39505.mail.mud.yahoo.com>
Message-ID:

I think SwiftMontage is trying to do a readData() on a CDM DIRECT file. This file is not accessible to the Swift engine, as it will be on the compute nodes in MosaStore.

We need to enumerate the file names that must be accessible to Swift for readData(). These will have to be CDM DEFAULT. The rest can be CDM DIRECT.

Justin

On Thu, 22 Mar 2012, Justin M Wozniak wrote:

> Ok, I can run the whole workflow with fs.data:
>
> rule .*raw_dir.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
> rule .*header.hdr.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
> # rule .*.* DIRECT /tmp/local
>
> I am now going to uncomment the last line...
>
> Emalayan, can you try to run that and see what happens?
>
> On Thu, 22 Mar 2012, Emalayan Vairavanathan wrote:
>
>> Thank you Jon. I am using swift-0.93 (this is in our Cluster). So I do not
>> need to check for garbage-collector errors.
>>
>> I forgot to mention one point about concurrent mappers. The pipeline swift
>> benchmark I wrote uses concurrent mapper too. Last week I ran the benchmark
>> successfully regardless of the concurrent mappers location (I tried various
>> location for concurrent mappers). So I guess the location of the concurrent
>> mappers wont be an issue.
>> >> Thank you >> Emalayan >> >> >> >> ________________________________ >> From: Jonathan Monette >> To: Emalayan Vairavanathan Cc: Justin M Wozniak >> ; matei ; >> "swift-devel at ci.uchicago.edu" ; MosaStore >> Sent: Thursday, 22 March 2012 2:45 PM >> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >> >> >> I'll run with those rules. ?I saw the garbage collection exception in the >> log file. ?The exception may be causing no harm, but it probably shouldn't >> be throwing an exception at all so this may help fix it in trunk. >> >> But it just occurred to me, which swift are you using again? ?I am running >> with the copy I grabbed from Justin's directory. ?Are you running that as >> well or with 0.93? ?If with 0.93 you will not see that exception because >> that version does not have the swift garbage collector so we should not >> waste time with that exception. >> >> On Mar 22, 2012, at 4:40 PM, Emalayan Vairavanathan wrote: >> >> Hi Jon, >>> >>> >>> Thank you for you suggestions. By the way can you try to run motage with >>> the CDM rules below and see whether the problem is with concurrent mapper >>> and not because of? location dependency ? >>> >>> >>> >>> rule .*raw_dir.* DIRECT >>> /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts >>> rule .*header.hdr.* DIRECT >>> /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/ >>> rule .*final.* DIRECT? >> /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/ >>> rule .*.* DIRECT /tmp/local >>> >>> >>> Meantime I will run the montage again and see whether swift >>> garbage-collector throws some error. >>> >>> >>> Please let me know if you have a better idea to debug the problem. 
>>> >>> >>> >>> Thank you >>> Emalayan >>> >>> >>> >>> ________________________________ >>> From: Jonathan Monette >>> To: Emalayan Vairavanathan Cc: Justin M Wozniak >>> ; matei ; >>> "swift-devel at ci.uchicago.edu" ; MosaStore >>> Sent: Thursday, 22 March 2012 2:20 PM >>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >>> >>> >>> So I was able to run it on surveyor and it completed. I see those same >>> SetFielfValue lines but I don't think that is an issue. That is just >>> because I declare an array called projected_images and fill it up inside >>> another function with files I used the regexp mapper on. However I do see >>> the Swift garbage collector kicking in and throwing exceptions: >>> >>> >>> 2012-03-22 20:58:03,070+0000 INFO FileGarbageCollector Failed to clean >>> file://localhost/_concurrent/back_struct-d10301cf-5be0-4918-b0c2-49be758cf53a-7-array//elt-2.-field/b >>> java.lang.RuntimeException: >>> org.globus.cog.abstraction.impl.file.FileNotFoundException: >>> _concurrent/back_struct-d10301cf-5be0-4918-b0c2-49be758cf53a-7-array//elt-2.-field/b >>> not found. >>> at org.griphyn.vdl.mapping.AbsFile.clean(AbsFile.java:191) >>> at >>> org.griphyn.vdl.mapping.file.FileGarbageCollector.run(FileGarbageCollector.java:115) >>> >>> >>> >>> >>> I am not sure if that may be causing problems for when Emalayan tries to >>> run with CDM. >>> >>> >>> Emalayan, it does not look like those scripts expect files to be in a >>> certain location, at least that is not what is intended. In the main >>> swiftscript, when you call the other functions you pass the directory name >>> you want the intermediate files to be stored in. Then in the >>> SwiftMontage_Batch functions it uses those directories you passed to map >>> input/output files. The only things that are expected to be in certain >>> places are the raw_dir and header.hdr. Those have to be in the pwd. >>> However I will continue debugging to make sure those assumptions I made >>> are holding.
>>> >>> >>> As to the different cdm setups, I do use the concurrent mapper(files that >>> get dumped to _concurrent) where swift decides on the names. I did this >>> for a couple files that I did not care what they were named and they were >>> small enough that I didn't care if they were staged in/out or not. CDM >>> may not like that. Perhaps Swift expects those _concurrent files in a >>> certain place but you told CDM to put them someplace different. I am not >>> sure, that is just a hypothesis. I can always change the scripts to not >>> use the concurrent mapper and use a better mapper for the CDM rules if >>> that turns out to be the case. >>> >>> >>> On Mar 22, 2012, at 4:02 PM, Emalayan Vairavanathan wrote: >>> >>> Hi Justin, >>>> >>>> >>>> For me the setup in >>>> /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts works. >>>> >>>> >>>> >>>> I tried onlogin2.surveyor with the swift binaries located in >>>> /home/wozniak/Public/swift/bin/swift >>>> >>>> >>>> May be you are using a different swift version or may be a different >>>> login machine. >>>> >>>> >>>> Thank you >>>> Emalayan >>>> >>>> >>>> >>>> >>>> ________________________________ >>>> From: Justin M Wozniak >>>> To: Jonathan Monette Cc: Emalayan Vairavanathan >>>> ; matei ; >>>> "swift-devel at ci.uchicago.edu" ; MosaStore >>>> Sent: Thursday, 22 March 2012 1:49 PM >>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >>>> >>>> >>>> This run had no CDM >> file. >>>> >>>> On Thu, 22 Mar 2012, Jonathan Monette wrote: >>>> >>>>> Was that with CDM in your run? I am going to take a look too as to why >>>>> that is showing up. >>>>> >>>>> >>>>> On Mar 22, 2012, at 3:42 PM, Justin M Wozniak wrote: >>>>> >>>>>> >>>>>> Ok, I can get it started but I get: >>>>>> >>>>>> 2012-03-22 20:24:06,048+0000 INFO SetFieldValue Set: >>>>>> projected_images[0]=null >>>>>> 2012-03-22 20:24:06,049+0000 INFO SetFieldValue Set: >>>>>> projected_images[6]=null >>>>>> 2012-03-22 20:24:06,050+0000 INFO
SetFieldValue Set: >>>>>> projected_images[7]=null >>>>>> >>>>>> resulting in: >>>>>> >>>>>> File not found: /gpfs/home/wozniak/SwiftMontage/scripts/./proj_dir/null >>>>>> >>>>>> This looks like a Swift bug. However, do you guys have an existing >>>>>> workaround? >>>>>> >>>>>> Thanks >>>>>> >>>>>> On Thu, 22 Mar 2012, Emalayan >> Vairavanathan wrote: >>>>>> >>>>>>> Hi Justin, >>>>>>> >>>>>>> Please use ./run_local.sh to run the montage without cdm locally on >>>>>>> the headnode. >>>>>>> >>>>>>> >>>>>>> The rest of the scripts (run-workers.sh, run.sh, run-swift.sh, >>>>>>> main.sh) are written to run experiments in our cluster and wont work >>>>>>> in Surveyor. >>>>>>> Please let me know if you have questions. >>>>>>> >>>>>>> Thank you >>>>>>> Emalayan >>>>>>> ________________________________ >>>>>>> From: Justin M Wozniak >>>>>>> To: Emalayan Vairavanathan Cc: >>>>>>> "swift-devel at ci.uchicago.edu" ; MosaStore >>>>>>> ; matei Sent: Thursday, >>>>>>> 22 March 2012 11:54 AM >>>>>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >>>>>>> On Wed, 21 Mar 2012, Emalayan Vairavanathan wrote: >>>>>>> >>>>>>>> But I just setup everything on Surveyor and it works locally on >>>>>>>> the head node. You can find the setup here. >>>>>>>> /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts >>>>>>> >>>>>>> What is the entry >> point? >>>>>>> >>>>>>> Are we missing common.sh?
>>>>>>> >>>>>>> -- Justin M >> Wozniak >>>>>> >>>>>> -- >>>>>> Justin M Wozniak_______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>> >>>>> >>>> >>>> -- >>>> Justin M Wozniak >>>> >>>> >>>> >>> >>> >>> > > -- Justin M Wozniak From jonmon at mcs.anl.gov Thu Mar 22 17:11:53 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 22 Mar 2012 17:11:53 -0500 Subject: [Swift-devel] Issues with Montage & Swift-CDM In-Reply-To: References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com> <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com> <1332452459.67730.YahooMailNeo@web39503.mail.mud.yahoo.com> <3C236C2E-0899-4D9D-B1F7-8289B8D51034@mcs.anl.gov> <1332453217.83638.YahooMailNeo@web39505.mail.mud.yahoo.com> Message-ID: <61547069-A191-4F9A-9807-51AA84EF6528@mcs.anl.gov> So those files are in _concurrent. Those files(the ones mapped with the concurrent mapper) are read by readData or readData2. So if he uses fs_1.data from a previous email it works, which he said he confirmed. On Mar 22, 2012, at 5:03 PM, Justin M Wozniak wrote: > > I think SwiftMontage is trying to do a readData() on a CDM DIRECT file. This file is not accessible to the Swift engine, as it will be on the compute nodes in MosaStore. > > We need to enumerate the file names that must be accessible to Swift for readData(). These will have to be CDM DEFAULT. The rest can be CDM DIRECT. 
> > Justin > > On Thu, 22 Mar 2012, Justin M Wozniak wrote: > >> Ok, I can run the whole workflow with fs.data: >> >> rule .*raw_dir.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts >> rule .*header.hdr.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts >> # rule .*.* DIRECT /tmp/local >> >> I am now going to uncomment the last line... >> >> Emalayan, can you try to run that and see what happens? >> >> On Thu, 22 Mar 2012, Emalayan Vairavanathan wrote: >> >>> Thank you Jon. I am using swift-0.93 (this is in our Cluster). So I do not need to check for garbage-collector errors. >>> I forgot to mention one point about concurrent mappers. The pipeline swift benchmark I wrote uses concurrent mapper too. Last week I ran the benchmark successfully regardless of the concurrent mappers location (I tried various location for concurrent mappers). So I guess the location of the concurrent mappers wont be an issue. >>> Thank you >>> Emalayan >>> ________________________________ >>> From: Jonathan Monette >>> To: Emalayan Vairavanathan Cc: Justin M Wozniak ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore Sent: Thursday, 22 March 2012 2:45 PM >>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >>> I'll run with those rules. I saw the garbage collection exception in the log file. The exception may be causing no harm, but it probably shouldn't be throwing an exception at all so this may help fix it in trunk. >>> But it just occurred to me, which swift are you using again? I am running with the copy I grabbed from Justin's directory. Are you running that as well or with 0.93? If with 0.93 you will not see that exception because that version does not have the swift garbage collector so we should not waste time with that exception. >>> On Mar 22, 2012, at 4:40 PM, Emalayan Vairavanathan wrote: >>> Hi Jon, >>>> Thank you for you suggestions. 
By the way can you try to run motage with the CDM rules below and see whether the problem is with concurrent mapper and not because of location dependency ? >>>> rule .*raw_dir.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts >>>> rule .*header.hdr.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/ >>>> rule .*final.* DIRECT >>> /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/ >>>> rule .*.* DIRECT /tmp/local >>>> Meantime I will run the montage again and see whether swift garbage-collector throws some error. >>>> Please let me know if you have a better idea to debug the problem. >>>> Thank you >>>> Emalayan >>>> ________________________________ >>>> From: Jonathan Monette >>>> To: Emalayan Vairavanathan Cc: Justin M Wozniak ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore Sent: Thursday, 22 March 2012 2:20 PM >>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >>>> So I was able to run it on surveyor and it completed. I see those same SetFielfValue lines but I don't think that is an issue. That is just because I declare an array called projected_images and fill it up inside another function with files I used the regexp mapper on. However I do see the Swift garbage collector kicking in and throwing exceptions: >>>> 2012-03-22 20:58:03,070+0000 INFO FileGarbageCollector Failed to clean file://localhost/_concurrent/back_struct-d10301cf-5be0-4918-b0c2-49be758cf53a-7-array//elt-2.-field/b >>>> java.lang.RuntimeException: org.globus.cog.abstraction.impl.file.FileNotFoundException: _concurrent/back_struct-d10301cf-5be0-4918-b0c2-49be758cf53a-7-array//elt-2.-field/b not found. >>>> at org.griphyn.vdl.mapping.AbsFile.clean(AbsFile.java:191) >>>> at org.griphyn.vdl.mapping.file.FileGarbageCollector.run(FileGarbageCollector.java:115) >>>> I am not sure if that may be causing problems for when Emalayan tries to run with CDM. 
>>>> Emalayan, it does not look like those scripts expect files to be in a certain location, at least that is not what is intended. In the main swiftscript, when you call the other functions you pass the directory name you want the intermediate files to be stored in. Then in the SwiftMontage_Batch functions it uses those directories you passed to map input/output files. The only things that are expected to be in certain places are the raw_dir and header.hdr. Those have to be in the pwd. However I will continue debugging to make sure those assumptions I made are holding. >>>> As to the different cdm setups, I do use the concurrent mapper(files that get dumped to _concurrent) where swift decides on the names. I did this for a couple files that I did not care what they were named and they were small enough that I didn't care if they were staged in/out or not. CDM may not like that. Perhaps Swift expects those _concurrent files in a certain place but you told CDM to put them someplace different. I am not sure, that is just a hypothesis. I can always change the scripts to not use the concurrent mapper and use a better mapper for the CDM rules if that turns out to be the case. >>>> On Mar 22, 2012, at 4:02 PM, Emalayan Vairavanathan wrote: >>>> Hi Justin, >>>>> For me the setup in /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts works. >>>>> I tried onlogin2.surveyor with the swift binaries located in /home/wozniak/Public/swift/bin/swift >>>>> May be you are using a different swift version or may be a different login machine. Thank you >>>>> Emalayan >>>>> ________________________________ >>>>> From: Justin M Wozniak >>>>> To: Jonathan Monette Cc: Emalayan Vairavanathan ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore Sent: Thursday, 22 March 2012 1:49 PM >>>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >>>>> This run had no CDM >>> file. >>>>> On Thu, 22 Mar 2012, Jonathan Monette wrote: >>>>>> Was that with CDM in your run? 
I am going to take a look too as to why that is showing up. >>>>>> On Mar 22, 2012, at 3:42 PM, Justin M Wozniak wrote: >>>>>>> Ok, I can get it started but I get: >>>>>>> 2012-03-22 20:24:06,048+0000 INFO SetFieldValue Set: projected_images[0]=null >>>>>>> 2012-03-22 20:24:06,049+0000 INFO SetFieldValue Set: projected_images[6]=null >>>>>>> 2012-03-22 20:24:06,050+0000 INFO SetFieldValue Set: projected_images[7]=null >>>>>>> resulting in: >>>>>>> File not found: /gpfs/home/wozniak/SwiftMontage/scripts/./proj_dir/null >>>>>>> This looks like a Swift bug. However, do you guys have an existing workaround? >>>>>>> Thanks >>>>>>> On Thu, 22 Mar 2012, Emalayan >>> Vairavanathan wrote: >>>>>>>> Hi Justin, >>>>>>>> Please use ./run_local.sh to run the montage without cdm locally on the headnode. >>>>>>>> The rest of the scripts (run-workers.sh, run.sh, run-swift.sh, main.sh) are written to run experiments in our cluster and wont work in Surveyor. >>>>>>>> Please let me know if you have questions. >>>>>>>> Thank you >>>>>>>> Emalayan >>>>>>>> ________________________________ >>>>>>>> From: Justin M Wozniak >>>>>>>> To: Emalayan Vairavanathan Cc: "swift-devel at ci.uchicago.edu" ; MosaStore ; matei Sent: Thursday, 22 March 2012 11:54 AM >>>>>>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >>>>>>>> On Wed, 21 Mar 2012, Emalayan Vairavanathan wrote: >>>>>>>>> But I just setup everything on Surveyor and it works locally on the head node. You can find the setup here. >>>>>>>>> /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts >>>>>>>> What is the entry >>> point? >>>>>>>> Are we missing common.sh? 
>>>>>>>> -- Justin M >>> Wozniak >>>>>>> -- >>>>>>> Justin M Wozniak_______________________________________________ >>>>>>> Swift-devel mailing list >>>>>>> Swift-devel at ci.uchicago.edu >>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>> -- >>>>> Justin M Wozniak >> >> > > -- > Justin M Wozniak_______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From iraicu at cs.iit.edu Thu Mar 22 17:33:24 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Thu, 22 Mar 2012 17:33:24 -0500 Subject: [Swift-devel] Call for Participation: 12th IEEE/ACM Int. Symp. on Cluster, Grid and Cloud, Computing (CCGrid 2012) -- May13-16 in Ottawa, Canada Message-ID: <4F6BA8B4.80008@cs.iit.edu> *** Our apologies if you receive multiple copies of this Call *** 12th IEEE/ACM International Symposium on Cluster, Grid and Cloud Computing (CCGrid 2012) Ottawa, Canada May 13-16, 2012 http://www.cloudbus.org/ccgrid2012 Venue: The Delta Ottawa City Centre Hotel [Special rates for conference attendees. To book visit: www.cloudbus.org/ccgrid2012/accommodations.html.] *************************** CALL FOR PARTICIPATION *************************** *** Registration is Open. *** *** Take advantage of the Early Registration Rates until April 9/2012 *** Overview: Rapid advances in processing, communication and systems/middleware technologies are leading to new paradigms and platforms for computing, ranging from computing Clusters to widely distributed Grid and emerging Clouds.
CCGrid is a series of very successful conferences, sponsored by the IEEE Computer Society Technical Committee on Scalable Computing (TCSC) and ACM, with the overarching goal of bringing together international researchers, developers, and users and providing an international forum to present leading research activities and results on a broad range of topics related to these platforms and paradigms and their applications. The conference features keynotes, technical presentations, posters and research demos, workshops, tutorials, as well as the SCALE challenges featuring live demonstrations. In 2012, CCGrid will come to Canada for the first time and will be held in Ottawa, the capital city. The symposium will be held on May 13-16 during which the city will be celebrating its world-famous Tulip Festival. CCGrid 2012 will have a focus on important and immediate issues that are significantly influencing all aspects of cluster, cloud and grid computing. Topics discussed in the technical sessions include: Applications and Experiences; Architecture: System architectures, Design and deployment; Autonomic Computing and Cyberinfrastructure; Performance Modeling and Evaluation; Programming Models, Systems, and Fault-Tolerant Computing; Multicore and Accelerator-based Computing; Scheduling and Resource Management; Cloud Computing: Cloud architectures; Software tools and techniques for clouds. ******************* PROGRAM HIGHLIGHTS ****************** TECHNICAL SESSIONS WILL INCLUDE * Programming Models and File Systems * Map Reduce and Workflows * QoS and Architecture * GPU * Cloud Services I * I/O and File Systems * Programming Models * Cloud Computing I * Communication and Networks * Faults, Failures and Reliability * Workflows * Scheduling and Monitoring * Virtualization * Cloud Services * Data on the Cloud * Multicore Architectures * Cloud Computing II * Applications KEYNOTES: * Dr. Alok Chaudhury (North Western University, USA) * Dr.
Dick Epema (TU Delfts, Netherlands) * The winner of the TCSC medal on Scalable Computing WORKSHOPS: * International workshop on Cloud for Business, Industry and Enterprises (C4BIE 2012) * Workshop on Cloud Computing Optimization (CCOPT 2012) * 2nd International Workshop on Cloud Computing and Scientific Applications (CCSA 2012) * Workshop on Modeling and Simulation on Grid and Cloud Computing (MSGC 2012) * 1st International Workshop on Data-intensive Process Management in Large-Scale Sensor Systems (DPMSS 2012) TUTORIALS PANELS & INDUSTRIAL SESSIONS DOCTORAL SYMPOSIUM POSTER/DEMO SESSIONS SCALE 2012: The Fourth IEEE International Scalable Computing Challenge CHAIRS General Chair * Shikharesh Majumdar, Carleton University, Canada Program Committee Co-Chairs * Rajkumar Buyya, University of Melbourne, Australia * Pavan Balaji, Argonne National Laboratory, USA Program Committee Vice-chairs * Daniel S. Katz (Applications and Experiences) * Dhabaleswar K. Panda (Architecture) * Manish Parashar (Middleware, Autonomic Computing, and Cyberinfrastructure) * Ahmad Afsahi (Performance Modeling and Analysis) * Xian-He Sun (Performance Measurement and Evaluation) * William Gropp (Programming Models, Systems, and Fault-Tolerant computing) * David Bader (Multicore and Accelerator-based Computing) * Thomas Fahringer (Scheduling and Resource Management) * Ignacio Martin Llorente and Madhusudhan Govindaraju (Cloud Computing) Honorary Chair * Geoffrey Fox, Indiana University, USA IMPORTANT DATES Early Registration: From now until April 9/2012 Late/Onsite Registration: April 10, 2012 onwards Conference: May 13-16, 2012 -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= From svemalayan at yahoo.com Thu Mar 22 18:28:32 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Thu, 22 Mar 2012 16:28:32 -0700 (PDT) Subject: [Swift-devel] Issues with Montage & Swift-CDM In-Reply-To: <61547069-A191-4F9A-9807-51AA84EF6528@mcs.anl.gov> References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com> <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com> <1332452459.67730.YahooMailNeo@web39503.mail.mud.yahoo.com> <3C236C2E-0899-4D9D-B1F7-8289B8D51034@mcs.anl.gov> <1332453217.83638.YahooMailNeo@web39505.mail.mud.yahoo.com> <61547069-A191-4F9A-9807-51AA84EF6528@mcs.anl.gov> Message-ID: <1332458912.89567.YahooMailNeo@web39508.mail.mud.yahoo.com> Hi Justin and Jon, I thought the goal of CDM was to provide a translation between the file data type (used in Swift) and the actual location of the files. This avoids hard-coding the actual file locations in the Swift script and also helps Swift harness platform-specific data transfer mechanisms. But from what Justin said, it seems the issue is with the rules (usage of CDM_DIRECT vs. CDM DEFAULT). I do not understand how such a translation layer gets confused by CDM_DIRECT / CDM DEFAULT.
Do readData() calls not go through this translation layer? Maybe I am wrong here. If so, please correct me and provide more high-level information. Jon: What is the action plan? Do we need to modify the Montage Swift scripts, or do you suggest that I use CDM rules with CDM_DEFAULT for some files? (In that case these intermediate files will be stored in GPFS.) Thank you very much Emalayan ________________________________ From: Jonathan Monette To: Justin M Wozniak Cc: Emalayan Vairavanathan ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore Sent: Thursday, 22 March 2012 3:11 PM Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM So those files are in _concurrent. Those files(the ones mapped with the concurrent mapper) are read by readData or readData2. So if he uses fs_1.data from a previous email it works, which he said he confirmed. On Mar 22, 2012, at 5:03 PM, Justin M Wozniak wrote: > > I think SwiftMontage is trying to do a readData() on a CDM DIRECT file. This file is not accessible to the Swift engine, as it will be on the compute nodes in MosaStore. > > We need to enumerate the file names that must be accessible to Swift for readData(). These will have to be CDM DEFAULT. The rest can be CDM DIRECT. > > Justin > > On Thu, 22 Mar 2012, Justin M Wozniak wrote: > >> Ok, I can run the whole workflow with fs.data: >> >> rule .*raw_dir.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts >> rule .*header.hdr.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts >> # rule .*.* DIRECT /tmp/local >> >> I am now going to uncomment the last line... >> >> Emalayan, can you try to run that and see what happens? >> >> On Thu, 22 Mar 2012, Emalayan Vairavanathan wrote: >> >>> Thank you Jon. I am using swift-0.93 (this is in our Cluster). So I do not need to check for garbage-collector errors. >>> I forgot to mention one point about concurrent mappers.
The pipeline swift benchmark I wrote uses concurrent mapper too. Last week I ran the benchmark successfully regardless of the concurrent mappers location (I tried various location for concurrent mappers). So I guess the location of the concurrent mappers wont be an issue. >>> Thank you >>> Emalayan >>> ________________________________ >>> From: Jonathan Monette >>> To: Emalayan Vairavanathan Cc: Justin M Wozniak ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore Sent: Thursday, 22 March 2012 2:45 PM >>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >>> I'll run with those rules. I saw the garbage collection exception in the log file. The exception may be causing no harm, but it probably shouldn't be throwing an exception at all so this may help fix it in trunk. >>> But it just occurred to me, which swift are you using again? I am running with the copy I grabbed from Justin's directory. Are you running that as well or with 0.93? If with 0.93 you will not see that exception because that version does not have the swift garbage collector so we should not waste time with that exception. >>> On Mar 22, 2012, at 4:40 PM, Emalayan Vairavanathan wrote: >>> Hi Jon, >>>> Thank you for you suggestions. By the way can you try to run motage with the CDM rules below and see whether the problem is with concurrent mapper and not because of location dependency ? >>>> rule .*raw_dir.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts >>>> rule .*header.hdr.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/ >>>> rule .*final.* DIRECT >>> /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/ >>>> rule .*.* DIRECT /tmp/local >>>> Meantime I will run the montage again and see whether swift garbage-collector throws some error. >>>> Please let me know if you have a better idea to debug the problem.
>>>> Thank you >>>> Emalayan >>>> ________________________________ >>>> From: Jonathan Monette >>>> To: Emalayan Vairavanathan Cc: Justin M Wozniak ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore Sent: Thursday, 22 March 2012 2:20 PM >>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >>>> So I was able to run it on surveyor and it completed. I see those same SetFielfValue lines but I don't think that is an issue. That is just because I declare an array called projected_images and fill it up inside another function with files I used the regexp mapper on. However I do see the Swift garbage collector kicking in and throwing exceptions: >>>> 2012-03-22 20:58:03,070+0000 INFO FileGarbageCollector Failed to clean file://localhost/_concurrent/back_struct-d10301cf-5be0-4918-b0c2-49be758cf53a-7-array//elt-2.-field/b >>>> java.lang.RuntimeException: org.globus.cog.abstraction.impl.file.FileNotFoundException: _concurrent/back_struct-d10301cf-5be0-4918-b0c2-49be758cf53a-7-array//elt-2.-field/b not found. >>>> at org.griphyn.vdl.mapping.AbsFile.clean(AbsFile.java:191) >>>> at org.griphyn.vdl.mapping.file.FileGarbageCollector.run(FileGarbageCollector.java:115) >>>> I am not sure if that may be causing problems for when Emalayan tries to run with CDM. >>>> Emalayan, it does not look like those scripts expect files to be in a certain location, at least that is not what is intended. In the main swiftscript, when you call the other functions you pass the directory name you want the intermediate files to be stored in. Then in the SwiftMontage_Batch functions it uses those directories you passed to map input/output files. The only things that are expected to be in certain places are the raw_dir and header.hdr. Those have to be in the pwd. However I will continue debugging to make sure those assumptions I made are holding.
>>>> As to the different cdm setups, I do use the concurrent mapper(files that get dumped to _concurrent) where swift decides on the names. I did this for a couple files that I did not care what they were named and they were small enough that I didn't care if they were staged in/out or not. CDM may not like that. Perhaps Swift expects those _concurrent files in a certain place but you told CDM to put them someplace different. I am not sure, that is just a hypothesis. I can always change the scripts to not use the concurrent mapper and use a better mapper for the CDM rules if that turns out to be the case. >>>> On Mar 22, 2012, at 4:02 PM, Emalayan Vairavanathan wrote: >>>> Hi Justin, >>>>> For me the setup in /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts works. >>>>> I tried onlogin2.surveyor with the swift binaries located in /home/wozniak/Public/swift/bin/swift >>>>> May be you are using a different swift version or may be a different login machine. Thank you >>>>> Emalayan >>>>> ________________________________ >>>>> From: Justin M Wozniak >>>>> To: Jonathan Monette Cc: Emalayan Vairavanathan ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore Sent: Thursday, 22 March 2012 1:49 PM >>>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >>>>> This run had no CDM >>> file. >>>>> On Thu, 22 Mar 2012, Jonathan Monette wrote: >>>>>> Was that with CDM in your run? I am going to take a look too as to why that is showing up. >>>>>> On Mar 22, 2012, at 3:42 PM, Justin M Wozniak wrote: >>>>>>> Ok, I can get it started but I get: >>>>>>> 2012-03-22 20:24:06,048+0000 INFO SetFieldValue Set: projected_images[0]=null >>>>>>> 2012-03-22 20:24:06,049+0000 INFO SetFieldValue Set: projected_images[6]=null >>>>>>> 2012-03-22 20:24:06,050+0000 INFO SetFieldValue Set: projected_images[7]=null >>>>>>> resulting in: >>>>>>> File not found: /gpfs/home/wozniak/SwiftMontage/scripts/./proj_dir/null >>>>>>> This looks like a Swift bug.
However, do you guys have an existing workaround? >>>>>>> Thanks >>>>>>> On Thu, 22 Mar 2012, Emalayan >>> Vairavanathan wrote: >>>>>>>> Hi Justin, >>>>>>>> Please use ./run_local.sh to run the montage without cdm locally on the headnode. >>>>>>>> The rest of the scripts (run-workers.sh, run.sh, run-swift.sh, main.sh) are written to run experiments in our cluster and wont work in Surveyor. >>>>>>>> Please let me know if you have questions. >>>>>>>> Thank you >>>>>>>> Emalayan >>>>>>>> ________________________________ >>>>>>>> From: Justin M Wozniak >>>>>>>> To: Emalayan Vairavanathan Cc: "swift-devel at ci.uchicago.edu" ; MosaStore ; matei Sent: Thursday, 22 March 2012 11:54 AM >>>>>>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM >>>>>>>> On Wed, 21 Mar 2012, Emalayan Vairavanathan wrote: >>>>>>>>> But I just setup everything on Surveyor and it works locally on the head node. You can find the setup here. >>>>>>>>> /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts >>>>>>>> What is the entry >>> point? >>>>>>>> Are we missing common.sh? >>>>>>>> -- Justin M >>> Wozniak >>>>>>> -- >>>>>>> Justin M Wozniak_______________________________________________ >>>>>>> Swift-devel mailing list >>>>>>> Swift-devel at ci.uchicago.edu >>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>>>> -- >>>>> Justin M Wozniak >> >> > > -- > Justin M Wozniak_______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed...
URL: From jonmon at mcs.anl.gov Thu Mar 22 23:01:49 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 22 Mar 2012 23:01:49 -0500 Subject: [Swift-devel] Issues with Montage & Swift-CDM In-Reply-To: <1332458912.89567.YahooMailNeo@web39508.mail.mud.yahoo.com> References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com> <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com> <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com> <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com> <1332452459.67730.YahooMailNeo@web39503.mail.mud.yahoo.com> <3C236C2E-0899-4D9D-B1F7-8289B8D51034@mcs.anl.gov> <1332453217.83638.YahooMailNeo@web39505.mail.mud.yahoo.com> <61547069-A191-4F9A-9807-51AA84EF6528@mcs.anl.gov> <1332458912.89567.YahooMailNeo@web39508.mail.mud.yahoo.com> Message-ID: <0E7D9820-106D-4B45-A195-A6267BA13A2B@mcs.anl.gov> Justin can (probably) elaborate more on the readData issue and CDM. This is actually the first I have heard of it. I was under the same assumptions about CDM and Swift as you were, but Justin is the one who wrote CDM for Swift, so he knows more of the pitfalls. As for the action plan, I think that until we have more information about the Swift/CDM interaction (the readData interaction Justin described), you should leave some files as DEFAULT, which means they will be stored on GPFS. In the SwiftMontage case those files are small and used only once or twice at most; in fact, I believe the apps that generate those files run on localhost (at least that is how I tested them). You should not see any file system performance difference, but since the overall goal of MosaStore is to use it for all intermediate data, we will probably need to modify the scripts. The only question is what to modify them to, and for that we need Justin's input. You can certainly proceed with setting up SwiftMontage for testing and start writing code to capture the performance numbers you need if you leave some files as DEFAULT, so this may not be a blocker at the moment.
Note that you do not need to assign a file to DEFAULT explicitly in CDM. If the CDM rule file has no rule for a file, it defaults automatically.

On Mar 22, 2012, at 6:28 PM, Emalayan Vairavanathan wrote:

> Hi Justin and Jon,
>
> I thought the goal of having CDM is to provide a translation between the file data type (used in swift) and the actual location of the files. This will help to avoid the actual location of the file being hard coded in the swift-script and also help swift to harness the platform specific data transfer mechanisms.
>
> But from what Justin said, it seems the issue is with the rules (usage of CDM_DIRECT Vs CDM DEFAULT). I did not understand how such translation layer get confused by CDM_DIRECT / CDM DEFAULT. Does readData() calles does not go through this translation layer ?
>
> May be I am wrong here. If so please correct me and provide more high level information.
>
> Jon: What is the action plan ? Do we need modification in the monage swift scripts ? or Do you suggest me to use CDM rules with CDM_DEFAULT for some files ? (In this case these intermediate files will be stored in GPFS)
>
> Thank you very much
> Emalayan
>
> From: Jonathan Monette
> To: Justin M Wozniak
> Cc: Emalayan Vairavanathan ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore
> Sent: Thursday, 22 March 2012 3:11 PM
> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
>
> So those files are in _concurrent. Those files(the ones mapped with the concurrent mapper) are read by readData or readData2. So if he uses fs_1.data from a previous email it works, which he said he confirmed.
>
> On Mar 22, 2012, at 5:03 PM, Justin M Wozniak wrote:
>
> > I think SwiftMontage is trying to do a readData() on a CDM DIRECT file. This file is not accessible to the Swift engine, as it will be on the compute nodes in MosaStore.
> >
> > We need to enumerate the file names that must be accessible to Swift for readData(). These will have to be CDM DEFAULT. The rest can be CDM DIRECT.
> >
> > Justin
> >
> > On Thu, 22 Mar 2012, Justin M Wozniak wrote:
> >
> >> Ok, I can run the whole workflow with fs.data:
> >>
> >> rule .*raw_dir.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
> >> rule .*header.hdr.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
> >> # rule .*.* DIRECT /tmp/local
> >>
> >> I am now going to uncomment the last line...
> >>
> >> Emalayan, can you try to run that and see what happens?
> >>
> >> On Thu, 22 Mar 2012, Emalayan Vairavanathan wrote:
> >>
> >>> Thank you Jon. I am using swift-0.93 (this is in our Cluster). So I do not need to check for garbage-collector errors.
> >>> I forgot to mention one point about concurrent mappers. The pipeline swift benchmark I wrote uses concurrent mapper too. Last week I ran the benchmark successfully regardless of the concurrent mappers location (I tried various location for concurrent mappers). So I guess the location of the concurrent mappers wont be an issue.
> >>> Thank you
> >>> Emalayan
> >>> ________________________________
> >>> From: Jonathan Monette
> >>> To: Emalayan Vairavanathan
> >>> Cc: Justin M Wozniak ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore
> >>> Sent: Thursday, 22 March 2012 2:45 PM
> >>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
> >>> I'll run with those rules. I saw the garbage collection exception in the log file. The exception may be causing no harm, but it probably shouldn't be throwing an exception at all so this may help fix it in trunk.
> >>> But it just occurred to me, which swift are you using again? I am running with the copy I grabbed from Justin's directory. Are you running that as well or with 0.93? If with 0.93 you will not see that exception because that version does not have the swift garbage collector so we should not waste time with that exception.
> >>> On Mar 22, 2012, at 4:40 PM, Emalayan Vairavanathan wrote:
> >>> Hi Jon,
> >>>> Thank you for you suggestions.
> >>>> By the way can you try to run motage with the CDM rules below and see whether the problem is with concurrent mapper and not because of location dependency ?
> >>>> rule .*raw_dir.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts
> >>>> rule .*header.hdr.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/
> >>>> rule .*final.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/
> >>>> rule .*.* DIRECT /tmp/local
> >>>> Meantime I will run the montage again and see whether swift garbage-collector throws some error.
> >>>> Please let me know if you have a better idea to debug the problem.
> >>>> Thank you
> >>>> Emalayan
> >>>> ________________________________
> >>>> From: Jonathan Monette
> >>>> To: Emalayan Vairavanathan
> >>>> Cc: Justin M Wozniak ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore
> >>>> Sent: Thursday, 22 March 2012 2:20 PM
> >>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
> >>>> So I was able to run it on surveyor and it completed. I see those same SetFielfValue lines but I don't think that is an issue. That is just because I declare an array called projected_images and fill it up inside another function with files I used the regexp mapper on. However I do see the Swift garbage collector kicking in and throwing exceptions:
> >>>> 2012-03-22 20:58:03,070+0000 INFO FileGarbageCollector Failed to clean file://localhost/_concurrent/back_struct-d10301cf-5be0-4918-b0c2-49be758cf53a-7-array//elt-2.-field/b
> >>>> java.lang.RuntimeException: org.globus.cog.abstraction.impl.file.FileNotFoundException: _concurrent/back_struct-d10301cf-5be0-4918-b0c2-49be758cf53a-7-array//elt-2.-field/b not found.
> >>>> at org.griphyn.vdl.mapping.AbsFile.clean(AbsFile.java:191)
> >>>> at org.griphyn.vdl.mapping.file.FileGarbageCollector.run(FileGarbageCollector.java:115)
> >>>> I am not sure if that may be causing problems for when Emalayan tries to run with CDM.
> >>>> Emalayan, it does not look like those scripts expect files to be in a certain location, at least that is not what is intended. In the main swiftscript, when you call the other functions you pass the directory name you want the intermediate files to be stored in. Then in the SwiftMontage_Batch functions it uses those directories you passed to map input/output files. The only things that are expected to be in certain places are the raw_dir and header.hdr. Those have to be in the pwd. However I will continue debugging to make sure those assumptions I made are holding.
> >>>> As to the different cdm setups, I do use the concurrent mapper(files that get dumped to _concurrent) where swift decides on the names. I did this for a couple files that I did not care what they were named and they were small enough that I didn't care if they were staged in/out or not. CDM may not like that. Perhaps Swift expects those _concurrent files in a certain place but you told CDM to put them someplace different. I am not sure, that is just a hypothesis. I can always change the scripts to not use the concurrent mapper and use a better mapper for the CDM rules if that turns out to be the case.
> >>>> On Mar 22, 2012, at 4:02 PM, Emalayan Vairavanathan wrote:
> >>>> Hi Justin,
> >>>>> For me the setup in /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts works.
> >>>>> I tried onlogin2.surveyor with the swift binaries located in /home/wozniak/Public/swift/bin/swift
> >>>>> May be you are using a different swift version or may be a different login machine. Thank you
> >>>>> Emalayan
> >>>>> ________________________________
> >>>>> From: Justin M Wozniak
> >>>>> To: Jonathan Monette
> >>>>> Cc: Emalayan Vairavanathan ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore
> >>>>> Sent: Thursday, 22 March 2012 1:49 PM
> >>>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
> >>>>> This run had no CDM file.
> >>>>> On Thu, 22 Mar 2012, Jonathan Monette wrote:
> >>>>>> Was that with CDM in your run? I am going to take a look too as to why that is showing up.
> >>>>>> On Mar 22, 2012, at 3:42 PM, Justin M Wozniak wrote:
> >>>>>>> Ok, I can get it started but I get:
> >>>>>>> 2012-03-22 20:24:06,048+0000 INFO SetFieldValue Set: projected_images[0]=null
> >>>>>>> 2012-03-22 20:24:06,049+0000 INFO SetFieldValue Set: projected_images[6]=null
> >>>>>>> 2012-03-22 20:24:06,050+0000 INFO SetFieldValue Set: projected_images[7]=null
> >>>>>>> resulting in:
> >>>>>>> File not found: /gpfs/home/wozniak/SwiftMontage/scripts/./proj_dir/null
> >>>>>>> This looks like a Swift bug. However, do you guys have an existing workaround?
> >>>>>>> Thanks
> >>>>>>> On Thu, 22 Mar 2012, Emalayan Vairavanathan wrote:
> >>>>>>>> Hi Justin,
> >>>>>>>> Please use ./run_local.sh to run the montage without cdm locally on the headnode.
> >>>>>>>> The rest of the scripts (run-workers.sh, run.sh, run-swift.sh, main.sh) are written to run experiments in our cluster and wont work in Surveyor.
> >>>>>>>> Please let me know if you have questions.
> >>>>>>>> Thank you
> >>>>>>>> Emalayan
> >>>>>>>> ________________________________
> >>>>>>>> From: Justin M Wozniak
> >>>>>>>> To: Emalayan Vairavanathan
> >>>>>>>> Cc: "swift-devel at ci.uchicago.edu" ; MosaStore ; matei
> >>>>>>>> Sent: Thursday, 22 March 2012 11:54 AM
> >>>>>>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
> >>>>>>>> On Wed, 21 Mar 2012, Emalayan Vairavanathan wrote:
> >>>>>>>>> But I just setup everything on Surveyor and it works locally on the head node. You can find the setup here.
> >>>>>>>>> /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
> >>>>>>>> What is the entry point?
> >>>>>>>> Are we missing common.sh?
> >>>>>>>> -- Justin M Wozniak
> >>>>>>> --
> >>>>>>> Justin M Wozniak
> >>>>> --
> >>>>> Justin M Wozniak
> > --
> > Justin M Wozniak

From svemalayan at yahoo.com Fri Mar 23 05:00:14 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Fri, 23 Mar 2012 03:00:14 -0700 (PDT)
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To: <0E7D9820-106D-4B45-A195-A6267BA13A2B@mcs.anl.gov>
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com>
 <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com>
 <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com>
 <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com>
 <1332452459.67730.YahooMailNeo@web39503.mail.mud.yahoo.com>
 <3C236C2E-0899-4D9D-B1F7-8289B8D51034@mcs.anl.gov>
 <1332453217.83638.YahooMailNeo@web39505.mail.mud.yahoo.com>
 <61547069-A191-4F9A-9807-51AA84EF6528@mcs.anl.gov>
 <1332458912.89567.YahooMailNeo@web39508.mail.mud.yahoo.com>
 <0E7D9820-106D-4B45-A195-A6267BA13A2B@mcs.anl.gov>
Message-ID: <1332496814.91686.YahooMailNeo@web39505.mail.mud.yahoo.com>

Hi Jon,

Please find my reply below.

[Jon]: I think until we have more information about Swift/CDM interaction(the readData interaction Justin gave) you should have some files DEFAULT where they will be stored on gpfs. In the SwiftMontage case those files are small and only used once or twice at most.

[Emalayan] I believe Justin's reply would clear my doubts too.
Meanwhile I can proceed and try to integrate Montage on BG/P with MosaStore (keeping some intermediate files on GPFS). As an immediate next step, I will first try the setup with GPFS as the intermediate storage (before using MosaStore).

Thank you
Emalayan

From svemalayan at yahoo.com Fri Mar 23 05:02:42 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Fri, 23 Mar 2012 03:02:42 -0700 (PDT)
Subject: [Swift-devel] Montage+Swift workflow fails on Surveyor
Message-ID: <1332496962.56258.YahooMailNeo@web39503.mail.mud.yahoo.com>

Hi Jon,

I tried to run the Montage-Swift setup on Surveyor with persistent coasters and GPFS (this is without CDM). The last stage of the workflow fails. Have you ever tried this before on Surveyor? Do you have any clues about this issue? It would be great if you could have a look. (I pasted the standard output below.)

The setup is available in /home/emalayan/app/montage-swift/SwiftMontage/scripts. After starting the coasters via start-coaster-service, please use ./run_pc.sh to run the setup.

Please let me know if I can help you in any way.

Thank you
Emalayan

========================================stdout==============================================
emalayan at login2.surveyor:~/app/montage-swift/SwiftMontage/scripts> ./run_pc.sh
Swift trunk swift-r5704 (swift modified locally) cog-r3361 (cog modified locally)
RunID: 20120323-0900-oidnkire
 (input): found 10 files
Progress:  time: Fri, 23 Mar 2012 09:00:40 +0000
Find: http://172.17.3.12:12346
Find:  keepalive(120), reconnect - http://172.17.3.12:12346
Progress:  time: Fri, 23 Mar 2012 09:00:54 +0000  Selecting site:4  Submitted:3
  Active:2  Checking status:1
Progress:  time: Fri, 23 Mar 2012 09:01:07 +0000  Selecting site:1  Submitted:3  Active:2  Checking status:1  Finished successfully:3
Progress:  time: Fri, 23 Mar 2012 09:01:10 +0000  Submitted:1  Active:3  Finished successfully:6
Progress:  time: Fri, 23 Mar 2012 09:01:19 +0000  Submitted:1  Active:2  Checking status:1  Finished successfully:6
Progress:  time: Fri, 23 Mar 2012 09:01:32 +0000  Checking status:1  Finished successfully:9
Progress:  time: Fri, 23 Mar 2012 09:01:34 +0000  Checking status:1  Finished successfully:10
Progress:  time: Fri, 23 Mar 2012 09:01:35 +0000  Checking status:1  Finished successfully:11
Progress:  time: Fri, 23 Mar 2012 09:01:36 +0000  Selecting site:12  Submitted:3  Active:2  Checking status:1  Finished successfully:12
Progress:  time: Fri, 23 Mar 2012 09:01:38 +0000  Selecting site:9  Submitted:3  Active:2  Checking status:1  Finished successfully:15
Progress:  time: Fri, 23 Mar 2012 09:01:39 +0000  Selecting site:6  Submitted:3  Active:2  Checking status:1  Finished successfully:18
Progress:  time: Fri, 23 Mar 2012 09:01:40 +0000  Selecting site:3  Submitted:3  Active:3  Finished successfully:21
Progress:  time: Fri, 23 Mar 2012 09:01:41 +0000  Submitted:3  Active:3  Finished successfully:24
Progress:  time: Fri, 23 Mar 2012 09:01:43 +0000  Active:2  Checking status:1  Finished successfully:27
Progress:  time: Fri, 23 Mar 2012 09:01:44 +0000  Checking status:1  Finished successfully:30
Progress:  time: Fri, 23 Mar 2012 09:01:46 +0000  Checking status:1  Finished successfully:31
Progress:  time: Fri, 23 Mar 2012 09:01:47 +0000  Checking status:1  Finished successfully:32
Progress:  time: Fri, 23 Mar 2012 09:01:49 +0000  Selecting site:4  Submitted:3  Active:2  Checking status:1  Finished successfully:33
Progress:  time: Fri, 23 Mar 2012 09:01:50 +0000  Selecting site:1  Submitted:3  Active:2  Checking status:1  Finished successfully:36
Progress:  time: Fri, 23 Mar 2012 09:01:52 +0000  Submitted:1
  Active:2  Checking status:1  Finished successfully:39
Progress:  time: Fri, 23 Mar 2012 09:01:53 +0000  Finished successfully:42 Failed but can retry:1
EXCEPTION
Exception in mBackground_wrap:
Arguments: [-n, proj_dir/proj_2mass-atlas-990214n-j1110032.fits, rect_dir/proj_2mass-atlas-990214n-j1110032.fits, 0.0, 0.0, 0.0]
Host: persistent-coasters
Directory: SwiftMontage-20120323-0900-oidnkire/jobs/t/mBackground_wrap-t3it8uok
stderr.txt:
stdout.txt: [struct stat="ERROR", status=252, msg="1st key not SIMPLE or XTENSION"]
----
Caused by: Application /home/emalayan/app/montage-swift/SwiftMontage/apps/mBackground_wrap.py failed with an exit code of 1
Execution failed:
    Application /home/emalayan/app/montage-swift/SwiftMontage/apps/mBackground_wrap.py failed with an exit code of 1

From svemalayan at yahoo.com Fri Mar 23 05:24:46 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Fri, 23 Mar 2012 03:24:46 -0700 (PDT)
Subject: [Swift-devel] Montage+Swift workflow fails on Surveyor
In-Reply-To: <1332496962.56258.YahooMailNeo@web39503.mail.mud.yahoo.com>
References: <1332496962.56258.YahooMailNeo@web39503.mail.mud.yahoo.com>
Message-ID: <1332498286.84978.YahooMailNeo@web39505.mail.mud.yahoo.com>

Hi Jon,

I figured out the reason just now. It seems I didn't have enough space in my home directory. Now it works; sorry for the confusion.

Thank you
Emalayan
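A failure like the one in this exchange can be caught before a run with two small checks. What follows is a minimal Python sketch, not part of SwiftMontage: `free_bytes` and `looks_like_fits` are hypothetical helper names, and the idea that the full home directory left an empty or truncated FITS file behind is an assumption the thread does not confirm. The second check relies only on the FITS convention that a file begins with the keyword SIMPLE (or XTENSION for an extension header), which is what the "1st key not SIMPLE or XTENSION" message in the log complains about.

```python
import shutil


def free_bytes(path="."):
    """Return the free space, in bytes, on the file system holding path."""
    return shutil.disk_usage(path).free


def looks_like_fits(path):
    """Return True if the file starts with a FITS SIMPLE or XTENSION keyword.

    FITS keywords occupy the first 8 bytes of an 80-byte header card and are
    padded with spaces, so SIMPLE appears as b"SIMPLE  ".
    """
    with open(path, "rb") as handle:
        keyword = handle.read(8)
    return keyword in (b"SIMPLE  ", b"XTENSION")


if __name__ == "__main__":
    # Hypothetical usage: report free space in the current directory.
    print("free bytes here:", free_bytes("."))
```

An empty file, which is one thing a write to a full disk can leave behind, fails `looks_like_fits` because its first 8 bytes are missing.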
URL:

From wozniak at mcs.anl.gov Fri Mar 23 08:25:48 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Fri, 23 Mar 2012 08:25:48 -0500 (CDT)
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To: <1332458912.89567.YahooMailNeo@web39508.mail.mud.yahoo.com>
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com>
 <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com>
 <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com>
 <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com>
 <1332452459.67730.YahooMailNeo@web39503.mail.mud.yahoo.com>
 <3C236C2E-0899-4D9D-B1F7-8289B8D51034@mcs.anl.gov>
 <1332453217.83638.YahooMailNeo@web39505.mail.mud.yahoo.com>
 <61547069-A191-4F9A-9807-51AA84EF6528@mcs.anl.gov>
 <1332458912.89567.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID:

For MosaSwift purposes, CDM DIRECT is used to place files in a file system (Mosa) accessible to the worker nodes but not the login node. readData() is executed by the Swift Java process on the login node to read data into script variables.

Can we set up a phone call today? I am free at 2pm Central.

    Justin

On Thu, 22 Mar 2012, Emalayan Vairavanathan wrote:

> Hi Justin and Jon,
>
> I thought the goal of having CDM is to provide a translation between the
> file data type (used in swift) and the actual location of the files.
> This will help to avoid the actual location of the file being hard coded
> in the swift-script and also help swift to harness the platform specific
> data transfer mechanisms.
>
> But from what Justin said, it seems the issue is with the rules (usage
> of CDM_DIRECT Vs CDM DEFAULT). I did not understand how such translation
> layer get confused by CDM_DIRECT / CDM DEFAULT. Does readData() calles
> does not go through this translation layer ?
>
> May be I am wrong here. If so please correct me and provide more high level information.
>
> Jon: What is the action plan ? Do we need modification in the monage swift scripts ?
> or Do you suggest me to use CDM rules with CDM_DEFAULT for some files ? (In this case these intermediate files will be stored in GPFS)
>
> Thank you very much
>
> Emalayan
>
> ________________________________
> From: Jonathan Monette
> To: Justin M Wozniak
> Cc: Emalayan Vairavanathan ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore
> Sent: Thursday, 22 March 2012 3:11 PM
> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
>
> So those files are in _concurrent. Those files(the ones mapped with the concurrent mapper) are read by readData or readData2. So if he uses fs_1.data from a previous email it works, which he said he confirmed.
>
> On Mar 22, 2012, at 5:03 PM, Justin M Wozniak wrote:
>
>>
>> I think SwiftMontage is trying to do a readData() on a CDM DIRECT file.
>> This file is not accessible to the Swift engine, as it will be on the
>> compute nodes in MosaStore.
>>
>> We need to enumerate the file names that must be accessible to Swift
>> for readData(). These will have to be CDM DEFAULT. The rest can be
>> CDM DIRECT.
>>
>>     Justin
>>
>> On Thu, 22 Mar 2012, Justin M Wozniak wrote:
>>
>>> Ok, I can run the whole workflow with fs.data:
>>>
>>> rule .*raw_dir.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
>>> rule .*header.hdr.* DIRECT /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
>>> # rule .*.* DIRECT /tmp/local
>>>
>>> I am now going to uncomment the last line...
>>>
>>> Emalayan, can you try to run that and see what happens?
>>>
>>> On Thu, 22 Mar 2012, Emalayan Vairavanathan wrote:
>>>
>>>> Thank you Jon. I am using swift-0.93 (this is in our Cluster). So I do not need to check for garbage-collector errors.
>>>> I forgot to mention one point about concurrent mappers. The pipeline swift benchmark I wrote uses concurrent mapper too. Last week I ran the benchmark successfully regardless of the concurrent mappers location (I tried various location for concurrent mappers). So I guess the location of the concurrent mappers wont be an issue.
>>>> Thank you
>>>> Emalayan
>>>> ________________________________
>>>> From: Jonathan Monette
>>>> To: Emalayan Vairavanathan
>>>> Cc: Justin M Wozniak ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore
>>>> Sent: Thursday, 22 March 2012 2:45 PM
>>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
>>>> I'll run with those rules. I saw the garbage collection exception in the log file. The exception may be causing no harm, but it probably shouldn't be throwing an exception at all so this may help fix it in trunk.
>>>> But it just occurred to me, which swift are you using again? I am running with the copy I grabbed from Justin's directory. Are you running that as well or with 0.93? If with 0.93 you will not see that exception because that version does not have the swift garbage collector so we should not waste time with that exception.
>>>> On Mar 22, 2012, at 4:40 PM, Emalayan Vairavanathan wrote:
>>>> Hi Jon,
>>>>> Thank you for you suggestions. By the way can you try to run motage with the CDM rules below and see whether the problem is with concurrent mapper and not because of location dependency ?
>>>>> rule .*raw_dir.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts
>>>>> rule .*header.hdr.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/
>>>>> rule .*final.* DIRECT /home/emalayan/App/montage-swift-cdm/SwiftMontage/scripts/
>>>>> rule .*.* DIRECT /tmp/local
>>>>> Meantime I will run the montage again and see whether swift garbage-collector throws some error.
>>>>> Please let me know if you have a better idea to debug the problem.
>>>>> Thank you
>>>>> Emalayan
>>>>> ________________________________
>>>>> From: Jonathan Monette
>>>>> To: Emalayan Vairavanathan
>>>>> Cc: Justin M Wozniak ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore
>>>>> Sent: Thursday, 22 March 2012 2:20 PM
>>>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
>>>>> So I was able to run it on surveyor and it completed. I see those same SetFielfValue lines but I don't think that is an issue. That is just because I declare an array called projected_images and fill it up inside another function with files I used the regexp mapper on. However I do see the Swift garbage collector kicking in and throwing exceptions:
>>>>> 2012-03-22 20:58:03,070+0000 INFO  FileGarbageCollector Failed to clean file://localhost/_concurrent/back_struct-d10301cf-5be0-4918-b0c2-49be758cf53a-7-array//elt-2.-field/b
>>>>> java.lang.RuntimeException: org.globus.cog.abstraction.impl.file.FileNotFoundException: _concurrent/back_struct-d10301cf-5be0-4918-b0c2-49be758cf53a-7-array//elt-2.-field/b not found.
>>>>> at org.griphyn.vdl.mapping.AbsFile.clean(AbsFile.java:191)
>>>>> at org.griphyn.vdl.mapping.file.FileGarbageCollector.run(FileGarbageCollector.java:115)
>>>>> I am not sure if that may be causing problems for when Emalayan tries to run with CDM.
>>>>> Emalayan, it does not look like those scripts expect files to be in a certain location, at least that is not what is intended. In the main swiftscript, when you call the other functions you pass the directory name you want the intermediate files to be stored in. Then in the SwiftMontage_Batch functions it uses those directories you passed to map input/output files. The only things that are expected to be in certain places are the raw_dir and header.hdr. Those have to be in the pwd. However I will continue debugging to make sure those assumptions I made are holding.
>>>>> As to the different cdm setups, I do use the concurrent mapper(files that get dumped to _concurrent) where swift decides on the names. I did this for a couple files that I did not care what they were named and they were small enough that I didn't care if they were staged in/out or not. CDM may not like that. Perhaps Swift expects those _concurrent files in a certain place but you told CDM to put them someplace different. I am not sure, that is just a hypothesis. I can always change the scripts to not use the concurrent mapper and use a better mapper for the CDM rules if that turns out to be the case.
>>>>> On Mar 22, 2012, at 4:02 PM, Emalayan Vairavanathan wrote:
>>>>> Hi Justin,
>>>>>> For me the setup in /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts works.
>>>>>> I tried on login2.surveyor with the swift binaries located in /home/wozniak/Public/swift/bin/swift
>>>>>> May be you are using a different swift version or may be a different login machine. Thank you
>>>>>> Emalayan
>>>>>> ________________________________
>>>>>> From: Justin M Wozniak
>>>>>> To: Jonathan Monette
>>>>>> Cc: Emalayan Vairavanathan ; matei ; "swift-devel at ci.uchicago.edu" ; MosaStore
>>>>>> Sent: Thursday, 22 March 2012 1:49 PM
>>>>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
>>>>>> This run had no CDM file.
>>>>>> On Thu, 22 Mar 2012, Jonathan Monette wrote:
>>>>>>> Was that with CDM in your run? I am going to take a look too as to why that is showing up.
>>>>>>> On Mar 22, 2012, at 3:42 PM, Justin M Wozniak wrote:
>>>>>>>> Ok, I can get it started but I get:
>>>>>>>> 2012-03-22 20:24:06,048+0000 INFO  SetFieldValue Set: projected_images[0]=null
>>>>>>>> 2012-03-22 20:24:06,049+0000 INFO  SetFieldValue Set: projected_images[6]=null
>>>>>>>> 2012-03-22 20:24:06,050+0000 INFO  SetFieldValue Set: projected_images[7]=null
>>>>>>>> resulting in:
>>>>>>>> File not found: /gpfs/home/wozniak/SwiftMontage/scripts/./proj_dir/null
>>>>>>>> This looks like a Swift bug. However, do you guys have an existing workaround?
>>>>>>>>     Thanks
>>>>>>>> On Thu, 22 Mar 2012, Emalayan Vairavanathan wrote:
>>>>>>>>> Hi Justin,
>>>>>>>>> Please use ./run_local.sh to run the montage without cdm locally on the headnode.
>>>>>>>>> The rest of the scripts (run-workers.sh, run.sh, run-swift.sh, main.sh) are written to run experiments in our cluster and wont work in Surveyor.
>>>>>>>>> Please let me know if you have questions.
>>>>>>>>> Thank you
>>>>>>>>> Emalayan
>>>>>>>>> ________________________________
>>>>>>>>> From: Justin M Wozniak
>>>>>>>>> To: Emalayan Vairavanathan
>>>>>>>>> Cc: "swift-devel at ci.uchicago.edu" ; MosaStore ; matei
>>>>>>>>> Sent: Thursday, 22 March 2012 11:54 AM
>>>>>>>>> Subject: Re: [Swift-devel] Issues with Montage & Swift-CDM
>>>>>>>>> On Wed, 21 Mar 2012, Emalayan Vairavanathan wrote:
>>>>>>>>>>     But I just setup everything on Surveyor and it works locally on the head node. You can find the setup here.
>>>>>>>>>>     /home/emalayan/app/montage-swift-cdm/SwiftMontage/scripts
>>>>>>>>> What is the entry point?
>>>>>>>>> Are we missing common.sh?
>>>>>>>>> -- Justin M Wozniak
>>>>>>>> --
>>>>>>>> Justin M Wozniak
>>>>>>>> _______________________________________________
>>>>>>>> Swift-devel mailing list
>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>>>>>> --
>>>>>> Justin M Wozniak
>>
>> --
>> Justin M Wozniak
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Justin M Wozniak

From jonmon at mcs.anl.gov Fri Mar 23 09:25:07 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 23 Mar 2012 09:25:07 -0500
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To:
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com>
 <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com>
 <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com>
 <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com>
 <1332452459.67730.YahooMailNeo@web39503.mail.mud.yahoo.com>
 <3C236C2E-0899-4D9D-B1F7-8289B8D51034@mcs.anl.gov>
 <1332453217.83638.YahooMailNeo@web39505.mail.mud.yahoo.com>
 <61547069-A191-4F9A-9807-51AA84EF6528@mcs.anl.gov>
 <1332458912.89567.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID:

I am free at that time as well.

From svemalayan at yahoo.com Fri Mar 23 11:16:37 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Fri, 23 Mar 2012 09:16:37 -0700 (PDT)
Subject: [Swift-devel] Issues with Montage & Swift-CDM
In-Reply-To:
References: <1332207379.84736.YahooMailNeo@web39502.mail.mud.yahoo.com>
 <1332374137.43424.YahooMailNeo@web39507.mail.mud.yahoo.com>
 <1332444919.62721.YahooMailNeo@web39508.mail.mud.yahoo.com>
 <1332450141.14405.YahooMailNeo@web39501.mail.mud.yahoo.com>
 <1332452459.67730.YahooMailNeo@web39503.mail.mud.yahoo.com>
 <3C236C2E-0899-4D9D-B1F7-8289B8D51034@mcs.anl.gov>
 <1332453217.83638.YahooMailNeo@web39505.mail.mud.yahoo.com>
 <61547069-A191-4F9A-9807-51AA84EF6528@mcs.anl.gov>
 <1332458912.89567.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID: <1332519397.74361.YahooMailNeo@web39504.mail.mud.yahoo.com>

Hi Justin,

Sure we can meet, I am free too.

Samer, can you also join?

Thank you
Emalayan

From iraicu at cs.iit.edu Fri Mar 23 18:59:17 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Fri, 23 Mar 2012 18:59:17 -0500
Subject: [Swift-devel] CFP: Special Issue on Cloud Computing in Science & Engineering, in the IEEE Computing in Science & Engineering (CiSE)
Message-ID: <4F6D0E55.9040906@cs.iit.edu>

*Call for Papers*

*IEEE Computing in Science & Engineering*

*Special Issue on Cloud Computing in Science & Engineering*

http://www.computer.org/portal/web/computingnow/cise

*Submissions due: November 04, 2012*
*Estimated Publication date: July/August, 2013*

Cloud computing has emerged as a dominant paradigm that has been widely adopted by enterprises. Clouds provide on-demand access to computing utilities, an abstraction of unlimited computing resources, and support for on-demand scale up, scale down and scale out. Clouds are also rapidly joining more traditional computing platforms as viable platforms for scientific exploration and discovery, and education. As a result, understanding application formulations and usage modes that are meaningful in such a hybrid infrastructure, what are the fundamental conceptual and technological challenges, and how applications can effectively utilize it, is critical.
The goal of this special issue of CiSE is to explore how Cloud platforms and abstractions, either by themselves or in combination with other platforms, can be effectively used to support real-world science and engineering applications. Topics of interest include (but are not limited to) algorithmic and application formulations, programming models and systems, runtime systems and middleware, end-to-end application workflows and experiences with real applications. Published by the IEEE Computer Society, Computing in Science & Engineering magazine features the latest computational science and engineering research in an accessible format, along with departments covering news and analysis, CSE in education, and emerging technologies. We strongly encourage submissions that include multimedia, data, and community content, which will be featured on the IEEE Computer Society website along with the accepted papers. For more information please see http://www.computer.org/portal/web/computingnow/cscfp4 *Questions?* Contact guest editors *Manish Parashar*, Rutgers University (parashar at rutgers.edu) or *George K. Thiruvathukal*, Loyola University Chicago (gkt at cs.luc.edu). *Submission Guidelines* Authors are asked to submit high-quality original work that has neither appeared in nor is under consideration by other journals. All submissions will be peer-reviewed following standard journal practices. Manuscripts based on previously published conference papers must be extended substantially to include at least 30 percent new material. Manuscripts should be written in the active voice, should be no longer than 7,200 words (counting each standard figure and table as 250 words), and should follow the style and presentation guidelines of /CiSE/ (see *www.computer.org/cise/author* for details). Please submit your article using the online manuscript submission service at *https://mc.manuscriptcentral.com/cs-ieee*.
When uploading your article, select the appropriate special-issue title under the category "Manuscript Type." Also include complete contact information for all authors. If you have any questions about submitting your article, contact the peer review coordinator at *cise at computer.org*. -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonmon at mcs.anl.gov Wed Mar 28 20:30:52 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 28 Mar 2012 20:30:52 -0500 Subject: [Swift-devel] Coaster socket issue Message-ID: <1998FC8A-84E3-49FE-99CA-1B9BBBD79F79@mcs.anl.gov> Hello, In running the SciColSim app on raven (which is a cluster similar to Beagle) I noticed that the app hung. It was not the kind of hang where the hang checker kicks in: Swift was waiting for jobs to become active, but none had been submitted to PBS. I took a look at the log file and noticed that I had a java.io.IOException thrown for "too many open files". Since I killed it I couldn't probe the run but I had the same run running on Beagle. Upon Mike's suggestion I took a look at the /proc/PID/fd directory. There were over 2000 sockets in the CLOSE_WAIT state with a single message in the receive queue.
Raven has a limit of 1024 open files at a time while Beagle has a limit of around 60K open files. I got these limits using ulimit -n. So my question is, why are there so many sockets waiting to be closed? I did some reading about the CLOSE_WAIT state and it seems this happens when the remote end closes its socket but the local end does not. Is Coaster not closing the socket when a worker shuts down? What other information should I be looking for to help debug the issue? From davidk at ci.uchicago.edu Wed Mar 28 20:49:21 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Wed, 28 Mar 2012 20:49:21 -0500 (CDT) Subject: [Swift-devel] Coaster socket issue In-Reply-To: <1998FC8A-84E3-49FE-99CA-1B9BBBD79F79@mcs.anl.gov> Message-ID: <1412804836.77427.1332985761945.JavaMail.root@zimbra-mb2.anl.gov> Strange, I just ran into a similar issue tonight while running on ibicluster (SGE). I saw the "too many open files" error after sitting in the queue waiting for a job to start. I restarted the job and then periodically ran 'lsof' to see the number of java pipes increasing over time. I thought at first this might be SGE specific, but perhaps it is something else. (This was with 0.93) ----- Original Message ----- > From: "Jonathan Monette" > To: "swift-devel at ci.uchicago.edu Devel" > Sent: Wednesday, March 28, 2012 8:30:52 PM > Subject: [Swift-devel] Coaster socket issue > Hello, > In running the SciColSim app on raven(which is a cluster similar to > Beagle) I noticed that the app hung. It was not hung where the hang > checker kicked in but Swift was waiting for jobs to be active but > there was none submitted to PBS. I took a look at the log file and > noticed that I had a java.io.IOException thrown for "too many open > files". Since I killed it I couldn't probe the run but I had the same > run running on Beagle. Upon Mike's suggestion I took a look at the > /proc/PID/fd directory.
There were over 2000 sockets in the > CLOSE_WAIT state with a single message in the receive queue. Raven has > a limit of 1024 open files at a time while Beagle has a limit around > 60K number of files open. I got this limit using ulimit -n. > > So my question is, why is there so many sockets waiting to be closed? > I did some reading about the CLOSE_WAIT state and it seems this > happens when one of the ends closes there socket but the other does > not. Is Coaster not closing the socket when a worker shuts down? What > other information should I be looking for to help debug the issue. > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From jonmon at mcs.anl.gov Wed Mar 28 20:57:03 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 28 Mar 2012 20:57:03 -0500 Subject: [Swift-devel] Coaster socket issue In-Reply-To: <1412804836.77427.1332985761945.JavaMail.root@zimbra-mb2.anl.gov> References: <1412804836.77427.1332985761945.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <51C7CF38-1303-4C84-8CB3-76AB24D011A6@mcs.anl.gov> What is the open files limit on that machine(ulimit -n)? I have never witnessed this issue before so it may only appear on machines with relatively low open file limits(raven has 1K but beagle has 60K). This is still something we should look into though. On Mar 28, 2012, at 8:49 PM, David Kelly wrote: > > Strange, I just ran into a similar issues tonight while running on ibicluster (SGE). I saw the "too many open files" error after sitting in the queue waiting for a job to start. I restarted the job and then periodically ran 'lsof' to see the number of java pipes increasing over time. I thought at first this might be SGE specific, but perhaps it is something else. 
(This was with 0.93) > > ----- Original Message ----- >> From: "Jonathan Monette" >> To: "swift-devel at ci.uchicago.edu Devel" >> Sent: Wednesday, March 28, 2012 8:30:52 PM >> Subject: [Swift-devel] Coaster socket issue >> Hello, >> In running the SciColSim app on raven(which is a cluster similar to >> Beagle) I noticed that the app hung. It was not hung where the hang >> checker kicked in but Swift was waiting for jobs to be active but >> there was none submitted to PBS. I took a look at the log file and >> noticed that I had a java.io.IOException thrown for "too many open >> files". Since I killed it I couldn't probe the run but I had the same >> run running on Beagle. Upon Mike's suggestion I took a look at the >> /proc//fd directory. There were over 2000 sockets in the >> CLOSE_WAIT state with a single message in the receive queue. Raven has >> a limit of 1024 open files at a time while Beagle has a limit around >> 60K number of files open. I got this limit using ulimit -n. >> >> So my question is, why is there so many sockets waiting to be closed? >> I did some reading about the CLOSE_WAIT state and it seems this >> happens when one of the ends closes there socket but the other does >> not. Is Coaster not closing the socket when a worker shuts down? What >> other information should I be looking for to help debug the issue. 
>> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Wed Mar 28 21:10:38 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 28 Mar 2012 21:10:38 -0500 (CDT) Subject: [Swift-devel] Coaster socket issue In-Reply-To: <1412804836.77427.1332985761945.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1046559590.117285.1332987038392.JavaMail.root@zimbra.anl.gov> I think that on Jon's Beagle runs we saw about 100 pipes but several thousand sockets, so we didn't pay any attention to the pipes (yet). The sockets were clearly from workers to the coaster service. I have no idea yet what the pipes are. ls -l of /proc/PID/fd/ does a nice job of trying to identify and format the file name or object associated with each file descriptor. I suspect it's doing the same thing lsof does. - Mike ----- Original Message ----- > From: "David Kelly" > To: "Jonathan Monette" > Cc: "swift-devel at ci.uchicago.edu Devel" > Sent: Wednesday, March 28, 2012 8:49:21 PM > Subject: Re: [Swift-devel] Coaster socket issue > Strange, I just ran into a similar issues tonight while running on > ibicluster (SGE). I saw the "too many open files" error after sitting > in the queue waiting for a job to start. I restarted the job and then > periodically ran 'lsof' to see the number of java pipes increasing > over time. I thought at first this might be SGE specific, but perhaps > it is something else. (This was with 0.93) > > ----- Original Message ----- > > From: "Jonathan Monette" > > To: "swift-devel at ci.uchicago.edu Devel" > > > > Sent: Wednesday, March 28, 2012 8:30:52 PM > > Subject: [Swift-devel] Coaster socket issue > > Hello, > > In running the SciColSim app on raven(which is a cluster similar to > > Beagle) I noticed that the app hung.
It was not hung where the hang > > checker kicked in but Swift was waiting for jobs to be active but > > there was none submitted to PBS. I took a look at the log file and > > noticed that I had a java.io.IOException thrown for "too many open > > files". Since I killed it I couldn't probe the run but I had the > > same > > run running on Beagle. Upon Mike's suggestion I took a look at the > > /proc//fd directory. There were over 2000 sockets in the > > CLOSE_WAIT state with a single message in the receive queue. Raven > > has > > a limit of 1024 open files at a time while Beagle has a limit around > > 60K number of files open. I got this limit using ulimit -n. > > > > So my question is, why is there so many sockets waiting to be > > closed? > > I did some reading about the CLOSE_WAIT state and it seems this > > happens when one of the ends closes there socket but the other does > > not. Is Coaster not closing the socket when a worker shuts down? > > What > > other information should I be looking for to help debug the issue. > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From davidk at ci.uchicago.edu Wed Mar 28 21:11:31 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Wed, 28 Mar 2012 21:11:31 -0500 (CDT) Subject: [Swift-devel] Coaster socket issue In-Reply-To: <51C7CF38-1303-4C84-8CB3-76AB24D011A6@mcs.anl.gov> Message-ID: <102462288.77460.1332987091069.JavaMail.root@zimbra-mb2.anl.gov> The limit here seems to be 1024. Just curious, what happens when you run 'lsof -u jonmon'? 
For me, I see lines like this that grow over time: java 14589 dkelly 220r FIFO 0,6 601514288 pipe java 14589 dkelly 221r FIFO 0,6 601514581 pipe java 14589 dkelly 222w FIFO 0,6 601514852 pipe java 14589 dkelly 223r FIFO 0,6 601514582 pipe ----- Original Message ----- > From: "Jonathan Monette" > To: "David Kelly" > Cc: "swift-devel at ci.uchicago.edu Devel" > Sent: Wednesday, March 28, 2012 8:57:03 PM > Subject: Re: [Swift-devel] Coaster socket issue > What is the open files limit on that machine(ulimit -n)? I have never > witnessed this issue before so it may only appear on machines with > relatively low open file limits(raven has 1K but beagle has 60K). This > is still something we should look into though. > > On Mar 28, 2012, at 8:49 PM, David Kelly wrote: > > > > > Strange, I just ran into a similar issues tonight while running on > > ibicluster (SGE). I saw the "too many open files" error after > > sitting in the queue waiting for a job to start. I restarted the job > > and then periodically ran 'lsof' to see the number of java pipes > > increasing over time. I thought at first this might be SGE specific, > > but perhaps it is something else. (This was with 0.93) > > > > ----- Original Message ----- > >> From: "Jonathan Monette" > >> To: "swift-devel at ci.uchicago.edu Devel" > >> > >> Sent: Wednesday, March 28, 2012 8:30:52 PM > >> Subject: [Swift-devel] Coaster socket issue > >> Hello, > >> In running the SciColSim app on raven(which is a cluster similar to > >> Beagle) I noticed that the app hung. It was not hung where the hang > >> checker kicked in but Swift was waiting for jobs to be active but > >> there was none submitted to PBS. I took a look at the log file and > >> noticed that I had a java.io.IOException thrown for "too many open > >> files". Since I killed it I couldn't probe the run but I had the > >> same > >> run running on Beagle. Upon Mike's suggestion I took a look at the > >> /proc//fd directory. 
There were over 2000 sockets in the > >> CLOSE_WAIT state with a single message in the receive queue. Raven > >> has > >> a limit of 1024 open files at a time while Beagle has a limit > >> around > >> 60K number of files open. I got this limit using ulimit -n. > >> > >> So my question is, why is there so many sockets waiting to be > >> closed? > >> I did some reading about the CLOSE_WAIT state and it seems this > >> happens when one of the ends closes there socket but the other does > >> not. Is Coaster not closing the socket when a worker shuts down? > >> What > >> other information should I be looking for to help debug the issue. > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Wed Mar 28 21:21:18 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 28 Mar 2012 21:21:18 -0500 (CDT) Subject: [Swift-devel] Coaster socket issue In-Reply-To: <1046559590.117285.1332987038392.JavaMail.root@zimbra.anl.gov> Message-ID: <2022320755.117296.1332987678401.JavaMail.root@zimbra.anl.gov> Now that I think about it, I suspect the pipes may be from Swift running various commands, like qsub/qstat from the localscheduler provider, and/or app() calls from the local execution provider. I don't know if we ever paid much attention to whether these were all getting cleaned up. - Mike ----- Original Message ----- > From: "Michael Wilde" > To: "David Kelly" > Cc: "swift-devel at ci.uchicago.edu Devel" > Sent: Wednesday, March 28, 2012 9:10:38 PM > Subject: Re: [Swift-devel] Coaster socket issue > I think that on Jon's Beagle runs we say about 100 pipes but several > thousand sockets, so we didnt pay any attention to the pipes (yet). > > The sockets were clearly from workers to the coaster service. > > I have no idea yet what the pipes are.
ls -l of /proc/fd/ does a nice > job of trying to identify and format the file name or object > associated with each file descriptor. I suspect its doing the same > thing lsof does. > > - Mike > > ----- Original Message ----- > > From: "David Kelly" > > To: "Jonathan Monette" > > Cc: "swift-devel at ci.uchicago.edu Devel" > > > > Sent: Wednesday, March 28, 2012 8:49:21 PM > > Subject: Re: [Swift-devel] Coaster socket issue > > Strange, I just ran into a similar issues tonight while running on > > ibicluster (SGE). I saw the "too many open files" error after > > sitting > > in the queue waiting for a job to start. I restarted the job and > > then > > periodically ran 'lsof' to see the number of java pipes increasing > > over time. I thought at first this might be SGE specific, but > > perhaps > > it is something else. (This was with 0.93) > > > > ----- Original Message ----- > > > From: "Jonathan Monette" > > > To: "swift-devel at ci.uchicago.edu Devel" > > > > > > Sent: Wednesday, March 28, 2012 8:30:52 PM > > > Subject: [Swift-devel] Coaster socket issue > > > Hello, > > > In running the SciColSim app on raven(which is a cluster similar > > > to > > > Beagle) I noticed that the app hung. It was not hung where the > > > hang > > > checker kicked in but Swift was waiting for jobs to be active but > > > there was none submitted to PBS. I took a look at the log file and > > > noticed that I had a java.io.IOException thrown for "too many open > > > files". Since I killed it I couldn't probe the run but I had the > > > same > > > run running on Beagle. Upon Mike's suggestion I took a look at the > > > /proc//fd directory. There were over 2000 sockets in the > > > CLOSE_WAIT state with a single message in the receive queue. Raven > > > has > > > a limit of 1024 open files at a time while Beagle has a limit > > > around > > > 60K number of files open. I got this limit using ulimit -n. > > > > > > So my question is, why is there so many sockets waiting to be > > > closed? 
> > > I did some reading about the CLOSE_WAIT state and it seems this > > > happens when one of the ends closes there socket but the other > > > does > > > not. Is Coaster not closing the socket when a worker shuts down? > > > What > > > other information should I be looking for to help debug the issue. > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Wed Mar 28 21:25:34 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 28 Mar 2012 21:25:34 -0500 Subject: [Swift-devel] Coaster socket issue In-Reply-To: <102462288.77460.1332987091069.JavaMail.root@zimbra-mb2.anl.gov> References: <102462288.77460.1332987091069.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: So on Beagle (where I have had a run going for about 1.5 days) I get a lot of output. Running the command lsof -u jonmon | grep java | grep FIFO | wc -l I get 116. Running the command lsof -u jonmon | grep java | grep CLOSE_WAIT | wc -l I get 315. But going to /proc/PID/fd and running ls -l | grep socket | wc -l I get 2911. I am not sure which numbers to believe.
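One way to cross-check counts like these is to classify every descriptor of the process by its symlink target instead of grepping lsof output. The following is only a minimal Python sketch and is not part of Swift or Coasters: the helper names classify, count_fd_types and read_fd_targets are made up for illustration, and it assumes a Linux /proc filesystem in which socket and pipe descriptors read as socket:[inode] and pipe:[inode].

```python
import os
from collections import Counter


def classify(target):
    """Map a /proc/PID/fd symlink target to a coarse descriptor type."""
    if target.startswith("socket:["):
        return "socket"
    if target.startswith("pipe:["):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"


def count_fd_types(targets):
    """Count descriptor types for an iterable of symlink targets."""
    return Counter(classify(t) for t in targets)


def read_fd_targets(pid):
    """Return the symlink target of every open descriptor of a process (Linux only)."""
    fd_dir = "/proc/%d/fd" % pid
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets


if __name__ == "__main__" and os.path.isdir("/proc/self/fd"):
    # Hypothetical usage: this inspects the Python process itself;
    # pass the pid of the Swift JVM to read_fd_targets() instead.
    for kind, n in sorted(count_fd_types(read_fd_targets(os.getpid())).items()):
        print("%-10s %d" % (kind, n))
```

Note that a count taken this way (like ls -l | grep socket) includes sockets in every state, not only CLOSE_WAIT, so it would be expected to come out at least as large as the number from grep CLOSE_WAIT.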
The output from lsof -u jonmon | grep java | grep FIFO is
java 21823 jonmon 99r FIFO 0,6 0t0 259630056 pipe
java 21823 jonmon 100w FIFO 0,6 0t0 259630056 pipe
java 21823 jonmon 2745w FIFO 0,6 0t0 318767708 pipe
java 21823 jonmon 2747r FIFO 0,6 0t0 318767709 pipe
java 21823 jonmon 2813r FIFO 0,6 0t0 318779029 pipe
java 21823 jonmon 2830r FIFO 0,6 0t0 318767710 pipe
java 21823 jonmon 2874r FIFO 0,6 0t0 318779030 pipe
java 21823 jonmon 2909w FIFO 0,6 0t0 318779031 pipe
java 21823 jonmon 2961r FIFO 0,6 0t0 318484490 pipe
java 21823 jonmon 2964w FIFO 0,6 0t0 318484491 pipe
java 21823 jonmon 2966r FIFO 0,6 0t0 318484492 pipe
java 21823 jonmon 2989r FIFO 0,6 0t0 318558560 pipe
java 21823 jonmon 2991r FIFO 0,6 0t0 318632607 pipe
java 21823 jonmon 2993w FIFO 0,6 0t0 318558561 pipe
java 21823 jonmon 2997r FIFO 0,6 0t0 318558562 pipe
java 21823 jonmon 2999r FIFO 0,6 0t0 318632608 pipe
java 21823 jonmon 3002r FIFO 0,6 0t0 318632609 pipe
The count of these pipes seems to go up and down. A couple minutes ago it was at 116 (the above number) but now it is down to ~20. So the FIFO count is going up and down. My worry is the socket count and the number of sockets in the CLOSE_WAIT state. Those seem to vastly outnumber the pipes, at least according to what is in the fd directory of the process.
On Mar 28, 2012, at 9:11 PM, David Kelly wrote: > The limit here seems to be 1024. > > Just curious, what happens when you run 'lsof -u jonmon'? For me, I see lines like this that grow over time: > > java 14589 dkelly 220r FIFO 0,6 601514288 pipe > java 14589 dkelly 221r FIFO 0,6 601514581 pipe > java 14589 dkelly 222w FIFO 0,6 601514852 pipe > java 14589 dkelly 223r FIFO 0,6 601514582 pipe > > > ----- Original Message ----- >> From: "Jonathan Monette" >> To: "David Kelly" >> Cc: "swift-devel at ci.uchicago.edu Devel" >> Sent: Wednesday, March 28, 2012 8:57:03 PM >> Subject: Re: [Swift-devel] Coaster socket issue >> What is the open files limit on that machine(ulimit -n)?
I have never >> witnessed this issue before so it may only appear on machines with >> relatively low open file limits(raven has 1K but beagle has 60K). This >> is still something we should look into though. >> >> On Mar 28, 2012, at 8:49 PM, David Kelly wrote: >> >>> >>> Strange, I just ran into a similar issues tonight while running on >>> ibicluster (SGE). I saw the "too many open files" error after >>> sitting in the queue waiting for a job to start. I restarted the job >>> and then periodically ran 'lsof' to see the number of java pipes >>> increasing over time. I thought at first this might be SGE specific, >>> but perhaps it is something else. (This was with 0.93) >>> >>> ----- Original Message ----- >>>> From: "Jonathan Monette" >>>> To: "swift-devel at ci.uchicago.edu Devel" >>>> >>>> Sent: Wednesday, March 28, 2012 8:30:52 PM >>>> Subject: [Swift-devel] Coaster socket issue >>>> Hello, >>>> In running the SciColSim app on raven(which is a cluster similar to >>>> Beagle) I noticed that the app hung. It was not hung where the hang >>>> checker kicked in but Swift was waiting for jobs to be active but >>>> there was none submitted to PBS. I took a look at the log file and >>>> noticed that I had a java.io.IOException thrown for "too many open >>>> files". Since I killed it I couldn't probe the run but I had the >>>> same >>>> run running on Beagle. Upon Mike's suggestion I took a look at the >>>> /proc//fd directory. There were over 2000 sockets in the >>>> CLOSE_WAIT state with a single message in the receive queue. Raven >>>> has >>>> a limit of 1024 open files at a time while Beagle has a limit >>>> around >>>> 60K number of files open. I got this limit using ulimit -n. >>>> >>>> So my question is, why is there so many sockets waiting to be >>>> closed? >>>> I did some reading about the CLOSE_WAIT state and it seems this >>>> happens when one of the ends closes there socket but the other does >>>> not. 
Is Coaster not closing the socket when a worker shuts down? >>>> What >>>> other information should I be looking for to help debug the issue. >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From jonmon at mcs.anl.gov Wed Mar 28 21:26:17 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 28 Mar 2012 21:26:17 -0500 Subject: [Swift-devel] Coaster socket issue In-Reply-To: <2022320755.117296.1332987678401.JavaMail.root@zimbra.anl.gov> References: <2022320755.117296.1332987678401.JavaMail.root@zimbra.anl.gov> Message-ID: So it looked like the first command to fail after the IOException in the log was qstat. It also couldn't open any new wrapper files for the jobs. On Mar 28, 2012, at 9:21 PM, Michael Wilde wrote: > Now that I think about it, I suspect the pipes may be from Swift running various commands, like qsub/qstat from the localscheduler provider, and/or app() calls from the local execution provider. I dint know if we ever paid much attention whether these were all getting cleaned up. > > - Mike > > ----- Original Message ----- >> From: "Michael Wilde" >> To: "David Kelly" >> Cc: "swift-devel at ci.uchicago.edu Devel" >> Sent: Wednesday, March 28, 2012 9:10:38 PM >> Subject: Re: [Swift-devel] Coaster socket issue >> I think that on Jon's Beagle runs we say about 100 pipes but several >> thousand sockets, so we didnt pay any attention to the pipes (yet). >> >> The sockets were clearly from workers to the coaster service. >> >> I have no idea yet what the pipes are. ls -l of /proc/fd/ does a nice >> job of trying to identify and format the file name or object >> associated with each file descriptor. I suspect its doing the same >> thing lsof does. 
>> >> - Mike >> >> ----- Original Message ----- >>> From: "David Kelly" >>> To: "Jonathan Monette" >>> Cc: "swift-devel at ci.uchicago.edu Devel" >>> >>> Sent: Wednesday, March 28, 2012 8:49:21 PM >>> Subject: Re: [Swift-devel] Coaster socket issue >>> Strange, I just ran into a similar issues tonight while running on >>> ibicluster (SGE). I saw the "too many open files" error after >>> sitting >>> in the queue waiting for a job to start. I restarted the job and >>> then >>> periodically ran 'lsof' to see the number of java pipes increasing >>> over time. I thought at first this might be SGE specific, but >>> perhaps >>> it is something else. (This was with 0.93) >>> >>> ----- Original Message ----- >>>> From: "Jonathan Monette" >>>> To: "swift-devel at ci.uchicago.edu Devel" >>>> >>>> Sent: Wednesday, March 28, 2012 8:30:52 PM >>>> Subject: [Swift-devel] Coaster socket issue >>>> Hello, >>>> In running the SciColSim app on raven(which is a cluster similar >>>> to >>>> Beagle) I noticed that the app hung. It was not hung where the >>>> hang >>>> checker kicked in but Swift was waiting for jobs to be active but >>>> there was none submitted to PBS. I took a look at the log file and >>>> noticed that I had a java.io.IOException thrown for "too many open >>>> files". Since I killed it I couldn't probe the run but I had the >>>> same >>>> run running on Beagle. Upon Mike's suggestion I took a look at the >>>> /proc//fd directory. There were over 2000 sockets in the >>>> CLOSE_WAIT state with a single message in the receive queue. Raven >>>> has >>>> a limit of 1024 open files at a time while Beagle has a limit >>>> around >>>> 60K number of files open. I got this limit using ulimit -n. >>>> >>>> So my question is, why is there so many sockets waiting to be >>>> closed? >>>> I did some reading about the CLOSE_WAIT state and it seems this >>>> happens when one of the ends closes there socket but the other >>>> does >>>> not. 
Is Coaster not closing the socket when a worker shuts down? >>>> What >>>> other information should I be looking for to help debug the issue. >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Wed Mar 28 21:30:05 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 28 Mar 2012 21:30:05 -0500 (CDT) Subject: [Swift-devel] Coaster socket issue In-Reply-To: <102462288.77460.1332987091069.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <690810736.117300.1332988205375.JavaMail.root@zimbra.anl.gov> Does ls -l /proc/14589/fd tell you anything more? Sounds to me like swift is trying to qstat a qsub'ed job. Perhaps some incompatibility between the SGE provider and the local SGE release? We've seen similar things with older (or newer) SGE releases. (I think you in fact diagnosed some of these issues as I recall...)
- Mike

----- Original Message -----
> From: "David Kelly"
> To: "Jonathan Monette"
> Cc: "swift-devel at ci.uchicago.edu Devel"
> Sent: Wednesday, March 28, 2012 9:11:31 PM
> Subject: Re: [Swift-devel] Coaster socket issue
> The limit here seems to be 1024.
>
> Just curious, what happens when you run 'lsof -u jonmon'? For me, I
> see lines like this that grow over time:
>
> java 14589 dkelly 220r FIFO 0,6 601514288 pipe
> java 14589 dkelly 221r FIFO 0,6 601514581 pipe
> java 14589 dkelly 222w FIFO 0,6 601514852 pipe
> java 14589 dkelly 223r FIFO 0,6 601514582 pipe
>
> ----- Original Message -----
> > From: "Jonathan Monette"
> > To: "David Kelly"
> > Cc: "swift-devel at ci.uchicago.edu Devel"
> > Sent: Wednesday, March 28, 2012 8:57:03 PM
> > Subject: Re: [Swift-devel] Coaster socket issue
> > What is the open files limit on that machine (ulimit -n)? I have
> > never witnessed this issue before, so it may only appear on machines
> > with relatively low open file limits (raven has 1K but beagle has
> > 60K). This is still something we should look into, though.
> >
> > On Mar 28, 2012, at 8:49 PM, David Kelly wrote:
> >
> > > Strange, I just ran into a similar issue tonight while running on
> > > ibicluster (SGE). I saw the "too many open files" error after
> > > sitting in the queue waiting for a job to start. I restarted the
> > > job and then periodically ran 'lsof' to see the number of java
> > > pipes increasing over time. I thought at first this might be SGE
> > > specific, but perhaps it is something else. (This was with 0.93)
> > >
> > > ----- Original Message -----
> > >> From: "Jonathan Monette"
> > >> To: "swift-devel at ci.uchicago.edu Devel"
> > >> Sent: Wednesday, March 28, 2012 8:30:52 PM
> > >> Subject: [Swift-devel] Coaster socket issue
> > >> Hello,
> > >> In running the SciColSim app on raven (which is a cluster similar
> > >> to Beagle) I noticed that the app hung. It was not hung where the
> > >> hang checker kicked in, but Swift was waiting for jobs to be
> > >> active and there were none submitted to PBS. I took a look at the
> > >> log file and noticed that I had a java.io.IOException thrown for
> > >> "too many open files". Since I killed it I couldn't probe the run,
> > >> but I had the same run running on Beagle. Upon Mike's suggestion I
> > >> took a look at the /proc//fd directory. There were over 2000
> > >> sockets in the CLOSE_WAIT state with a single message in the
> > >> receive queue. Raven has a limit of 1024 open files at a time
> > >> while Beagle has a limit of around 60K open files. I got this
> > >> limit using ulimit -n.
> > >>
> > >> So my question is, why are there so many sockets waiting to be
> > >> closed? I did some reading about the CLOSE_WAIT state and it seems
> > >> this happens when one of the ends closes its socket but the other
> > >> does not. Is Coaster not closing the socket when a worker shuts
> > >> down? What other information should I be looking for to help
> > >> debug the issue?

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From svemalayan at yahoo.com Wed Mar 28 21:59:46 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Wed, 28 Mar 2012 19:59:46 -0700 (PDT)
Subject: [Swift-devel] Montage workload
Message-ID: <1332989986.60713.YahooMailNeo@web39504.mail.mud.yahoo.com>

Hi Jon, Zhao and All,

I am planning to do large-scale Montage-Swift runs. Do you have a typical Montage workload used in large-scale experiments (input image files)? If so, could you please share it with me?

Thank you
Emalayan

From jonmon at mcs.anl.gov Wed Mar 28 23:06:11 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Wed, 28 Mar 2012 23:06:11 -0500
Subject: [Swift-devel] Montage workload
In-Reply-To: <1332989986.60713.YahooMailNeo@web39504.mail.mud.yahoo.com>
References: <1332989986.60713.YahooMailNeo@web39504.mail.mud.yahoo.com>
Message-ID: <60138EF6-922F-4A4C-87E1-D42CCBBE433A@mcs.anl.gov>

I have one I normally use. I'll get it to you tomorrow morning. I have to create a tarball for you as it is large.

From svemalayan at yahoo.com Thu Mar 29 02:12:48 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 29 Mar 2012 00:12:48 -0700 (PDT)
Subject: [Swift-devel] Montage workload
In-Reply-To: <60138EF6-922F-4A4C-87E1-D42CCBBE433A@mcs.anl.gov>
References: <1332989986.60713.YahooMailNeo@web39504.mail.mud.yahoo.com> <60138EF6-922F-4A4C-87E1-D42CCBBE433A@mcs.anl.gov>
Message-ID: <1333005168.31151.YahooMailNeo@web39505.mail.mud.yahoo.com>

Thank you Jon.

From svemalayan at yahoo.com Thu Mar 29 10:59:41 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 29 Mar 2012 08:59:41 -0700 (PDT)
Subject: [Swift-devel] Data-aware scheduling in Swift ?
Message-ID: <1333036781.9178.YahooMailNeo@web39508.mail.mud.yahoo.com>

Hi All,

I have a question about how Swift schedules computations.
Suppose there are two computation stages, Stage-A and Stage-B, in an application. Stage-A produces the data and Stage-B consumes it. Could you please tell me how Swift schedules these computations? Does it schedule Stage-A and Stage-B on the same node or on multiple nodes? Is it possible to configure Swift to schedule these computations on the same node (or is this the default behavior of Swift)?

Thank you
Emalayan

From jonmon at mcs.anl.gov Thu Mar 29 11:08:22 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 29 Mar 2012 11:08:22 -0500
Subject: [Swift-devel] Data-aware scheduling in Swift ?
In-Reply-To: <1333036781.9178.YahooMailNeo@web39508.mail.mud.yahoo.com>
References: <1333036781.9178.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID:

I believe that Swift will schedule work on any node that becomes available for work. It does not give preference to a node in order to take advantage of data locality.

As far as I know there is no way to configure Swift to use data-aware scheduling. Placing work on any available node is Swift's default behavior.
From wilde at mcs.anl.gov Thu Mar 29 11:15:54 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 29 Mar 2012 11:15:54 -0500 (CDT)
Subject: [Swift-devel] Data-aware scheduling in Swift ?
In-Reply-To: <1333036781.9178.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID: <1600398213.118229.1333037754156.JavaMail.root@zimbra.anl.gov>

Swift will place an app() call on any free node. (As Jon just replied, while I was writing this...)

If we want to do an experiment with some kind of data affinity, we can try the following hack:

- Stage-A returns the node that it ran on
- the Swift script passes that as an arg "preferredNode(nodeName)" to Stage-B
- the scheduler tries to place Stage-B on the coaster named nodeName.

It's that last part that's the trickiest, as this will require a mod to the scheduler. And it gets trickier if the scheduler needs to try to defer Stage-B until nodeName can take a new job. It *might* be easier, in a first pass, to only place Stage-B on nodeName if nodeName has a free job slot, else to place it anywhere.

But all of this will require going into the coaster scheduler code. I suggest we do this as a joint effort; I can try, with help from Mihael and Justin, to locate the code that we'd need to modify, if you are willing to do some experiments and hacking.

- Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wozniak at mcs.anl.gov Thu Mar 29 11:24:38 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Thu, 29 Mar 2012 11:24:38 -0500 (CDT)
Subject: [Swift-devel] Data-aware scheduling in Swift ?
In-Reply-To:
References: <1333036781.9178.YahooMailNeo@web39508.mail.mud.yahoo.com>
Message-ID:

Right, this feature would require changes to Coasters. In the Coasters internal abstraction, a worker Cpu requests work from the queue manager. Coasters currently uses only two criteria to select the job: wall time and, if MPI is used, the MPI process count. For this task, we would add a selector based on information about file locations.

Do we have scripts/numbers for the difference between a local MosaStore access and a remote access?

Justin

--
Justin M Wozniak

From svemalayan at yahoo.com Thu Mar 29 12:06:22 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 29 Mar 2012 10:06:22 -0700 (PDT)
Subject: [Swift-devel] Data-aware scheduling in Swift ?
In-Reply-To: <1600398213.118229.1333037754156.JavaMail.root@zimbra.anl.gov>
References: <1333036781.9178.YahooMailNeo@web39508.mail.mud.yahoo.com> <1600398213.118229.1333037754156.JavaMail.root@zimbra.anl.gov>
Message-ID: <1333040782.55930.YahooMailNeo@web39507.mail.mud.yahoo.com>

Thank you Jon, Mike and Justin.

Having this functionality would be really useful for us to demonstrate how useful extended attributes are in MosaStore in the long term. Further, for our SC paper this is critical functionality, and we need it to support both the pipeline and reduce patterns.

Mike: I will be happy to help with this. In terms of effort and priorities, how much time do we need to spend to get this done? Is it feasible to target this for our SC paper?

Justin: We do have numbers for the difference between a local MosaStore access and a remote access on our cluster. This is what we published in CCGrid 2012 (I have attached the paper). But we do not have numbers on BG/P. I can try it on BG/P and get back to you.

Matei: Do you have any suggestions?

Thank you
Emalayan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PID2235315.pdf
Type: application/pdf
Size: 368279 bytes
Desc: not available
URL:

From jonmon at mcs.anl.gov Thu Mar 29 15:43:21 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 29 Mar 2012 15:43:21 -0500
Subject: [Swift-devel] Montage workload
In-Reply-To: <1333005168.31151.YahooMailNeo@web39505.mail.mud.yahoo.com>
References: <1332989986.60713.YahooMailNeo@web39504.mail.mud.yahoo.com> <60138EF6-922F-4A4C-87E1-D42CCBBE433A@mcs.anl.gov> <1333005168.31151.YahooMailNeo@web39505.mail.mud.yahoo.com>
Message-ID: <702C9A7A-68BE-47D6-9BD2-D71B3062F30C@mcs.anl.gov>

Sorry it took a while. The tarball is located at /pvfs-surveyor/jonmon/montage-big-dataset.tar.gz

I believe this is about 4000 images.

From svemalayan at yahoo.com Thu Mar 29 18:13:43 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 29 Mar 2012 16:13:43 -0700 (PDT)
Subject: [Swift-devel] Montage workload
In-Reply-To: <702C9A7A-68BE-47D6-9BD2-D71B3062F30C@mcs.anl.gov>
References: <1332989986.60713.YahooMailNeo@web39504.mail.mud.yahoo.com> <60138EF6-922F-4A4C-87E1-D42CCBBE433A@mcs.anl.gov> <1333005168.31151.YahooMailNeo@web39505.mail.mud.yahoo.com> <702C9A7A-68BE-47D6-9BD2-D71B3062F30C@mcs.anl.gov>
Message-ID: <1333062823.67452.YahooMailNeo@web39507.mail.mud.yahoo.com>

Thank you very much Jon.

From matei.ripeanu at gmail.com Fri Mar 30 06:04:03 2012
From: matei.ripeanu at gmail.com (Matei Ripeanu)
Date: Fri, 30 Mar 2012 13:04:03 +0200
Subject: [Swift-devel] Data-aware scheduling in Swift ?
In-Reply-To: <1333040782.55930.YahooMailNeo@web39507.mail.mud.yahoo.com>
References: <1333036781.9178.YahooMailNeo@web39508.mail.mud.yahoo.com> <1600398213.118229.1333037754156.JavaMail.root@zimbra.anl.gov> <1333040782.55930.YahooMailNeo@web39507.mail.mud.yahoo.com>
Message-ID: <000d01cd0e64$d5047510$7f0d5f30$@gmail.com>

Emalayan, Mike, Justin, all,

There are a number of points worth discussing before we fully embark on this:

First: We need to better understand what gains we expect to have on BG/P from locality. We know we have sizeable gains on our cluster with data stored on disk (and where we have much lower cross-section bandwidth). I expect that most of these gains are preserved when we use RAM disks on our cluster, and that they will remain as long as we do not have to transfer huge volumes of data. Unfortunately we can test this only with 20 nodes - I have no good intuition about what will happen on BG/P at large scale.

Second: We should discuss how key having this feature in Swift on BG/P is for all the other points we want to prove for the paper.
I think support for only one of the patterns we want to optimize with cross-layer communication can be demonstrated without it (e.g., broadcast), while the other two (pipeline and gather) cannot. On the other hand, is there a way to run our benchmark scripts on BG/P (I guess not) to demonstrate the potential gains if Swift implemented this? Or can we run (some of) the applications without Swift on our cluster?

Third: I am afraid getting this functionality into Swift/Coasters is quite some work. On the other hand, Mike suggests a relatively clear implementation path. (It will probably work for pipelines, but I'm not sure it will work for 'gather'.)

What I suggest: Let's discuss three things among ourselves before embarking on changing Swift/Coasters: (1) increase the certainty that we'll see performance gains if we implement this, (2) see whether there are ways to demonstrate (some of) what we want outside Swift, and (3) re-evaluate the schedule and priorities - we have roughly four weeks to the deadline.

Let me know what you think,
-Matei

From svemalayan at yahoo.com Fri Mar 30 19:10:22 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Fri, 30 Mar 2012 17:10:22 -0700 (PDT)
Subject: [Swift-devel] Data-aware scheduling in Swift ?
In-Reply-To: <000d01cd0e64$d5047510$7f0d5f30$@gmail.com>
References: <1333036781.9178.YahooMailNeo@web39508.mail.mud.yahoo.com> <1600398213.118229.1333037754156.JavaMail.root@zimbra.anl.gov> <1333040782.55930.YahooMailNeo@web39507.mail.mud.yahoo.com> <000d01cd0e64$d5047510$7f0d5f30$@gmail.com>
Message-ID: <1333152622.59640.YahooMailNeo@web39508.mail.mud.yahoo.com>

Hi Matei,

I am currently working on evaluating the gains from locality on BG/P. I should be able to get some numbers today or tomorrow. This will help us make decisions.

Thank you
Emalayan
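
The first-pass placement rule discussed in this thread (use the preferred node only if it currently has a free job slot, otherwise place the job anywhere) can be sketched as below. This is a hypothetical Python illustration, not Coasters code: pick_node, free_slots and preferred are invented names, and the real change would have to be made in the coaster scheduler itself.

```python
def pick_node(free_slots, preferred=None):
    """Choose a node for the next job.

    free_slots -- dict mapping node name to its number of free job slots
                  (hypothetical bookkeeping, assumed to be kept up to date)
    preferred  -- node name reported by the producer stage, or None
    """
    # Data affinity: honor the preferred node only when it can start
    # the job right away.
    if preferred is not None and free_slots.get(preferred, 0) > 0:
        return preferred
    # Fallback: current behavior, any node with a free slot
    # (first one in the dict's insertion order).
    for node, slots in free_slots.items():
        if slots > 0:
            return node
    # Nothing is free; the caller has to queue the job.
    return None
```

Because the job is never deferred to wait for the preferred node, this variant cannot reduce utilization; the cost is that Stage-B may still run away from its data whenever the producing node is busy.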