From svemalayan at yahoo.com Tue Jan 10 20:50:21 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Tue, 10 Jan 2012 18:50:21 -0800 (PST)
Subject: [Swift-user] Running swift application on BG/P
Message-ID: <1326250221.99895.YahooMailNeo@web39503.mail.mud.yahoo.com>
Dear All,
I am trying to run a simple Swift program (which prints "Hello World") on BG/P (Surveyor). I modified the site catalog, transformation catalog and config files, and launched the Swift application from the head node using the command below.
swift -config cf -tc.file tc -sites.file sites.xml first.swift
I can see that Swift allocates some nodes and tries to run the application, but the result is not generated. I also couldn't see any errors in the log files or stdout.
I suspect the workers might not be able to connect to the coaster service. Or could it be a problem with the catalog files?
Please find my site.xml, tc.data, cf and application files below, and let me know if I am making any mistakes.
Thank you
Emalayan
first.swift
type messagefile;
app (messagefile t) greeting() {
    echo "Hello, world!" stdout=@filename(t);
}
messagefile outfile <"hello.txt">;
outfile = greeting();
site.xml
    MTCScienceApps
    default
    zeptoos
    true
    21
    10000
    1
    DEBUG
    1
    900
    64
    64
    /home/emalayan/app/forEmalayan_ccGrdid/swift.workdir
tc:
surveyor        echo            /bin/echo       INSTALLED       INTEL32::LINUX
surveyor        cat             /bin/cat        INSTALLED       INTEL32::LINUX
surveyor        ls              /bin/ls         INSTALLED       INTEL32::LINUX
surveyor        grep            /bin/grep       INSTALLED       INTEL32::LINUX
surveyor        sort            /bin/sort       INSTALLED       INTEL32::LINUX
surveyor        paste           /bin/paste      INSTALLED       INTEL32::LINUX
surveyor        wc              /usr/bin/wc     INSTALLED       INTEL32::LINUX
surveyor        perl            /usr/bin/perl   INSTALLED       INTEL32::LINUX
#surveyor do_merge /home/emalayan/app/forEmalayan_ccGrdid/app/modmerge null null null
#surveyor score /home/emalayan/app/forEmalayan_ccGrdid/app/Scoring/scoredat.exe null null null
surveyor        modftdock       /home/emalayan/app/forEmalayan_ccGrdid/app/modftdock.sh null null null
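Swift looks up each app name used in first.swift in this transformation catalog, and the first column is expected to match the pool handle in the sites file. A minimal sketch, assuming the whitespace-separated layout shown above (site, transformation, path, install type, platform, profiles), for checking that the `echo` app resolves for site `surveyor`. The `parse_tc` helper is hypothetical and is not part of Swift.

```python
from typing import Dict, Tuple


# Hypothetical helper, not part of Swift: reads a tc.data-style catalog.
# Assumed columns: site, transformation, path, install type, platform, profiles.
def parse_tc(text: str) -> Dict[Tuple[str, str], str]:
    """Map (site, transformation) to the executable path, skipping comments."""
    entries: Dict[Tuple[str, str], str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out entry
        fields = line.split()
        if len(fields) < 3:
            continue  # too short to name a site, transformation and path
        site, name, path = fields[0], fields[1], fields[2]
        entries[(site, name)] = path
    return entries


TC_SAMPLE = """\
surveyor  echo   /bin/echo  INSTALLED  INTEL32::LINUX
surveyor  cat    /bin/cat   INSTALLED  INTEL32::LINUX
#surveyor score /home/emalayan/app/forEmalayan_ccGrdid/app/Scoring/scoredat.exe null null null
surveyor  modftdock  /home/emalayan/app/forEmalayan_ccGrdid/app/modftdock.sh null null null
"""

if __name__ == "__main__":
    catalog = parse_tc(TC_SAMPLE)
    # first.swift only calls `echo`, so this lookup must succeed for the site.
    print(catalog.get(("surveyor", "echo")))  # /bin/echo
```

If the lookup returns `None` for the site handle actually used in the sites file, the app has no catalog entry there.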
cf:
wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=1
lazy.errors=true
status.mode=provider
use.provider.staging=true
provider.staging.pin.swiftfiles=false
foreach.max.threads=100
provenance.log=false
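To test the suspicion that workers cannot connect to the coaster service, a plain TCP probe run from a host on the same network path as the workers can help separate networking problems from catalog problems. A minimal sketch: `can_connect` is a hypothetical helper, and the host and port below are placeholders; the real coaster service address would presumably come from the Swift log.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable network, timeout, DNS failure.
        return False


if __name__ == "__main__":
    # Placeholder values: substitute the coaster service host and port.
    print(can_connect("localhost", 50000))
```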
From iraicu at cs.iit.edu Sat Jan 14 08:10:30 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Sat, 14 Jan 2012 08:10:30 -0600
Subject: [Swift-user] CFP: ACM HPDC 2012, abstracts due January 16th, 2012
Message-ID: <4F118CD6.9090905@cs.iit.edu>
**** CALL FOR PAPERS ****
The 21st International ACM Symposium on
High-Performance Parallel and Distributed Computing
(HPDC'12)
Delft University of Technology, Delft, the Netherlands
June 18-22, 2012
http://www.hpdc.org/2012
The ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC)
is the premier annual conference on the design, the implementation, the evaluation, and
the use of parallel and distributed systems for high-end computing. HPDC'12 will take place
in Delft, the Netherlands, a historical, picturesque city that is less than one hour away
from Amsterdam-Schiphol airport. The conference will be held on June 20-22 (Wednesday to
Friday), with affiliated workshops taking place on June 18-19 (Monday and Tuesday).
**** SUBMISSION DEADLINES ****
Abstracts: 16 January 2012
Papers: 23 January 2012 (No extensions!)
**** HPDC'12 GENERAL CHAIR ****
Dick Epema, Delft University of Technology, Delft, the Netherlands
**** HPDC'12 PROGRAM CO-CHAIRS ****
Thilo Kielmann, Vrije Universiteit, Amsterdam, the Netherlands
Matei Ripeanu, The University of British Columbia, Vancouver, Canada
**** HPDC'12 WORKSHOPS CHAIR ****
Alexandru Iosup, Delft University of Technology, Delft, the Netherlands
**** SCOPE AND TOPICS ****
Submissions are welcomed on all forms of high-performance parallel and distributed computing,
including but not limited to clusters, clouds, grids, utility computing, data-intensive
computing, and massively multicore systems. Submissions that explore solutions to estimate
and reduce the energy footprint of such systems are particularly encouraged. All papers
will be evaluated for their originality, potential impact, correctness, quality of
presentation, appropriate presentation of related work, and relevance to the conference,
with a strong preference for rigorous results obtained in operational parallel and
distributed systems.
The topics of interest of the conference include, but are not limited to, the following,
in the context of high-performance parallel and distributed computing:
- Systems, networks, and architectures for high-end computing
- Massively multicore systems
- Virtualization of machines, networks, and storage
- Programming languages and environments
- I/O, storage systems, and data management
- Resource management, energy and cost minimizations
- Performance modeling and analysis
- Fault tolerance, reliability, and availability
- Data-intensive computing
- Applications of parallel and distributed computing
**** PAPER SUBMISSION GUIDELINES ****
Authors are invited to submit technical papers of at most 12 pages in PDF format, including
figures and references. Papers should be formatted in the ACM Proceedings Style and
submitted via the conference web site. No changes to the margins, spacing, or font sizes as
specified by the style file are allowed. Accepted papers will appear in the conference
proceedings, and will be incorporated into the ACM Digital Library. A limited number of
papers will be accepted as posters.
Papers must be self-contained and provide the technical substance required for the program
committee to evaluate their contributions. Submitted papers must be original work that has
not appeared in and is not under consideration for another conference or a journal. See the
ACM Prior Publication Policy for more details.
**** IMPORTANT DATES ****
Abstracts Due: 16 January 2012
Papers Due: 23 January 2012 (No extensions!)
Reviews Released to Authors: 8 March 2012
Author Rebuttals Due: 12 March 2012
Author Notifications: 19 March 2012
Final Papers Due: 16 April 2012
Conference Dates: 18-22 June 2012
--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================
=================================================================
From iraicu at cs.iit.edu Sat Jan 14 12:00:43 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Sat, 14 Jan 2012 12:00:43 -0600
Subject: [Swift-user] CFP: IEEE eScience 2012 in Chicago IL USA
Message-ID: <4F11C2CB.1070809@cs.iit.edu>
Call for Papers
8th IEEE International Conference on eScience
October 8-12, 2012
Chicago, IL, USA
Researchers in all disciplines are increasingly adopting digital tools,
techniques and practices, often in communities and projects that span
disciplines, laboratories, organizations, and national boundaries. The
eScience 2012 conference is designed to bring together leading
international and interdisciplinary research communities, developers,
and users of eScience applications and enabling IT technologies. The
conference serves as a forum to present the results of the latest
applications research and product/tool developments and to highlight
related activities from around the world. Also, we are now entering the
second decade of eScience and the 2012 conference gives an opportunity
to take stock of what has been achieved so far and look forward to the
challenges and opportunities the next decade will bring.
A special emphasis of the 2012 conference is on advances in the
application of technology in a particular discipline. Accordingly,
significant advances in applications science and technology will be
considered as important as the development of new technologies
themselves. Further, we welcome contributions in educational activities
under any of these disciplines.
As a result, the conference will be structured around two e-Science tracks:
* *eScience Algorithms and Applications*
o eScience application areas, including:
+ Physical sciences
+ Biomedical sciences
+ Social sciences and humanities
o Data-oriented approaches and applications
o Compute-oriented approaches and applications
o Extreme scale approaches and applications
* *Cyberinfrastructure to support eScience*
o Novel hardware
o Novel uses of production infrastructure
o Software and services
o Tools
The conference proceedings will be published by the IEEE Computer
Society Press, USA and will be made available online through the IEEE
Digital Library. Selected papers will be invited to submit extended
versions to a special issue of the Future Generation Computer Systems
(FGCS) journal.
SUBMISSION PROCESS
Authors are invited to submit papers with unpublished, original work of
not more than 8 pages of double column text using single spaced 10 point
size on 8.5 x 11 inch pages, as per IEEE 8.5 x 11 manuscript guidelines.
(Up to 2 additional pages may be purchased for US$150/page)
Templates are available from
http://www.ieee.org/conferences_events/conferences/publishing/templates.html.
Authors should submit a PDF file that will print on a PostScript printer
to https://www.easychair.org/conferences/?conf=escience2012
(Note that paper submitters also must submit an abstract in advance of
the paper deadline. This should be done through the same site where
papers are submitted.)
It is a requirement that at least one author of each accepted paper
attend the conference.
ORGANIZATION
General Chair
* *Ian Foster*, University of Chicago & Argonne National Laboratory, USA
Program Co-Chairs
* *Daniel S. Katz*, University of Chicago & Argonne National
Laboratory, USA
* *Heinz Stockinger*, SIB Swiss Institute of Bioinformatics, Switzerland
Program Vice Co-Chairs
* eScience Algorithms and Applications Track
o *David Abramson*, Monash University, Australia
o *Gabrielle Allen*, Louisiana State University, USA
* Cyberinfrastructure to support eScience Track
o *Rosa M. Badia*, Barcelona Supercomputing Center / CSIC, Spain
o *Geoffrey Fox*, Indiana University, USA
Sponsorship Chair
* *Charlie Catlett*, Argonne National Laboratory, USA
Conference Manager and Finance Chair
* *Julie Wulf-Knoerzer*, University of Chicago & Argonne National
Laboratory, USA
Publicity Chairs
* *Kento Aida*, National Institute of Informatics, Japan
* *Ioan Raicu*, Illinois Institute of Technology, USA
* *David Wallom*, Oxford e-Research Centre, UK
Local Organizing Committee
* *Ninfa Mayorga*, University of Chicago, USA
* *Evelyn Rayburn*, University of Chicago, USA
* *Lynn Valentini*, Argonne National Laboratory, USA
Program Committee
* eScience Algorithms and Applications Track
o *Srinivas Aluru*, Iowa State University, USA
o *Ashiq Anjum*, University of Derby, UK
o *David A. Bader*, Georgia Institute of Technology, USA
o *Jon Blower*, University of Reading, UK
o *Paul Bonnington*, Monash University, Australia
o *Simon Cox*, University of Southampton, UK
o *David De Roure*, Oxford e-Research Centre, UK
o *George Djorgovski*, California Institute of Technology, USA
o *Anshu Dubey*, University of Chicago & Argonne National
Laboratory, USA
o *Yuri Estrin*, Monash University, Australia
o *Dan Fay*, Microsoft, USA
o *Jeremy Frey*, University of Southampton, UK
o *Wolfgang Gentzsch*, HPC Consultant, Germany
o *Lutz Gross*, The University of Queensland, Australia
o *Sverker Holmgren*, Uppsala University, Sweden
o *Bill Howe*, University of Washington, USA
o *Marina Jirotka*, University of Oxford, UK
o *Timoleon Kipouros*, University of Cambridge, UK
o *Kerstin Kleese van Dam*, Pacific Northwest National Laboratory, USA
o *Arun S. Konagurthu*, Monash University, Australia
o *Peter Kunszt*, SystemsX.ch, Switzerland
o *Alexey Lastovetsky*, University College Dublin, Ireland
o *Andrew Lewis*, Griffith University, Australia
o *Sergio Maffioletti*, University of Zurich, Switzerland
o *Amitava Majumdar*, San Diego Supercomputer Center, University
of California at San Diego, USA
o *Rui Mao*, Shenzhen University, China
o *Madhav V. Marathe*, Virginia Tech, USA
o *Maryann Martone*, University of California at San Diego, USA
o *Louis Moresi*, Monash University, Australia
o *Riccardo Murri*, University of Zurich, Switzerland
o *Silvia D. Olabarriaga*, Academic Medical Center of the
University of Amsterdam, Netherlands
o *Enrique S. Quintana-Ortí*, Universidad Jaume I, Spain
o *Abani Patra*, University at Buffalo, USA
o *Rob Pennington*, NSF, USA
o *Andrew Perry*, Monash University, Australia
o *Beth Plale*, Indiana University, USA
o *Michael Resch*, University of Stuttgart, Germany
o *Adrian Sandu*, Virginia Tech, USA
o *Mark Savill*, Cranfield University, UK
o *Erik Schnetter*, Perimeter Institute for Theoretical Physics,
Canada
o *Edward Seidel*, Louisiana State University, USA
o *Suzanne M. Shontz*, The Pennsylvania State University, USA
o *David Skinner*, Lawrence Berkeley National Laboratory, USA
o *Alan Sussman*, University of Maryland, USA
o *Alex Szalay*, Johns Hopkins University, USA
o *Domenico Talia*, ICAR-CNR & University of Calabria, Italy
o *Jian Tao*, Louisiana State University, USA
o *David Wallom*, Oxford e-Research Centre, UK
o *Shaowen Wang*, University of Illinois at Urbana-Champaign, USA
o *Michael Wilde*, Argonne National Laboratory & University of
Chicago, USA
o *Nancy Wilkins-Diehr*, San Diego Supercomputer Center,
University of California at San Diego, USA
o *Wu Zhang*, Shanghai University, China
o *Yunquan Zhang*, Chinese Academy of Sciences, China
* Cyberinfrastructure to support eScience Track
o *Deb Agarwal*, Lawrence Berkeley National Laboratory, USA
o *Ilkay Altintas*, San Diego Supercomputer Center, University of
California at San Diego, USA
o *Henri Bal*, Vrije Universiteit, Netherlands
o *Roger Barga*, Microsoft, USA
o *Martin Berzins*, University of Utah, USA
o *John Brooke*, University of Manchester, UK
o *Thomas Fahringer*, University of Innsbruck, Austria
o *Gilles Fedak*, INRIA, France
o *José A. B. Fortes*, University of Florida, USA
o *Yolanda Gil*, ISI/USC, USA
o *Madhusudhan Govindaraju*, SUNY Binghamton, USA
o *Thomas Hacker*, Purdue University, USA
o *Ken Hawick*, Massey University, New Zealand
o *Marty Humphrey*, University of Virginia, USA
o *Hai Jin*, Huazhong University of Science and Technology, China
o *Thilo Kielmann*, Vrije Universiteit, Netherlands
o *Scott Klasky*, Oak Ridge National Laboratory, USA
o *Isao Kojima*, AIST, Japan
o *Tevfik Kosar*, University at Buffalo, USA
o *Dieter Kranzlmueller*, LMU & LRZ Munich, Germany
o *Erwin Laure*, KTH, Sweden
o *Jysoo Lee*, KISTI, Korea
o *Li Xiaoming*, Peking University, China
o *Bertram Ludäscher*, University of California, Davis, USA
o *Andrew Lumsdaine*, Indiana University, USA
o *Tanu Malik*, University of Chicago, USA
o *Satoshi Matsuoka*, Tokyo Institute of Technology, Japan
o *Reagan Moore*, University of North Carolina at Chapel Hill, USA
o *Shirley Moore*, University of Kentucky, USA
o *Steven Newhouse*, EGI, Netherlands
o *Dhabaleswar K. (DK) Panda*, The Ohio State University, USA
o *Manish Parashar*, Rutgers University, USA
o *Ron Perrott*, University of Oxford, UK
o *Depei Qian*, Beihang University, China
o *Judy Qiu*, Indiana University, USA
o *Ioan Raicu*, Illinois Institute of Technology, USA
o *Lavanya Ramakrishnan*, Lawrence Berkeley National Laboratory, USA
o *Omer Rana*, Cardiff University, UK
o *Paul Roe*, Queensland University of Technology, Australia
o *Bruno Schulze*, LNCC, Brazil
o *Marc Snir*, Argonne National Laboratory & University of
Illinois at Urbana-Champaign, USA
o *Xian-He Sun*, Illinois Institute of Technology, USA
o *Yoshio Tanaka*, AIST, Japan
o *Michela Taufer*, University of Delaware, USA
o *Kerry Taylor*, CSIRO, Australia
o *Douglas Thain*, University of Notre Dame, USA
o *Paul Watson*, Newcastle University, UK
o *Jun Zhao*, University of Oxford, UK
From iraicu at cs.iit.edu Sat Jan 14 21:58:00 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Sat, 14 Jan 2012 21:58:00 -0600
Subject: [Swift-user] Call for Workshops at IEEE eScience, due January 23,
2012
Message-ID: <4F124EC8.2050308@cs.iit.edu>
Call for Workshops
8th IEEE International Conference on eScience
October 8-12, 2012
Chicago, IL, USA
The 8th IEEE eScience conference (e-Science 2012), sponsored by the IEEE
Computer Society's Technical Committee for Scalable Computing (TCSC),
will be held in Chicago, Illinois from 8-12th October 2012. The eScience
2012 conference is designed to bring together leading international and
interdisciplinary research communities, developers, and users of
eScience applications and enabling IT technologies.
Multiple e-Science 2012 Workshops will be held on Monday and Tuesday,
8th and 9th October, co-located with the main conference.
Workshops are an important part of the conference in providing
opportunity for researchers to present their work in a more focused way
than the conference itself and to have discussion of particular topics
of interest to the community. We cordially invite you to submit workshop
proposals on any eScience related topic to the Workshop Chair.
To help those interested know their purpose and scope, workshop
proposals should include:
* A description of the workshop, its focus, goals, and outcome
* A draft call for papers
* Names and affiliations of the organizers and tentative composition
of the committees
* Expected numbers of submissions and accepted papers
* Prior history of this workshop, if any. Please include: number of
submissions, number of accepted papers, and attendee count.
Workshop organizers are responsible for establishing a program
committee, collecting and evaluating submissions, notifying authors of
acceptance or rejection in due time, ensuring a transparent and fair
selection process, organizing selected papers into sessions, and
assigning session chairs. Proposals will be selected that show clear
focus and objectives in areas of emerging or developing interest
guaranteed to generate significant interest in the community.
Once accepted, the workshop should establish its own paper submission
system. For each paper selected for publication, an author must be
registered for eScience 2012. Each paper must be presented in person by
at least one of the authors. It is expected that the proceedings of the
eScience 2012 workshops will be published by the IEEE Computer Society
Press, USA and will be made available online through the IEEE Digital
Library.
SUBMISSION PROCESS
Workshop proposals should be emailed to escience2012-workshops at fnal.gov
ORGANIZATION
General Chair
* *Ian Foster*, University of Chicago & Argonne National Laboratory, USA
Program Co-Chairs
* *Daniel S. Katz*, University of Chicago & Argonne National
Laboratory, USA
* *Heinz Stockinger*, SIB Swiss Institute of Bioinformatics, Switzerland
Workshops Chair
* *Ruth Pordes*, FNAL, USA
Sponsorship Chair
* *Charlie Catlett*, Argonne National Laboratory, USA
Conference Manager and Finance Chair
* *Julie Wulf-Knoerzer*, University of Chicago & Argonne National
Laboratory, USA
Publicity Chairs
* *Kento Aida*, National Institute of Informatics, Japan
* *Ioan Raicu*, Illinois Institute of Technology, USA
* *David Wallom*, Oxford e-Research Centre, UK
Local Organizing Committee
* *Ninfa Mayorga*, University of Chicago, USA
* *Evelyn Rayburn*, University of Chicago, USA
* *Lynn Valentini*, Argonne National Laboratory, USA
From svemalayan at yahoo.com Mon Jan 16 04:08:25 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Mon, 16 Jan 2012 02:08:25 -0800 (PST)
Subject: [Swift-user] Running Swift+ModFTDock+MosaStore on BG/P
Message-ID: <1326708505.28876.YahooMailNeo@web39507.mail.mud.yahoo.com>
Dear All,
I am trying to run Swift+ModFTDock with MosaStore, but I have some problems deploying MosaStore with Swift.
MosaStore must be deployed before the application starts. But currently the Swift script is launched from the head node and the rest is taken care of by the Swift runtime.
Is there a way to deploy MosaStore first and then launch Swift+ModFTDock on the nodes?
Or
Should I incorporate the MosaStore deployment into the Swift scripts?
Thank you very much
Emalayan
From wozniak at mcs.anl.gov Mon Jan 16 14:36:50 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 16 Jan 2012 14:36:50 -0600 (Central Standard Time)
Subject: [Swift-user] Running Swift+ModFTDock+MosaStore on BG/P
In-Reply-To: <1326708505.28876.YahooMailNeo@web39507.mail.mud.yahoo.com>
References: <1326708505.28876.YahooMailNeo@web39507.mail.mud.yahoo.com>
Message-ID:
On Mon, 16 Jan 2012, Emalayan Vairavanathan wrote:
> I am trying to run swift+ModFTDock with MosaStore. But I have some
> problems in deploying MosaStore with Swift. MosaStore should be deployed
> before application starts. But currently swift-script is launched from
> the head-node and the rest is taken care by the swift run-time.
>
> Is there a way to deploy MosaStore firstly and then launch swift+modftdock on the nodes ?
We do not currently have a technique to launch Coasters workers and
MosaStore simultaneously.
Justin
--
Justin M Wozniak
From turam at mcs.anl.gov Wed Jan 18 12:33:31 2012
From: turam at mcs.anl.gov (Thomas Uram)
Date: Wed, 18 Jan 2012 12:33:31 -0600
Subject: [Swift-user] Question re: reliance on proxy cert
Message-ID: <83743945-35DD-4AB2-A934-E97C5744916B@mcs.anl.gov>
I'm using coasters with ssh:pbs and have the proper bits in ~/.ssh/auth.defaults to support authentication, but when I run the script it fails due to a missing or expired proxy cert:
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service
Caused by: org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file (/tmp/x509up_u1154) not found.
Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file (/tmp/x509up_u1154) not found.
Why does it fail when an alternative authentication mechanism is available that would succeed? Is there an option to control this?
The complete log is here, which includes the Swift script and the sites file.
http://www.ci.uchicago.edu/~turam/hostname-20120118-1128-z3xo7eg9.log
Thanks,
Tom
From jonmon at mcs.anl.gov Wed Jan 18 12:42:09 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Wed, 18 Jan 2012 12:42:09 -0600
Subject: [Swift-user] Question re: reliance on proxy cert
In-Reply-To: <83743945-35DD-4AB2-A934-E97C5744916B@mcs.anl.gov>
References: <83743945-35DD-4AB2-A934-E97C5744916B@mcs.anl.gov>
Message-ID: <5AF5DF5B-A50D-44CC-AFA1-368DE67BEC6D@mcs.anl.gov>
Using jobmanager="ssh:pbs" starts the coaster service on the remote side, using ssh to reach the remote machine. Coasters still needs a proxy to authenticate with.
From hategan at mcs.anl.gov Wed Jan 18 12:43:40 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Jan 2012 10:43:40 -0800
Subject: [Swift-user] Question re: reliance on proxy cert
In-Reply-To: <83743945-35DD-4AB2-A934-E97C5744916B@mcs.anl.gov>
References: <83743945-35DD-4AB2-A934-E97C5744916B@mcs.anl.gov>
Message-ID: <1326912220.12093.7.camel@blabla>
On Wed, 2012-01-18 at 12:33 -0600, Thomas Uram wrote:
> I'm using coasters with ssh:pbs and have the proper bits in
> ~/.ssh/auth.defaults to support authentication, but when I run the
> script it fails due to a missing or expired proxy cert:
[...]
> Why does it fail when an alternative authentication mechanism is
> available that would succeed? Is there an option to control this?
It fails because while ssh is used to start the coaster service
executable, the connection between client and service is secured by
GSI.
This model was just peachy in the Globus scenario, where you would need
a proxy anyway to start the service and delegation could be used to
supply credentials to the coaster service.
Not so much with ssh. I've been thinking about a way to deal with this,
and I think I'm leaning towards some shared secret that could be used as
a one-time authentication token by the service. The problem is making
sure that whatever provider is used to communicate said secret to the
service remains a secret (i.e. passing it on any command line is out of
the question). But that seems to require the use of an encrypted file
transfer provider, which breaks the abstraction we have a bit, so it
might require more changes than I want to see.
So suggestions are welcome.
Mihael
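The one-time shared-secret idea above can be sketched roughly as follows. This is a hypothetical illustration, not Swift or CoG code: the client writes a random token to a new owner-only file, and the service reads it once, deletes the file, and accepts that token for a single authentication. The sketch simply assumes the file has already reached the service host securely, which is exactly the open problem described above.

```python
import hmac
import os
import secrets
import tempfile
from typing import Optional


def write_secret(path: str) -> str:
    """Client side: create a random token in a new, owner-only file."""
    token = secrets.token_hex(32)
    # O_EXCL refuses to reuse an existing file; mode 0o600 keeps others out.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(token)
    return token


class OneTimeAuthenticator:
    """Service side: load the token once, delete the file, accept it once."""

    def __init__(self, path: str) -> None:
        with open(path) as f:
            self._token: Optional[str] = f.read().strip()
        os.remove(path)  # the secret no longer lives on disk

    def check(self, presented: str) -> bool:
        if self._token is None:
            return False  # token already consumed
        # Constant-time comparison avoids leaking how many characters matched.
        ok = hmac.compare_digest(self._token.encode(), presented.encode())
        if ok:
            self._token = None  # invalidate after the first successful use
        return ok


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "coaster.secret")  # hypothetical file name
        token = write_secret(path)
        auth = OneTimeAuthenticator(path)
        print(auth.check(token))  # True
        print(auth.check(token))  # False: the token is single-use
```

Keeping the token off the command line matters because command lines are typically visible to other users through process listings.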
From wilde at mcs.anl.gov Wed Jan 18 12:49:56 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Jan 2012 12:49:56 -0600 (CST)
Subject: [Swift-user] Question re: reliance on proxy cert
In-Reply-To: <1326912220.12093.7.camel@blabla>
Message-ID: <1645351678.154107.1326912596272.JavaMail.root@zimbra.anl.gov>
Tom, for now, this means that when using automatic coasters over ssh, you need to (manually, out of band) create an x509 proxy on both sides. Either securely copy one, or run grid-proxy-init on both the client side (where you are running Swift) and on each site on which Swift will start a coaster service.
One can bypass the need for proxies when setting up manual coaster configurations with the -nosec argument to the coaster-service command.
- Mike
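The missing file in Tom's error, /tmp/x509up_u1154, follows the usual Globus convention of /tmp/x509up_u<uid>, which the X509_USER_PROXY environment variable can override. A minimal pre-flight sketch to run on each side before starting Swift, as suggested above; the helper names are hypothetical, and it only checks that the file exists, not whether the proxy has expired.

```python
import os


def default_proxy_path() -> str:
    """Return the proxy location Globus tools conventionally look for."""
    # X509_USER_PROXY, if set, overrides the default /tmp/x509up_u<uid> file.
    return os.environ.get("X509_USER_PROXY", f"/tmp/x509up_u{os.getuid()}")


def proxy_present() -> bool:
    """True if a proxy file exists; expiry is NOT checked here."""
    return os.path.isfile(default_proxy_path())


if __name__ == "__main__":
    path = default_proxy_path()
    if proxy_present():
        print(f"proxy found at {path}")
    else:
        print(f"no proxy at {path}; create one (e.g. grid-proxy-init) first")
```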
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
From jonmon at mcs.anl.gov Wed Jan 18 12:55:31 2012
From: jonmon at mcs.anl.gov (jonmon at mcs.anl.gov)
Date: Wed, 18 Jan 2012 18:55:31 +0000
Subject: [Swift-user] Question re: reliance on proxy cert
In-Reply-To: <1645351678.154107.1326912596272.JavaMail.root@zimbra.anl.gov>
References: <1326912220.12093.7.camel@blabla>
<1645351678.154107.1326912596272.JavaMail.root@zimbra.anl.gov>
Message-ID: <2145510777-1326912933-cardhu_decombobulator_blackberry.rim.net-495983841-@b15.c3.bise6.blackberry>
Is the proxy on the execution site necessary? I thought it was needed only on the client side. I don't think I create a proxy on the execution site; I only request one using MyProxy on the client side.
-----Original Message-----
From: Michael Wilde
Sender: swift-user-bounces at ci.uchicago.edu
Date: Wed, 18 Jan 2012 12:49:56
To: Mihael Hategan; Thomas Uram
Cc: swift user
Subject: Re: [Swift-user] Question re: reliance on proxy cert
Tom, for now, this means that when using automatic coasters over ssh, you need to (manually, out of band) create an x509 proxy on both sides. Either securely copy one, or run grid-proxy-init on both the client side (where you are running Swift) and on each site on which Swift will start a coaster service.
One can bypass the need for proxies when setting up manual coaster configurations with the -nosec argument to the coaster-service command.
- Mike
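A minimal Python sketch for checking, on both the client and the coaster-service host, whether a proxy is already in place before starting Swift. It assumes the usual Globus convention (the X509_USER_PROXY environment variable if set, otherwise /tmp/x509up_u<uid>) and is not part of Swift; the helper names are hypothetical. It only checks that the file exists, not that the proxy is unexpired.

```python
import os


def proxy_path():
    """Where GSI tools conventionally look for the user's proxy (assumption).

    $X509_USER_PROXY if set, otherwise /tmp/x509up_u<uid>. Unix-only.
    """
    explicit = os.environ.get("X509_USER_PROXY")
    if explicit:
        return explicit
    return "/tmp/x509up_u%d" % os.getuid()


def has_proxy():
    """True if a proxy file exists at that location.

    Existence only: the certificate's expiry time is not inspected.
    """
    return os.path.isfile(proxy_path())
```

If has_proxy() is False on either side, running grid-proxy-init there (or securely copying a proxy over) is the remedy described above.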
----- Original Message -----
> From: "Mihael Hategan"
> To: "Thomas Uram"
> Cc: "swift user"
> Sent: Wednesday, January 18, 2012 12:43:40 PM
> Subject: Re: [Swift-user] Question re: reliance on proxy cert
> On Wed, 2012-01-18 at 12:33 -0600, Thomas Uram wrote:
> > I'm using coasters with ssh:pbs and have the proper bits in
> > ~/.ssh/auth.defaults to support authentication, but when I run the
> > script it fails due to a missing or expired proxy cert:
>
> [...]
>
> > Why does it fail when an alternative authentication mechanism is
> > available that would succeed? Is there an option to control this?
>
> It fails because while ssh is used to start the coaster service
> executable, the connection between client and service is secured by
> GSI.
>
> This model was just peachy in the Globus scenario, where you would need
> a proxy anyway to start the service and delegation could be used to
> supply credentials to the coaster service.
>
> Not so much with ssh. I've been thinking about a way to deal with this,
> and I think I'm leaning towards some shared secret that could be used as
> a one-time authentication token by the service. The problem is making
> sure that whatever provider is used to communicate said secret to the
> service remains a secret (i.e. passing it on any command line is out of
> the question). But that seems to require the use of an encrypted file
> transfer provider, which breaks the abstraction we have a bit, so it
> might require more changes than I want to see.
>
> So suggestions are welcome.
>
> Mihael
>
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
From hategan at mcs.anl.gov Wed Jan 18 13:37:23 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Jan 2012 11:37:23 -0800
Subject: [Swift-user] Question re: reliance on proxy cert
In-Reply-To: <2145510777-1326912933-cardhu_decombobulator_blackberry.rim.net-495983841-@b15.c3.bise6.blackberry>
References: <1326912220.12093.7.camel@blabla>
<1645351678.154107.1326912596272.JavaMail.root@zimbra.anl.gov>
<2145510777-1326912933-cardhu_decombobulator_blackberry.rim.net-495983841-@b15.c3.bise6.blackberry>
Message-ID: <1326915443.13122.1.camel@blabla>
On Wed, 2012-01-18 at 18:55 +0000, jonmon at mcs.anl.gov wrote:
> Is the proxy on the execution site necessary? I thought it was just
> the client side. I don't think I create a proxy on the execution
> site...I only request one using my-proxy on the client side.
It's not if you use Globus to start the coaster service (or if you run
locally). That's because Globus gets a proxy on the remote site for you.
It is with any other providers because the client needs to be able to
tell that whatever coaster service is trying to connect to it is legit.
From turam at mcs.anl.gov Thu Jan 19 14:20:27 2012
From: turam at mcs.anl.gov (Thomas Uram)
Date: Thu, 19 Jan 2012 14:20:27 -0600
Subject: [Swift-user] GSISSH in Swift [was Fwd: [Swift-devel]
Documentation of sites.xml]
In-Reply-To: <1326483206.31692.2.camel@blabla>
References: <903303148.134563.1326323490348.JavaMail.root@zimbra.anl.gov>
<1326483206.31692.2.camel@blabla>
Message-ID: <980E744D-AA8E-4014-90B1-701A7D03F421@mcs.anl.gov>
Here's a sites.xml file that specifies gsissh as the executable to use for ssh. This did not work. In other words, the gsissh executable was not used in place of ssh. I've modified Swift to use the gsissh executable, until we can resolve how this should be done within the sites.xml file.
gsissh
/home/turam/tmp
Tom
On Jan 13, 2012, at 1:33 PM, Mihael Hategan wrote:
> On Fri, 2012-01-13 at 13:28 -0600, Thomas Uram wrote:
>> Hey Mihael:
>>
>>
>> I wouldn't prod you to respond after only two days, except that you
>> usually respond within minutes!
>
> Minutes it is.
>>
>> GSISSH provider availability will be important to me quite soon, so
>> I'm interested in your answer....
>
> If your gsissh happens to be named "ssh" and is in the path, it should
> just work. Otherwise you need to pass the "ssh" attribute to the
> provider with the name of the executable. In swift that would be
> gsissh.
>
> That's in theory at least. Let me know if it works in practice.
>
> Mihael
>
>
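The rule Mihael describes (use the "ssh" attribute if it is set, otherwise an executable literally named ssh on the PATH) can be modelled as below. This is a hypothetical Python illustration, not the CoG provider's code; resolve_ssh_executable and its attributes dictionary are made up for the example. Note that, per Mihael's follow-up later in this thread, attributes given as sites.xml profile entries were not propagated to the task that starts the coaster service.

```python
import shutil


def resolve_ssh_executable(attributes, search_path=None):
    """Pick the ssh-like client to run (hypothetical model of the rule above).

    attributes:  provider attributes, e.g. {"ssh": "gsissh"}
    search_path: directories to search; None means the current PATH
    """
    # Fall back to a plain "ssh" when no "ssh" attribute was supplied.
    name = attributes.get("ssh", "ssh")
    path = shutil.which(name, path=search_path)
    if path is None:
        raise FileNotFoundError("%r not found on PATH" % name)
    return path
```

For Tom's case, resolve_ssh_executable({"ssh": "gsissh"}) would return the full path of gsissh if it is on the PATH.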
From svemalayan at yahoo.com Thu Jan 19 17:51:15 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 19 Jan 2012 15:51:15 -0800 (PST)
Subject: [Swift-user] Montage+Swift+Coasters
Message-ID: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
Dear All,
I have a problem running Montage with Coasters (in our local cluster - no batch schedulers). After a few stages the swift run-time continuously prints the warnings below. Any ideas? Should I increase the heartbeat count?
Everything works fine when I try to run the same montage-scripts with swift on a single machine.
Thank you
Emalayan
2012-01-19 15:38:09,207-0800 WARN Command Command(119, HEARTBEAT): handling reply timeout; sendReqTime=120119-153609.206, sendTime=120119-153609.206, now=120119-153809.207
2012-01-19 15:38:09,207-0800 INFO Command Command(119, HEARTBEAT): re-sending
2012-01-19 15:38:09,209-0800 WARN Command Command(119, HEARTBEAT)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
    at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
    at java.util.TimerThread.mainLoop(Timer.java:534)
    at java.util.TimerThread.run(Timer.java:484)
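The sendReqTime and now fields in the first warning are about two minutes apart, i.e. the heartbeat reply was declared timed out roughly 120 s after it was sent. The Python snippet below does that arithmetic; the timestamp layout (yyMMdd-HHmmss.mmm) is inferred from these sample values, not taken from the CoG source.

```python
from datetime import datetime

# Layout of the sendReqTime/sendTime/now fields, inferred from the log above:
# two-digit year, month, day, '-', hour, minute, second, '.', milliseconds.
COASTER_TS = "%y%m%d-%H%M%S.%f"


def elapsed_seconds(start, end):
    """Seconds between two coaster log timestamps."""
    t0 = datetime.strptime(start, COASTER_TS)
    t1 = datetime.strptime(end, COASTER_TS)
    return (t1 - t0).total_seconds()


# sendReqTime and now from the HEARTBEAT warning above
print(elapsed_seconds("120119-153609.206", "120119-153809.207"))  # 120.001
```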
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From ketancmaheshwari at gmail.com Thu Jan 19 18:49:45 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Thu, 19 Jan 2012 18:49:45 -0600
Subject: [Swift-user] Montage+Swift+Coasters
In-Reply-To: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
References: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
Message-ID:
Emalayan,
From your symptoms, it seems you are facing the same issue as I've been.
Could you tell more about the amount of data that needs to be staged to run
the Montage stages during which these warnings turn up? How much time
elapses since the start of your workflow after which you see these messages?
Also, what version of Swift is this?
Regards,
Ketan
On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan <
svemalayan at yahoo.com> wrote:
> Dear All,
>
> I have a problem running Montage with Coasters (in our local cluster
> - no batch schedulers). After a few stages the swift run-time continuously
> prints the warnings below. Any ideas? Should I increase the heartbeat
> count?
>
> Everything works fine when I try to run the same montage-scripts with
> swift on a single machine.
>
> Thank you
> Emalayan
>
>
> 2012-01-19 15:38:09,207-0800 WARN Command Command(119, HEARTBEAT):
> handling reply timeout; sendReqTime=120119-153609.206,
> sendTime=120119-153609.206, now=120119-153809.207
> 2012-01-19 15:38:09,207-0800 INFO Command Command(119, HEARTBEAT):
> re-sending
> 2012-01-19 15:38:09,209-0800 WARN Command Command(119, HEARTBEAT)fault
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
> at
> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
> at java.util.TimerThread.mainLoop(Timer.java:534)
> at java.util.TimerThread.run(Timer.java:484)
>
>
--
Ketan
From svemalayan at yahoo.com Thu Jan 19 19:07:24 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 19 Jan 2012 17:07:24 -0800 (PST)
Subject: [Swift-user] Swift 0.93 + Coasters - Configuration issues ?
Message-ID: <1327021644.63695.YahooMailNeo@web39503.mail.mud.yahoo.com>
Dear All,
I tried to run a simple helloworld.swift application with coasters (with the setup below).
Two machines: Machine-1 (for coaster-service) and Machine-2 (for workers) respectively.
I started the coaster-service in Machine-1 and also started the helloworld.swift from Machine-1. I observed the following with two different swift versions.
With swift-0.92.1 - helloworld.swift was waiting until the worker on Machine-2 is started and then it returned the result.
With swift-0.93 - helloworld.swift immediately completed and provided the correct results even before starting the worker.
I suspect there might be some configuration issues / bug with swift-0.93 (may be in site catalog).
Any ideas / suggestions ?
Please kindly let me know if you have questions.
Regards
Emalayan
============================ Please find my site.catalog below =================================
    passive
    4
    10000
    100
    100
    100
    1
    10
    25.00
    10000
    proxy
    /home/emalayan/App/forEmalayan/swift.workdir
====================================================================================
From svemalayan at yahoo.com Thu Jan 19 19:09:22 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 19 Jan 2012 17:09:22 -0800 (PST)
Subject: [Swift-user] Swift 0.93 + Coasters - Configuration issues ?
In-Reply-To: <1327021644.63695.YahooMailNeo@web39503.mail.mud.yahoo.com>
References: <1327021644.63695.YahooMailNeo@web39503.mail.mud.yahoo.com>
Message-ID: <1327021762.47671.YahooMailNeo@web39504.mail.mud.yahoo.com>
I used the same site and tc catalogs with both swift versions.
Thank you
Emalayan
________________________________
From: Emalayan Vairavanathan
To: swift user
Sent: Thursday, 19 January 2012 5:07 PM
Subject: [Swift-user] Swift 0.93 + Coasters - Configuration issues ?
Dear All,
I tried to run a simple helloworld.swift application with coasters (with the setup below).
Two machines: Machine-1 (for coaster-service) and Machine-2 (for workers) respectively.
I started the coaster-service in Machine-1 and also started the helloworld.swift from Machine-1. I observed the following with two different swift versions.
With swift-0.92.1 - helloworld.swift was waiting until the worker on Machine-2 is started and then it returned the result.
With swift-0.93 - helloworld.swift immediately completed and provided the correct results even before starting the worker.
I suspect there might be some configuration issues / bug with swift-0.93 (may be in site catalog).
Any ideas / suggestions ?
Please kindly let me know if you have questions.
Regards
Emalayan
============================ Please find my site.catalog below =================================
    passive
    4
    10000
    100
    100
    100
    1
    10
    25.00
    10000
    proxy
    /home/emalayan/App/forEmalayan/swift.workdir
====================================================================================
From ketancmaheshwari at gmail.com Thu Jan 19 19:10:56 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Thu, 19 Jan 2012 19:10:56 -0600
Subject: [Swift-user] Swift 0.93 + Coasters - Configuration issues ?
In-Reply-To: <1327021644.63695.YahooMailNeo@web39503.mail.mud.yahoo.com>
References: <1327021644.63695.YahooMailNeo@web39503.mail.mud.yahoo.com>
Message-ID:
On Thu, Jan 19, 2012 at 7:07 PM, Emalayan Vairavanathan <
svemalayan at yahoo.com> wrote:
> Dear All,
>
> I tried to run a simple helloworld.swift application with coasters (with
> the setup below).
>
> Two machines: Machine-1 (for coaster-service) and Machine-2 (for workers)
> respectively.
>
> I started the coaster-service in Machine-1 and also started the
> helloworld.swift from Machine-1. I observed the following with two
> different swift versions.
>
> With swift-0.92.1 - helloworld.swift was waiting until the worker on
> Machine-2 is started and then it returned the result.
> With swift-0.93 - helloworld.swift immediately completed and provided
> the correct results even before starting the worker.
>
This is because the coaster-service in 0.93 onwards is configured to launch
workers automatically, whereas in 0.92.1 workers need to be started
manually.
>
> I suspect there might be some configuration issues / bug with swift-0.93
> (may be in site catalog).
>
> Any ideas / suggestions ?
>
> Please kindly let me know if you have questions.
>
> Regards
> Emalayan
>
> ============================ Please find my site.catalog below
> =================================
>
>
> jobmanager="local:local"/>
> passive
>
> 4
> 10000
> 100
> 100
> 100
> 1
> 10
> 25.00
> 10000
> proxy
>
>
> /home/emalayan/App/forEmalayan/swift.workdir
>
>
>
>
> ====================================================================================
>
>
>
>
--
Ketan
From svemalayan at yahoo.com Thu Jan 19 19:14:42 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 19 Jan 2012 17:14:42 -0800 (PST)
Subject: [Swift-user] Swift 0.93 + Coasters - Configuration issues ?
In-Reply-To:
References: <1327021644.63695.YahooMailNeo@web39503.mail.mud.yahoo.com>
Message-ID: <1327022082.94703.YahooMailNeo@web39506.mail.mud.yahoo.com>
Thank you, Ketan. By the way, how does the coaster-service find the IP addresses of the worker nodes?
________________________________
From: Ketan Maheshwari
To: Emalayan Vairavanathan
Cc: swift user
Sent: Thursday, 19 January 2012 5:10 PM
Subject: Re: [Swift-user] Swift 0.93 + Coasters - Configuration issues ?
On Thu, Jan 19, 2012 at 7:07 PM, Emalayan Vairavanathan wrote:
Dear All,
>
>
>I tried to run a simple helloworld.swift application with coasters (with the setup below).
>
>
>Two machines: Machine-1 (for coaster-service) and Machine-2 (for workers) respectively.
>
>
>
>I started the coaster-service in Machine-1 and also started the helloworld.swift from Machine-1. I observed the following with two different swift versions.
>
>
>With swift-0.92.1 - helloworld.swift was waiting until the worker on Machine-2 is started and then it returned the result.
>With swift-0.93 - helloworld.swift immediately completed and provided the correct results even before starting the worker.
This is because the coaster-service in 0.93 onwards is configured to launch workers automatically, whereas in 0.92.1 workers need to be started manually.
>
>I suspect there might be some configuration issues / bug with swift-0.93 (may be in site catalog).
>
>
>Any ideas / suggestions ?
>
>
>Please kindly let me know if you have questions.
>
>
>
>Regards
>Emalayan
>
>
>============================ Please find my site.catalog below =================================
>
>    passive
>    4
>    10000
>    100
>    100
>    100
>    1
>    10
>    25.00
>    10000
>    proxy
>    /home/emalayan/App/forEmalayan/swift.workdir
>
>====================================================================================
>
>
>
>
--
Ketan
From ketancmaheshwari at gmail.com Thu Jan 19 19:16:51 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Thu, 19 Jan 2012 19:16:51 -0600
Subject: [Swift-user] Swift 0.93 + Coasters - Configuration issues ?
In-Reply-To: <1327022082.94703.YahooMailNeo@web39506.mail.mud.yahoo.com>
References: <1327021644.63695.YahooMailNeo@web39503.mail.mud.yahoo.com>
<1327022082.94703.YahooMailNeo@web39506.mail.mud.yahoo.com>
Message-ID:
Workers start only on the localhost.
On Thu, Jan 19, 2012 at 7:14 PM, Emalayan Vairavanathan <
svemalayan at yahoo.com> wrote:
> Thank you, Ketan. By the way, how does the coaster-service find the IP
> addresses of the worker nodes?
>
> ------------------------------
> From: Ketan Maheshwari
> To: Emalayan Vairavanathan
> Cc: swift user
> Sent: Thursday, 19 January 2012 5:10 PM
> Subject: Re: [Swift-user] Swift 0.93 + Coasters - Configuration issues ?
>
>
>
> On Thu, Jan 19, 2012 at 7:07 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Dear All,
>
> I tried to run a simple helloworld.swift application with coasters (with
> the setup below).
>
> Two machines: Machine-1 (for coaster-service) and Machine-2 (for workers)
> respectively.
>
> I started the coaster-service in Machine-1 and also started the
> helloworld.swift from Machine-1. I observed the following with two
> different swift versions.
>
> With swift-0.92.1 - helloworld.swift was waiting until the worker on
> Machine-2 is started and then it returned the result.
> With swift-0.93 - helloworld.swift immediately completed and provided
> the correct results even before starting the worker.
>
>
> This is because the coaster-service in 0.93 onwards is configured to launch
> workers automatically, whereas in 0.92.1 workers need to be started
> manually.
>
>
>
> I suspect there might be some configuration issues / bug with swift-0.93
> (may be in site catalog).
>
> Any ideas / suggestions ?
>
> Please kindly let me know if you have questions.
>
> Regards
> Emalayan
>
> ============================ Please find my site.catalog below
> =================================
>
>
> jobmanager="local:local"/>
> passive
>
> 4
> 10000
> 100
> 100
> 100
> 1
> 10
> 25.00
> 10000
> proxy
>
>
> /home/emalayan/App/forEmalayan/swift.workdir
>
>
>
>
> ====================================================================================
>
>
>
>
>
>
>
> --
> Ketan
>
>
>
>
>
--
Ketan
From svemalayan at yahoo.com Thu Jan 19 19:55:20 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Thu, 19 Jan 2012 17:55:20 -0800 (PST)
Subject: [Swift-user] Montage+Swift+Coasters
In-Reply-To:
References: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
Message-ID: <1327024520.62801.YahooMailNeo@web39505.mail.mud.yahoo.com>
Hi Ketan,
This was with swift-0.92.1. Now I have downloaded the latest swift 0.93 and am getting totally different error messages with it. I can ask Jon about these messages. (These scripts were working well with Swift alone.)
Please let me know if you have any idea.
Regards
Emalayan
===============================================================================================
Swift 0.93 swift-r5501 cog-r3350
RunID: 20120119-1749-rjshh1r9
(input): found 10 files
Progress:  time: Thu, 19 Jan 2012 17:49:20 -0800
Find: http://localhost:1984
Find:  keepalive(120), reconnect - http://localhost:1984
Progress:  time: Thu, 19 Jan 2012 17:49:22 -0800  Stage in:1  Submitted:9
Progress:  time: Thu, 19 Jan 2012 17:49:25 -0800  Active:9  Stage out:1
Progress:  time: Thu, 19 Jan 2012 17:49:26 -0800  Stage out:3  Finished successfully:7
Progress:  time: Thu, 19 Jan 2012 17:49:28 -0800  Active:1  Finished successfully:10
Progress:  time: Thu, 19 Jan 2012 17:49:29 -0800  Stage in:1  Submitting:11  Submitted:6  Finished successfully:12
Progress:  time: Thu, 19 Jan 2012 17:49:30 -0800  Stage in:4  Submitted:1  Active:6  Stage out:2  Finished successfully:17
Progress:  time: Thu, 19 Jan 2012 17:49:31 -0800  Active:1  Finished successfully:30
Exception in mConcatFit:
Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5, fits.tbl, stat_dir]
Host: localhost
Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
- - -
Caused by: null
Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
Execution failed:
    back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
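When reading Progress lines like the ones above, it can help to pull the per-state counts out mechanically, e.g. to confirm the run had reached Finished successfully:30 just before mConcatFit failed. A small Python sketch; the label pattern (a capitalised word, optionally followed by one lowercase word, then :count) is inferred from these sample lines and may not cover every state Swift prints.

```python
import re

# "Label:count" pairs such as "Active:9", "Stage in:1", "Finished successfully:30".
# The timestamp part never matches: its colons are followed by spaces or sit
# between digits, never directly after a capitalised word.
_COUNT = re.compile(r"([A-Z][a-z]+(?: [a-z]+)?):(\d+)")


def parse_progress(line):
    """Return {state: count} for one Swift Progress line."""
    return {label: int(n) for label, n in _COUNT.findall(line)}
```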
________________________________
From: Ketan Maheshwari
To: Emalayan Vairavanathan
Cc: swift user
Sent: Thursday, 19 January 2012 4:49 PM
Subject: Re: [Swift-user] Montage+Swift+Coasters
Emalayan,
From your symptoms, it seems you are facing the same issue as I've been. Could you tell more about the amount of data that needs to be staged to run the Montage stages during which these warnings turn up? How much time elapses since the start of your workflow after which you see these messages?
Also, what version of Swift is this?
Regards,
Ketan
On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan wrote:
Dear All,
>
>
>I have a problem running Montage with Coasters (in our local cluster - no batch schedulers). After a few stages the swift run-time continuously prints the warnings below. Any ideas? Should I increase the heartbeat count?
>
>
>
>Everything works fine when I try to run the same montage-scripts with swift on a single machine.
>
>
>
>Thank you
>Emalayan
>
>
>
>
>
>2012-01-19 15:38:09,207-0800 WARN Command Command(119, HEARTBEAT): handling reply timeout; sendReqTime=120119-153609.206, sendTime=120119-153609.206, now=120119-153809.207
>2012-01-19 15:38:09,207-0800 INFO Command Command(119, HEARTBEAT): re-sending
>2012-01-19 15:38:09,209-0800 WARN Command Command(119, HEARTBEAT)fault was: Reply timeout
>org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>    at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>    at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>    at java.util.TimerThread.mainLoop(Timer.java:534)
>    at java.util.TimerThread.run(Timer.java:484)
>
--
Ketan
From ketancmaheshwari at gmail.com Fri Jan 20 09:39:34 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Fri, 20 Jan 2012 09:39:34 -0600
Subject: [Swift-user] Montage+Swift+Coasters
In-Reply-To: <1327024520.62801.YahooMailNeo@web39505.mail.mud.yahoo.com>
References: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
<1327024520.62801.YahooMailNeo@web39505.mail.mud.yahoo.com>
Message-ID:
Emalayan,
I would check all the mappers and the resulting paths in the Swift source.
Also try running the failed job by hand, something like this:

    cd /SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
    mConcatFit _concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5 fits.tbl stat_dir

Exit code 520 indicates that the workers are not able to reach the data.
Also check whether swift.workdir on the site is writable by the worker nodes.
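One way to do that last check is to run a tiny probe on a worker node against the site's work directory (/home/emalayan/App/forEmalayan/swift.workdir in the site catalog earlier in this thread). A minimal Python sketch, not a Swift tool; workdir_writable is a hypothetical helper. It performs a real create/delete instead of trusting os.access(), since the Python docs warn that access() can report success on network filesystems where the actual I/O then fails.

```python
import tempfile


def workdir_writable(path):
    """Try to create (and auto-delete) a scratch file in `path`.

    Returns False if the directory is missing or the write is refused.
    """
    try:
        # The file is removed again when the with-block closes it.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False
```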
On Thu, Jan 19, 2012 at 7:55 PM, Emalayan Vairavanathan <
svemalayan at yahoo.com> wrote:
> Hi Ketan,
>
> This was with swift-0.92.1. Now I have downloaded the latest swift 0.93
> and am getting totally different error messages with it. I can ask
> Jon about these messages. (These scripts were working well with Swift alone.)
>
> Please let me know if you have any idea.
>
> Regards
> Emalayan
>
>
> ===============================================================================================
> Swift 0.93 swift-r5501 cog-r3350
>
> RunID: 20120119-1749-rjshh1r9
> (input): found 10 files
> Progress: time: Thu, 19 Jan 2012 17:49:20 -0800
> Find: http://localhost:1984
> Find: keepalive(120), reconnect - http://localhost:1984
> Progress: time: Thu, 19 Jan 2012 17:49:22 -0800 Stage in:1 Submitted:9
> Progress: time: Thu, 19 Jan 2012 17:49:25 -0800 Active:9 Stage out:1
> Progress: time: Thu, 19 Jan 2012 17:49:26 -0800 Stage out:3 Finished
> successfully:7
> Progress: time: Thu, 19 Jan 2012 17:49:28 -0800 Active:1 Finished
> successfully:10
> Progress: time: Thu, 19 Jan 2012 17:49:29 -0800 Stage in:1
> Submitting:11 Submitted:6 Finished successfully:12
> Progress: time: Thu, 19 Jan 2012 17:49:30 -0800 Stage in:4 Submitted:1
> Active:6 Stage out:2 Finished successfully:17
> Progress: time: Thu, 19 Jan 2012 17:49:31 -0800 Active:1 Finished
> successfully:30
> Exception in mConcatFit:
> Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5,
> fits.tbl, stat_dir]
> Host: localhost
> Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
> - - -
>
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException:
> Job failed with an exit code of 520
> Execution failed:
> back_list:Table = org.griphyn.vdl.mapping.DataDependentException -
> Closed not derived due to errors in data dependencies
>
> ------------------------------
> From: Ketan Maheshwari
> To: Emalayan Vairavanathan
> Cc: swift user
> Sent: Thursday, 19 January 2012 4:49 PM
> Subject: Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>
> From your symptoms, it seems you are facing the same issue as I've been.
> Could you tell more about the amount of data that needs to be staged to run
> the Montage stages during which these warnings turn up? How much time
> elapses since the start of your workflow after which you see these messages?
>
> Also, what version of Swift is this?
>
> Regards,
> Ketan
>
> On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Dear All,
>
> I have a problem running Montage with Coasters (in our local cluster
> - no batch schedulers). After a few stages the swift run-time continuously
> prints the warnings below. Any ideas? Should I increase the heartbeat
> count?
>
> Everything works fine when I try to run the same montage-scripts with
> swift on a single machine.
>
> Thank you
> Emalayan
>
>
> 2012-01-19 15:38:09,207-0800 WARN Command Command(119, HEARTBEAT):
> handling reply timeout; sendReqTime=120119-153609.206,
> sendTime=120119-153609.206, now=120119-153809.207
> 2012-01-19 15:38:09,207-0800 INFO Command Command(119, HEARTBEAT):
> re-sending
> 2012-01-19 15:38:09,209-0800 WARN Command Command(119, HEARTBEAT)fault
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
> at
> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
> at java.util.TimerThread.mainLoop(Timer.java:534)
> at java.util.TimerThread.run(Timer.java:484)
>
>
>
>
>
> --
> Ketan
>
>
>
>
>
--
Ketan
From benc at hawaga.org.uk Fri Jan 20 15:52:56 2012
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 20 Jan 2012 22:52:56 +0100
Subject: [Swift-user] Question re: reliance on proxy cert
In-Reply-To: <1326912220.12093.7.camel@blabla>
References: <83743945-35DD-4AB2-A934-E97C5744916B@mcs.anl.gov>
<1326912220.12093.7.camel@blabla>
Message-ID: <3E8936F5-C4B5-4FBF-9AAA-F5529232E331@hawaga.org.uk>
in the ssh case, you should have a secure standard in/standard out over which you can send securely and so do either something like a gsi delegation or a shared secret transmission or whatever.
that doesn't apply to arbitrary cog providers though, I think.
so maybe it's yet another growth of the configuration option space...?
On Jan 18, 2012, at 7:43 PM, Mihael Hategan wrote:
> On Wed, 2012-01-18 at 12:33 -0600, Thomas Uram wrote:
>> I'm using coasters with ssh:pbs and have the proper bits in
>> ~/.ssh/auth.defaults to support authentication, but when I run the
>> script it fails due to a missing or expired proxy cert:
>
> [...]
>
>> Why does it fail when an alternative authentication mechanism is
>> available that would succeed? Is there an option to control this?
>
> It fails because while ssh is used to start the coaster service
> executable, the connection between client and service is secured by
> GSI.
>
> This model was just peachy in the Globus scenario, where you would need
> a proxy anyway to start the service and delegation could be used to
> supply credentials to the coaster service.
>
> Not so much with ssh. I've been thinking about a way to deal with this,
> and I think I'm leaning towards some shared secret that could be used as
> a one-time authentication token by the service. The problem is making
> sure that whatever provider is used to communicate said secret to the
> service remains a secret (i.e. passing it on any command line is out of
> the question). But that seems to require the use of an encrypted file
> transfer provider, which breaks the abstraction we have a bit, so it
> might require more changes than I want to see.
>
> So suggestions are welcome.
>
> Mihael
>
>
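Ben's point that the ssh session already provides a private stdin/stdout suggests delivering a secret on standard input instead of on a command line. A minimal Python sketch of the sending side; send_secret is a hypothetical helper, and the host name and "secretfile" in the comment are placeholders. The umask 077 in the remote command is this sketch's choice (not part of the thread): it makes a newly created file owner-only from the start, but does not change the permissions of a file that already exists.

```python
import subprocess


def send_secret(command, secret):
    """Run `command` and deliver `secret` on its standard input.

    Unlike an argument (echo "secret" > secretfile), data written to stdin
    does not appear in ps output. Over ssh, stdin travels inside the
    encrypted ssh connection.
    """
    subprocess.run(command, input=secret.encode(), check=True)


# Hypothetical remote use (placeholders, not a tested Swift/CoG invocation):
# send_secret(["ssh", "host.example.org", "umask 077 && cat > secretfile"], "s3cret")
```

Only the remote command string ("umask 077 && cat > secretfile") is visible in ps; the secret itself is not.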
From hategan at mcs.anl.gov Fri Jan 20 16:48:25 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 20 Jan 2012 14:48:25 -0800
Subject: [Swift-user] Question re: reliance on proxy cert
In-Reply-To: <3E8936F5-C4B5-4FBF-9AAA-F5529232E331@hawaga.org.uk>
References: <83743945-35DD-4AB2-A934-E97C5744916B@mcs.anl.gov>
<1326912220.12093.7.camel@blabla>
<3E8936F5-C4B5-4FBF-9AAA-F5529232E331@hawaga.org.uk>
Message-ID: <1327099705.7785.6.camel@blabla>
On Fri, 2012-01-20 at 22:52 +0100, Ben Clifford wrote:
> in the ssh case, you should have a secure standard in/standard out
> over which you can send securely and so do either something like a gsi
> delegation or a shared secret transmission or whatever.
Right. Though there's some care to be taken there. echo "secret" >
secretfile is something that can be seen in ps. Can you think of
anything that could go wrong with cat > secretfile?
>
> that doesn't apply to arbitrary cog providers though, I think.
Right. And in the shared secret case, there would have to be an
additional security mechanism (e.g. some key exchange + symmetric
encryption without host certificate checks).
>
> so maybe it's yet another growth of the configuration option space...?
Right. That's another reason that gives me a bit of pause here. But too
much pause isn't good either.
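The one-time authentication token Mihael floats (quoted earlier in the thread) could look roughly like this on the service side: a random value generated by the client, shipped privately to the service, and accepted exactly once. A hypothetical Python sketch, not CoG code; it deliberately leaves out how the token is transported, which is the hard part discussed above.

```python
import hmac
import secrets


class OneTimeToken:
    """Random token that the service accepts at most once (hypothetical)."""

    def __init__(self):
        # 32 random bytes, hex-encoded, from the OS CSPRNG.
        self.value = secrets.token_hex(32)
        self._used = False

    def accept(self, presented):
        """Check `presented`; on the first successful match, burn the token."""
        if self._used:
            return False
        # Constant-time comparison, so timing does not leak matching prefixes.
        ok = hmac.compare_digest(self.value, presented)
        if ok:
            self._used = True
        return ok
```

Burning the token on first use means a value that leaks after the coaster service has connected is no longer useful.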
From benc at hawaga.org.uk Sat Jan 21 03:56:10 2012
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 21 Jan 2012 10:56:10 +0100
Subject: [Swift-user] Question re: reliance on proxy cert
In-Reply-To: <1327099705.7785.6.camel@blabla>
References: <83743945-35DD-4AB2-A934-E97C5744916B@mcs.anl.gov>
<1326912220.12093.7.camel@blabla>
<3E8936F5-C4B5-4FBF-9AAA-F5529232E331@hawaga.org.uk>
<1327099705.7785.6.camel@blabla>
Message-ID:
On Jan 20, 2012, at 11:48 PM, Mihael Hategan wrote:
> On Fri, 2012-01-20 at 22:52 +0100, Ben Clifford wrote:
>> in the ssh case, you should have a secure standard in/standard out
>> over which you can send securely and so do either something like a gsi
>> delegation or a shared secret transmission or whatever.
>
> Right. Though there's some care to be taken there. echo "secret" >
> secretfile is something that can be seen in ps. Can you think of
> anything that could go wrong with cat > secretfile?
secretfile exists on the remote filesystem in a way that is possibly publicly visible; touch secretfile; chmod go-rwx secretfile; cat >> secretfile might be better. or you could feed it into the program that wants the secret directly and forget the filesystem entirely.
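A minimal Python sketch of the same idea, assuming the secret arrives on the remote side's standard input over the ssh channel. write_secret is a hypothetical helper, not part of Swift or CoG. The file is created with owner-only permissions at creation time, so there is no window in which it is world-readable, and the secret never appears on a command line where ps could see it:

```python
import os


def write_secret(path: str, secret: bytes) -> None:
    """Write secret to path so that only the owner can read or write it."""
    # O_CREAT | O_EXCL fails if path already exists (symlinks included), so a
    # file pre-created by someone else with looser permissions is never reused.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    # The umask may strip bits from the mode given to os.open; set 0600 explicitly.
    os.fchmod(fd, 0o600)
    # fdopen takes ownership of fd and closes it when the block exits.
    with os.fdopen(fd, "wb") as f:
        f.write(secret)
```

On the remote end it could be fed straight from the ssh session, e.g. write_secret(path, sys.stdin.buffer.read()), so the secret only ever travels over the encrypted channel.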
> Right. And in the shared secret case, there would have to be an
> additional security mechanism (e.g. some key exchange + symmetric
> encryption without host certificate checks).
You get a bunch of that from ssh already. There's probably more elaborate stuff that can be done, e.g. rather than having a shared secret at all, use the ssh channel to exchange two public keys, one from each end; or make sure that the wire protocol never sends the whole secret over the out-of-ssh channel, just some proof that it knows it.
Depends how crazy you want to go on security...
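For the last option (sending only proof of knowledge), a minimal Python sketch of an HMAC challenge-response, assuming both ends already hold a shared secret delivered over the ssh channel. HMAC-SHA256 is one standard technique swapped in for illustration, not something Swift or CoG is known to implement, and the function names are hypothetical:

```python
import hashlib
import hmac
import secrets


def make_challenge() -> bytes:
    """Fresh random nonce; reusing one would let old responses be replayed."""
    return secrets.token_bytes(32)


def respond(shared_secret: bytes, challenge: bytes) -> bytes:
    """Prove knowledge of the secret without sending the secret itself."""
    return hmac.new(shared_secret, challenge, hashlib.sha256).digest()


def verify(shared_secret: bytes, challenge: bytes, response: bytes) -> bool:
    """Constant-time comparison avoids leaking how many bytes matched."""
    return hmac.compare_digest(respond(shared_secret, challenge), response)


# Example round trip over the untrusted (out-of-ssh) channel.
secret = secrets.token_bytes(32)     # previously exchanged over ssh
challenge = make_challenge()         # service -> worker
answer = respond(secret, challenge)  # worker -> service
assert verify(secret, challenge, answer)
assert not verify(b"wrong secret", challenge, answer)
```

This authenticates one side only; running the same exchange in the other direction would make it mutual. It does not encrypt the data channel, which is the separate concern raised in the quoted text.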
--
From hategan at mcs.anl.gov Sat Jan 21 20:04:47 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 21 Jan 2012 18:04:47 -0800
Subject: [Swift-user] GSISSH in Swift [was Fwd: [Swift-devel]
Documentation of sites.xml]
In-Reply-To: <980E744D-AA8E-4014-90B1-701A7D03F421@mcs.anl.gov>
References: <903303148.134563.1326323490348.JavaMail.root@zimbra.anl.gov>
<1326483206.31692.2.camel@blabla>
<980E744D-AA8E-4014-90B1-701A7D03F421@mcs.anl.gov>
Message-ID: <1327197887.7405.2.camel@blabla>
My bad. Those profile entries don't get propagated to the task that
starts the coaster service.
Change of plans. The latest trunk adds an etc/provider-sshcl.properties
file, the contents of which are self-explanatory.
Mihael
On Thu, 2012-01-19 at 14:20 -0600, Thomas Uram wrote:
> Here's a sites.xml file that specifies gsissh as the executable to use for ssh. This did not work. In other words, the gsissh executable was not used in place of ssh. I've modified Swift to use the gsissh executable, until we can resolve how this should be done within the sites.xml file.
>
>
>
>
> gsissh
>
> /home/turam/tmp
>
>
>
>
> Tom
>
>
>
> On Jan 13, 2012, at 1:33 PM, Mihael Hategan wrote:
>
> > On Fri, 2012-01-13 at 13:28 -0600, Thomas Uram wrote:
> >> Hey Mihael:
> >>
> >>
> >> I wouldn't prod you to respond after only two days, except that you
> >> usually respond within minutes!
> >
> > Minutes it is.
> >>
> >> GSISSH provider availability will be important to me quite soon, so
> >> I'm interested in your answer....
> >
> > If your gsissh happens to be named "ssh" and is in the path, it should
> > just work. Otherwise you need to pass the "ssh" attribute to the
> > provider with the name of the executable. In swift that would be
> > gsissh.
> >
> > That's in theory at least. Let me know if it works in practice.
> >
> > Mihael
> >
> >
>
From jonmon at mcs.anl.gov Sat Jan 21 20:20:26 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Sat, 21 Jan 2012 20:20:26 -0600
Subject: [Swift-user] GSISSH in Swift [was Fwd: [Swift-devel]
Documentation of sites.xml]
In-Reply-To: <1327197887.7405.2.camel@blabla>
References: <903303148.134563.1326323490348.JavaMail.root@zimbra.anl.gov>
<1326483206.31692.2.camel@blabla>
<980E744D-AA8E-4014-90B1-701A7D03F421@mcs.anl.gov>
<1327197887.7405.2.camel@blabla>
Message-ID: <33E27556-DD8E-407A-9610-94F8DA1AAD70@mcs.anl.gov>
So we use the etc/provider-sshcl.properties file instead of an .ssh/config file? I ask because it looks like the code that allows you to specify a username and private key file in the sites file is still there. I know from past testing that at least the username specified in the sites file was not being honored.
On Jan 21, 2012, at 8:04 PM, Mihael Hategan wrote:
> My bad. Those profile entries don't get propagated to the task that
> starts the coaster service.
>
> Change of plans. The latest trunk adds an etc/provider-sshcl.properties
> file, the contents of which are self-explanatory.
>
> Mihael
>
> On Thu, 2012-01-19 at 14:20 -0600, Thomas Uram wrote:
>> Here's a sites.xml file that specifies gsissh as the executable to use for ssh. This did not work. In other words, the gsissh executable was not used in place of ssh. I've modified Swift to use the gsissh executable, until we can resolve how this should be done within the sites.xml file.
>>
>>
>>
>>
>> gsissh
>>
>> /home/turam/tmp
>>
>>
>>
>>
>> Tom
>>
>>
>>
>> On Jan 13, 2012, at 1:33 PM, Mihael Hategan wrote:
>>
>>> On Fri, 2012-01-13 at 13:28 -0600, Thomas Uram wrote:
>>>> Hey Mihael:
>>>>
>>>>
>>>> I wouldn't prod you to respond after only two days, except that you
>>>> usually respond within minutes!
>>>
>>> Minutes it is.
>>>>
>>>> GSISSH provider availability will be important to me quite soon, so
>>>> I'm interested in your answer....
>>>
>>> If your gsissh happens to be named "ssh" and is in the path, it should
>>> just work. Otherwise you need to pass the "ssh" attribute to the
>>> provider with the name of the executable. In swift that would be
>>> gsissh.
>>>
>>> That's in theory at least. Let me know if it works in practice.
>>>
>>> Mihael
>>>
>>>
>>
>
>
From jonmon at mcs.anl.gov Mon Jan 23 13:08:29 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 23 Jan 2012 13:08:29 -0600
Subject: [Swift-user] Montage+Swift+Coasters
In-Reply-To:
References: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
<1327024520.62801.YahooMailNeo@web39505.mail.mud.yahoo.com>
Message-ID: <4D008EF1-8072-4A18-8A5A-18D74A2ACDA2@mcs.anl.gov>
Emalayan,
So I have run the scripts with some of my own test cases and do not see them failing. Could you provide your config files? Please provide the tc, sites, and config file (if you use a config file).
On Jan 20, 2012, at 9:39 AM, Ketan Maheshwari wrote:
> Emalayan,
>
> I would check all the mappers and the resulting paths in the Swift source.
>
> Also try running the failed job something like this:
>
> cd /SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>
> mConcatFit _concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5 fits.tbl stat_dir
>
> error 520 indicates workers are not able to reach the data.
>
> Also check if swift.workdir is writable on the site by the worker nodes.
>
> On Thu, Jan 19, 2012 at 7:55 PM, Emalayan Vairavanathan wrote:
> Hi Ketan,
>
> This was with swift-0.92.1. I have now downloaded the latest Swift 0.93 and am getting totally different error messages with it. I can ask Jon about these messages. (These scripts were working well with Swift alone.)
>
> Please let me know if you have any idea.
>
> Regards
> Emalayan
>
> ===============================================================================================
> Swift 0.93 swift-r5501 cog-r3350
>
> RunID: 20120119-1749-rjshh1r9
> (input): found 10 files
> Progress: time: Thu, 19 Jan 2012 17:49:20 -0800
> Find: http://localhost:1984
> Find: keepalive(120), reconnect - http://localhost:1984
> Progress: time: Thu, 19 Jan 2012 17:49:22 -0800 Stage in:1 Submitted:9
> Progress: time: Thu, 19 Jan 2012 17:49:25 -0800 Active:9 Stage out:1
> Progress: time: Thu, 19 Jan 2012 17:49:26 -0800 Stage out:3 Finished successfully:7
> Progress: time: Thu, 19 Jan 2012 17:49:28 -0800 Active:1 Finished successfully:10
> Progress: time: Thu, 19 Jan 2012 17:49:29 -0800 Stage in:1 Submitting:11 Submitted:6 Finished successfully:12
> Progress: time: Thu, 19 Jan 2012 17:49:30 -0800 Stage in:4 Submitted:1 Active:6 Stage out:2 Finished successfully:17
> Progress: time: Thu, 19 Jan 2012 17:49:31 -0800 Active:1 Finished successfully:30
> Exception in mConcatFit:
> Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5, fits.tbl, stat_dir]
> Host: localhost
> Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
> - - -
>
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
> Execution failed:
> back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
>
> From: Ketan Maheshwari
> To: Emalayan Vairavanathan
> Cc: swift user
> Sent: Thursday, 19 January 2012 4:49 PM
> Subject: Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>
> From your symptoms, it seems you are facing the same issue as I've been. Could you tell more about the amount of data that needs to be staged to run the Montage stages during which these warnings turn up? How much time elapses since the start of your workflow after which you see these messages?
>
> Also, what version of Swift is this?
>
> Regards,
> Ketan
>
> On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan wrote:
> Dear All,
>
> I have a problem running Montage with Coasters (in our local cluster - no batch schedulers). After a few stages the Swift runtime continuously prints the warnings below. Any ideas? Should I increase the heartbeat count?
>
> Everything works fine when I try to run the same montage-scripts with swift on a single machine.
>
> Thank you
> Emalayan
>
>
> 2012-01-19 15:38:09,207-0800 WARN Command Command(119, HEARTBEAT): handling reply timeout; sendReqTime=120119-153609.206, sendTime=120119-153609.206, now=120119-153809.207
> 2012-01-19 15:38:09,207-0800 INFO Command Command(119, HEARTBEAT): re-sending
> 2012-01-19 15:38:09,209-0800 WARN Command Command(119, HEARTBEAT)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
> at java.util.TimerThread.mainLoop(Timer.java:534)
> at java.util.TimerThread.run(Timer.java:484)
>
>
>
>
> --
> Ketan
>
>
>
>
>
>
>
> --
> Ketan
>
>
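The timestamps in the HEARTBEAT warning quoted above can be checked with a little arithmetic. A minimal Python sketch, assuming the sendTime and now fields use a yyMMdd-HHmmss.SSS layout (inferred from the values shown, not taken from the CoG documentation); parse_cog_time is a hypothetical helper name:

```python
from datetime import datetime


def parse_cog_time(stamp: str) -> datetime:
    """Parse a stamp like '120119-153609.206' (assumed yyMMdd-HHmmss.SSS)."""
    # %f accepts 1-6 digits and pads on the right, so '206' means 0.206 s.
    return datetime.strptime(stamp, "%y%m%d-%H%M%S.%f")


send_time = parse_cog_time("120119-153609.206")
now = parse_cog_time("120119-153809.207")
waited = (now - send_time).total_seconds()
print(f"waited {waited:.3f} s for the HEARTBEAT reply")
# prints: waited 120.001 s for the HEARTBEAT reply
```

So the reply was awaited for almost exactly two minutes before the timeout fired. That suggests the other end stopped answering altogether rather than replying marginally late, though it is an inference from three log lines, not a diagnosis.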
From svemalayan at yahoo.com Mon Jan 23 13:25:56 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Mon, 23 Jan 2012 11:25:56 -0800 (PST)
Subject: [Swift-user] Montage+Swift+Coasters
In-Reply-To: <4D008EF1-8072-4A18-8A5A-18D74A2ACDA2@mcs.anl.gov>
References: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
<1327024520.62801.YahooMailNeo@web39505.mail.mud.yahoo.com>
<4D008EF1-8072-4A18-8A5A-18D74A2ACDA2@mcs.anl.gov>
Message-ID: <1327346756.49322.YahooMailNeo@web39506.mail.mud.yahoo.com>
Jon,
Please find the detail below and let me know if you have any questions about my setup.
Thank you
Emalayan
==========================================================
site.xml
    passive
    4
    100000
    100
    100
    100
    1
    10
    25.00
    10000
    proxy

    /tmp/swift.workdir
=======================================================
tc
localhost sh /bin/sh null null null
localhost cat /bin/cat null null null
localhost echo /bin/echo null null null
localhost do_merge /home/emalayan/App/forEmalayan/app/modmerge null null null
localhost mProjExec /home/emalayan/App/Montage_v3.3/bin/mProjExec null null null
localhost mImgtbl /home/emalayan/App/Montage_v3.3/bin/mImgtbl null null null
localhost mAdd /home/emalayan/App/Montage_v3.3/bin/mAdd null null null
localhost mOverlaps /home/emalayan/App/Montage_v3.3/bin/mOverlaps null null null
localhost mJPEG /home/emalayan/App/Montage_v3.3/bin/mJPEG null null null
localhost mDiffExec_wrap /home/emalayan/App/Montage_v3.3/bin/mDiffExec null null null
localhost mFitExec /home/emalayan/App/Montage_v3.3/bin/mFitExec null null null
localhost mBgModel /home/emalayan/App/Montage_v3.3/bin/mBgModel null null null
localhost mBgExec /home/emalayan/App/Montage_v3.3/bin/mBgExec null null null
localhost mConcatFit /home/emalayan/App/Montage_v3.3/bin/mConcatFit null null nul
localhost Background_list /home/emalayan/App/montage-swift/SwiftMontage/apps/Background_list.py null null null
localhost create_status_table /home/emalayan/App/montage-swift/SwiftMontage/apps/create_status_table.py null null null
localhost mProjectPP_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProjectPP_wrap.py null null null
localhost mProject_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProject_wrap.py null null null
localhost mBackground_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mBackground_wrap.py null null null
localhost mDiffFit_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mDiffFit_wrap.py null null null
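One line of the tc file above ends in "nul" rather than "null", and it is the entry for mConcatFit, the job that failed. Whether that typo causes the failure is not established here, but it is cheap to rule out. A minimal Python sketch of a lint pass, assuming the six whitespace-separated columns used above (site, transformation, path, then three fields that are all null here); check_tc_line and check_tc_file are hypothetical helpers, not part of Swift:

```python
def check_tc_line(line: str) -> list[str]:
    """Return the problems found in one tc line (an empty list if none)."""
    fields = line.split()
    problems = []
    if len(fields) != 6:
        problems.append(f"expected 6 fields, found {len(fields)}")
    # Columns 4-6 are all 'null' in this tc file; flag values that look like
    # a truncated or misspelled 'null' (e.g. 'nul').
    for value in fields[3:]:
        if value.lower() != "null" and "null".startswith(value.lower()):
            problems.append(f"suspicious value {value!r} (typo for 'null'?)")
    return problems


def check_tc_file(path: str) -> None:
    """Print path:line: problem for every suspicious line in a tc file."""
    with open(path) as tc:
        for lineno, line in enumerate(tc, start=1):
            if not line.strip() or line.lstrip().startswith("#"):
                continue  # skip blank lines and comments
            for problem in check_tc_line(line):
                print(f"{path}:{lineno}: {problem}")
```

On the tc contents above, only the mConcatFit line would be reported.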
=================================================================
cf
wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=1
lazy.errors=true
status.mode=provider
use.provider.staging=true
provider.staging.pin.swiftfiles=false
foreach.max.threads=100
provenance.log=false
===================================================================
________________________________
From: Jonathan Monette
To: Ketan Maheshwari
Cc: Emalayan Vairavanathan ; swift user
Sent: Monday, 23 January 2012 11:08 AM
Subject: Re: [Swift-user] Montage+Swift+Coasters
Emalayan,
So I have run the scripts with some of my own test cases and do not see them failing. Could you provide your config files? Please provide the tc, sites, and config file (if you use a config file).
On Jan 20, 2012, at 9:39 AM, Ketan Maheshwari wrote:
>Emalayan,
>
>
>I would check all the mappers and the resulting paths in the Swift source.
>
>
>Also try running the failed job something like this:
>
>
>cd /SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>
>
>mConcatFit _concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5 fits.tbl stat_dir
>
>
>error 520 indicates workers are not able to reach the data.
>
>
>Also check if swift.workdir is writable on the site by the worker nodes.
>
>
>On Thu, Jan 19, 2012 at 7:55 PM, Emalayan Vairavanathan wrote:
>
>Hi Ketan,
>>
>>
>>This was with swift-0.92.1. I have now downloaded the latest Swift 0.93 and am getting totally different error messages with it. I can ask Jon about these messages. (These scripts were working well with Swift alone.)
>>
>>
>>
>>Please let me know if you have any idea.
>>
>>
>>
>>Regards
>>Emalayan
>>
>>
>>
>>===============================================================================================
>>
>>Swift 0.93 swift-r5501 cog-r3350
>>
>>RunID: 20120119-1749-rjshh1r9
>>(input): found 10 files
>>Progress: time: Thu, 19 Jan 2012 17:49:20 -0800
>>Find: http://localhost:1984
>>Find: keepalive(120), reconnect - http://localhost:1984
>>Progress: time: Thu, 19 Jan 2012 17:49:22 -0800 Stage in:1 Submitted:9
>>Progress: time: Thu, 19 Jan 2012 17:49:25 -0800 Active:9 Stage out:1
>>Progress: time: Thu, 19 Jan 2012 17:49:26 -0800 Stage out:3 Finished successfully:7
>>Progress: time: Thu, 19 Jan 2012 17:49:28 -0800 Active:1 Finished successfully:10
>>Progress: time: Thu, 19 Jan 2012 17:49:29 -0800 Stage in:1 Submitting:11 Submitted:6 Finished successfully:12
>>Progress: time: Thu, 19 Jan 2012 17:49:30 -0800 Stage in:4 Submitted:1 Active:6 Stage out:2 Finished successfully:17
>>Progress: time: Thu, 19 Jan 2012 17:49:31 -0800 Active:1 Finished successfully:30
>>Exception in mConcatFit:
>>Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5, fits.tbl, stat_dir]
>>Host: localhost
>>Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>>- - -
>>
>>Caused by: null
>>Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
>>Execution failed:
>>    back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
>>
>>
>>
>>
>>________________________________
>> From: Ketan Maheshwari
>>To: Emalayan Vairavanathan
>>Cc: swift user
>>Sent: Thursday, 19 January 2012 4:49 PM
>>Subject: Re: [Swift-user] Montage+Swift+Coasters
>>
>>
>>
>>Emalayan,
>>
>>
>>From your symptoms, it seems you are facing the same issue as I've been. Could you tell more about the amount of data that needs to be staged to run the Montage stages during which these warnings turn up? How much time elapses since the start of your workflow after which you see these messages?
>>
>>Also, what version of Swift is this?
>>
>>
>>Regards,
>>Ketan
>>
>>
>>On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan wrote:
>>
>>Dear All,
>>>
>>>
>>>I have a problem running Montage with Coasters (in our local cluster - no batch schedulers). After a few stages the Swift runtime continuously prints the warnings below. Any ideas? Should I increase the heartbeat count?
>>>
>>>
>>>
>>>Everything works fine when I try to run the same montage-scripts with swift on a single machine.
>>>
>>>
>>>
>>>Thank you
>>>Emalayan
>>>
>>>
>>>
>>>
>>>
>>>2012-01-19 15:38:09,207-0800 WARN Command Command(119, HEARTBEAT): handling reply timeout; sendReqTime=120119-153609.206, sendTime=120119-153609.206, now=120119-153809.207
>>>2012-01-19 15:38:09,207-0800 INFO Command Command(119, HEARTBEAT): re-sending
>>>2012-01-19 15:38:09,209-0800 WARN Command Command(119, HEARTBEAT)fault was: Reply timeout
>>>org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>>>        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>>>        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>>>        at java.util.TimerThread.mainLoop(Timer.java:534)
>>>        at java.util.TimerThread.run(Timer.java:484)
>>>
>>
>>
>>
>>--
>>Ketan
>>
>>
>>
>>
>>
>
>
>
>--
>Ketan
>
>
>
From ketancmaheshwari at gmail.com Mon Jan 23 13:36:08 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 23 Jan 2012 13:36:08 -0600
Subject: [Swift-user] Montage+Swift+Coasters
In-Reply-To: <1327346756.49322.YahooMailNeo@web39506.mail.mud.yahoo.com>
References: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
<1327024520.62801.YahooMailNeo@web39505.mail.mud.yahoo.com>
<4D008EF1-8072-4A18-8A5A-18D74A2ACDA2@mcs.anl.gov>
<1327346756.49322.YahooMailNeo@web39506.mail.mud.yahoo.com>
Message-ID:
Emalayan,
Likely, /tmp is not readable/writable across the machines (each node has its
own local /tmp). Could you try changing workdir to your /home?
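To check this directly, a marker file can be written into the candidate workdir from one node and looked for from another. A minimal Python sketch; write_probe and probe_visible are hypothetical helper names, and the idea is to call write_probe on the submit host and then probe_visible, with the same directory and the returned name, on a worker node:

```python
import os
import socket
import uuid


def write_probe(workdir: str) -> str:
    """Create a uniquely named marker file in workdir and return its name."""
    os.makedirs(workdir, exist_ok=True)
    host = socket.gethostname()
    name = f"probe-{host}-{uuid.uuid4().hex}"
    with open(os.path.join(workdir, name), "w") as f:
        f.write(f"written by {host}\n")
    return name


def probe_visible(workdir: str, name: str) -> bool:
    """True if the marker is visible here and workdir is writable from here."""
    marker = os.path.join(workdir, name)
    # os.access checks permissions only; it does not perform a real write.
    return os.path.isfile(marker) and os.access(workdir, os.W_OK)
```

With a node-local /tmp, probe_visible should return False on any machine other than the one that wrote the marker; with a shared /home directory it should return True on every node. Because os.access only checks permissions, an actual test write from the worker remains the strongest confirmation.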
On Mon, Jan 23, 2012 at 1:25 PM, Emalayan Vairavanathan <
svemalayan at yahoo.com> wrote:
> Jon,
>
> Please find the detail below and let me know if you have any questions
> about my setup.
>
> Thank you
> Emalayan
>
> ==========================================================
> site.xml
>
>
>
> jobmanager="local:local"/>
> passive
>
> 4
> 100000
> 100
> 100
> 100
> 1
> 10
> 25.00
> 10000
> proxy
>
> /tmp/swift.workdir
>
>
>
> =======================================================
>
> tc
>
> localhost sh /bin/sh null null null
> localhost cat /bin/cat null null null
> localhost echo /bin/echo null null null
> localhost do_merge /home/emalayan/App/forEmalayan/app/modmerge null null null
> localhost mProjExec /home/emalayan/App/Montage_v3.3/bin/mProjExec null null null
> localhost mImgtbl /home/emalayan/App/Montage_v3.3/bin/mImgtbl null null null
> localhost mAdd /home/emalayan/App/Montage_v3.3/bin/mAdd null null null
> localhost mOverlaps /home/emalayan/App/Montage_v3.3/bin/mOverlaps null null null
> localhost mJPEG /home/emalayan/App/Montage_v3.3/bin/mJPEG null null null
> localhost mDiffExec_wrap /home/emalayan/App/Montage_v3.3/bin/mDiffExec null null null
> localhost mFitExec /home/emalayan/App/Montage_v3.3/bin/mFitExec null null null
> localhost mBgModel /home/emalayan/App/Montage_v3.3/bin/mBgModel null null null
> localhost mBgExec /home/emalayan/App/Montage_v3.3/bin/mBgExec null null null
> localhost mConcatFit /home/emalayan/App/Montage_v3.3/bin/mConcatFit null null nul
> localhost Background_list /home/emalayan/App/montage-swift/SwiftMontage/apps/Background_list.py null null null
> localhost create_status_table /home/emalayan/App/montage-swift/SwiftMontage/apps/create_status_table.py null null null
> localhost mProjectPP_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProjectPP_wrap.py null null null
> localhost mProject_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProject_wrap.py null null null
> localhost mBackground_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mBackground_wrap.py null null null
> localhost mDiffFit_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mDiffFit_wrap.py null null null
>
> =================================================================
>
> cf
>
> wrapperlog.always.transfer=true
> sitedir.keep=true
> execution.retries=1
> lazy.errors=true
> status.mode=provider
> use.provider.staging=true
> provider.staging.pin.swiftfiles=false
> foreach.max.threads=100
> provenance.log=false
>
> ===================================================================
>
> ------------------------------
> From: Jonathan Monette
> To: Ketan Maheshwari
> Cc: Emalayan Vairavanathan ; swift user <swift-user at ci.uchicago.edu>
> Sent: Monday, 23 January 2012 11:08 AM
> Subject: Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
> So I have run the scripts with some of my own test cases and do not see
> them failing. Could you provide your config files? Please provide the tc,
> sites, and config file (if you use a config file).
>
> On Jan 20, 2012, at 9:39 AM, Ketan Maheshwari wrote:
>
> Emalayan,
>
> I would check all the mappers and the resulting paths in the Swift source.
>
> Also try running the failed job something like this:
>
> cd /SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>
> mConcatFit _concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5 fits.tbl stat_dir
>
> error 520 indicates workers are not able to reach the data.
>
> Also check if swift.workdir is writable on the site by the worker nodes.
>
> On Thu, Jan 19, 2012 at 7:55 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Hi Ketan,
>
> This was with swift-0.92.1. I have now downloaded the latest Swift 0.93
> and am getting totally different error messages with it. I can ask
> Jon about these messages. (These scripts were working well with Swift alone.)
>
> Please let me know if you have any idea.
>
> Regards
> Emalayan
>
>
> ===============================================================================================
> Swift 0.93 swift-r5501 cog-r3350
>
> RunID: 20120119-1749-rjshh1r9
> (input): found 10 files
> Progress: time: Thu, 19 Jan 2012 17:49:20 -0800
> Find: http://localhost:1984
> Find: keepalive(120), reconnect - http://localhost:1984
> Progress: time: Thu, 19 Jan 2012 17:49:22 -0800 Stage in:1 Submitted:9
> Progress: time: Thu, 19 Jan 2012 17:49:25 -0800 Active:9 Stage out:1
> Progress: time: Thu, 19 Jan 2012 17:49:26 -0800 Stage out:3 Finished successfully:7
> Progress: time: Thu, 19 Jan 2012 17:49:28 -0800 Active:1 Finished successfully:10
> Progress: time: Thu, 19 Jan 2012 17:49:29 -0800 Stage in:1 Submitting:11 Submitted:6 Finished successfully:12
> Progress: time: Thu, 19 Jan 2012 17:49:30 -0800 Stage in:4 Submitted:1 Active:6 Stage out:2 Finished successfully:17
> Progress: time: Thu, 19 Jan 2012 17:49:31 -0800 Active:1 Finished successfully:30
> Exception in mConcatFit:
> Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5, fits.tbl, stat_dir]
> Host: localhost
> Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
> - - -
>
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
> Execution failed:
> back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
>
> ------------------------------
> From: Ketan Maheshwari
> To: Emalayan Vairavanathan
> Cc: swift user
> Sent: Thursday, 19 January 2012 4:49 PM
> Subject: Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>
> From your symptoms, it seems you are facing the same issue as I've been.
> Could you tell more about the amount of data that needs to be staged to run
> the Montage stages during which these warnings turn up? How much time
> elapses since the start of your workflow after which you see these messages?
>
> Also, what version of Swift is this?
>
> Regards,
> Ketan
>
> On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Dear All,
>
> I have a problem running Montage with Coasters (in our local cluster
> - no batch schedulers). After a few stages the Swift runtime continuously
> prints the warnings below. Any ideas? Should I increase the heartbeat
> count?
>
> Everything works fine when I try to run the same montage-scripts with
> swift on a single machine.
>
> Thank you
> Emalayan
>
>
> 2012-01-19 15:38:09,207-0800 WARN Command Command(119, HEARTBEAT): handling reply timeout; sendReqTime=120119-153609.206, sendTime=120119-153609.206, now=120119-153809.207
> 2012-01-19 15:38:09,207-0800 INFO Command Command(119, HEARTBEAT): re-sending
> 2012-01-19 15:38:09,209-0800 WARN Command Command(119, HEARTBEAT)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
> at java.util.TimerThread.mainLoop(Timer.java:534)
> at java.util.TimerThread.run(Timer.java:484)
>
>
>
>
>
> --
> Ketan
>
>
>
>
>
>
>
> --
> Ketan
>
>
>
>
>
>
>
--
Ketan
From svemalayan at yahoo.com Mon Jan 23 13:50:46 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Mon, 23 Jan 2012 11:50:46 -0800 (PST)
Subject: [Swift-user] Montage+Swift+Coasters
In-Reply-To:
References: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
<1327024520.62801.YahooMailNeo@web39505.mail.mud.yahoo.com>
<4D008EF1-8072-4A18-8A5A-18D74A2ACDA2@mcs.anl.gov>
<1327346756.49322.YahooMailNeo@web39506.mail.mud.yahoo.com>
Message-ID: <1327348246.89718.YahooMailNeo@web39507.mail.mud.yahoo.com>
Thanks Ketan and Jon. I tried that, but it is still giving an error. I have attached the log file.
Thank you
Emalayan
________________________________
From: Ketan Maheshwari
To: Emalayan Vairavanathan
Cc: Jonathan Monette ; swift user
Sent: Monday, 23 January 2012 11:36 AM
Subject: Re: [Swift-user] Montage+Swift+Coasters
Emalayan,
Likely, /tmp is not readable/writable across the machines (each node has its own local /tmp). Could you try changing workdir to your /home?
On Mon, Jan 23, 2012 at 1:25 PM, Emalayan Vairavanathan wrote:
>Jon,
>
>
>Please find the detail below and let me know if you have any questions about my setup.
>
>
>
>Thank you
>Emalayan
>
>
>
>==========================================================
>
>site.xml
>
>
>
>
>
>    passive
>
>    4
>    100000
>    100
>    100
>    100
>    1
>    10
>    25.00
>    10000
>    proxy
>
>    /tmp/swift.workdir
>
>
>
>
>
>=======================================================
>
>
>tc
>
>
>localhost sh /bin/sh null null null
>localhost cat /bin/cat null null null
>localhost echo /bin/echo null null null
>localhost do_merge /home/emalayan/App/forEmalayan/app/modmerge null null null
>localhost mProjExec /home/emalayan/App/Montage_v3.3/bin/mProjExec null null null
>localhost mImgtbl /home/emalayan/App/Montage_v3.3/bin/mImgtbl null null null
>localhost mAdd /home/emalayan/App/Montage_v3.3/bin/mAdd null null null
>localhost mOverlaps /home/emalayan/App/Montage_v3.3/bin/mOverlaps null null null
>localhost mJPEG /home/emalayan/App/Montage_v3.3/bin/mJPEG null null null
>localhost mDiffExec_wrap /home/emalayan/App/Montage_v3.3/bin/mDiffExec null null null
>localhost mFitExec /home/emalayan/App/Montage_v3.3/bin/mFitExec null null null
>localhost mBgModel /home/emalayan/App/Montage_v3.3/bin/mBgModel null null null
>localhost mBgExec /home/emalayan/App/Montage_v3.3/bin/mBgExec null null null
>localhost mConcatFit /home/emalayan/App/Montage_v3.3/bin/mConcatFit null null nul
>
>localhost Background_list /home/emalayan/App/montage-swift/SwiftMontage/apps/Background_list.py null null null
>localhost create_status_table /home/emalayan/App/montage-swift/SwiftMontage/apps/create_status_table.py null null null
>localhost mProjectPP_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProjectPP_wrap.py null null null
>localhost mProject_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProject_wrap.py null null null
>localhost mBackground_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mBackground_wrap.py null null null
>localhost mDiffFit_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mDiffFit_wrap.py null null null
>
>
>=================================================================
>
>
>cf
>
>
>wrapperlog.always.transfer=true
>sitedir.keep=true
>execution.retries=1
>lazy.errors=true
>status.mode=provider
>use.provider.staging=true
>provider.staging.pin.swiftfiles=false
>foreach.max.threads=100
>provenance.log=false
>
>
>===================================================================
>
>
>
>
>________________________________
> From: Jonathan Monette
>To: Ketan Maheshwari
>Cc: Emalayan Vairavanathan ; swift user
>Sent: Monday, 23 January 2012 11:08 AM
>Subject: Re: [Swift-user] Montage+Swift+Coasters
>
>
>
>Emalayan,
>So I have run the scripts with some of my own test cases and do not see them failing. Could you provide your config files? Please provide the tc, sites, and config file (if you use a config file).
>
>
>On Jan 20, 2012, at 9:39 AM, Ketan Maheshwari wrote:
>
>Emalayan,
>>
>>
>>I would check all the mappers and the resulting paths in the Swift source.
>>
>>
>>Also try running the failed job something like this:
>>
>>
>>cd /SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>>
>>
>>mConcatFit _concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5 fits.tbl stat_dir
>>
>>
>>error 520 indicates workers are not able to reach the data.
>>
>>
>>Also check if swift.workdir is writable on the site by the worker nodes.
>>
>>
>>On Thu, Jan 19, 2012 at 7:55 PM, Emalayan Vairavanathan wrote:
>>
>>Hi Ketan,
>>>
>>>
>>>This was with swift-0.92.1. I have now downloaded the latest Swift 0.93 and am getting totally different error messages with it. I can ask Jon about these messages. (These scripts were working well with Swift alone.)
>>>
>>>
>>>
>>>Please let me know if you have any idea.
>>>
>>>
>>>
>>>Regards
>>>Emalayan
>>>
>>>
>>>
>>>===============================================================================================
>>>
>>>Swift 0.93 swift-r5501 cog-r3350
>>>
>>>RunID: 20120119-1749-rjshh1r9
>>> (input): found 10 files
>>>Progress:  time: Thu, 19 Jan 2012 17:49:20 -0800
>>>Find: http://localhost:1984
>>>Find:  keepalive(120), reconnect - http://localhost:1984
>>>Progress:  time: Thu, 19 Jan 2012 17:49:22 -0800  Stage in:1  Submitted:9
>>>Progress:  time: Thu, 19 Jan 2012 17:49:25 -0800  Active:9  Stage out:1
>>>Progress:  time: Thu, 19 Jan 2012 17:49:26 -0800  Stage out:3  Finished successfully:7
>>>Progress:  time: Thu, 19 Jan 2012 17:49:28 -0800  Active:1  Finished successfully:10
>>>Progress:  time: Thu, 19 Jan 2012 17:49:29 -0800  Stage in:1  Submitting:11  Submitted:6  Finished successfully:12
>>>Progress:  time: Thu, 19 Jan 2012 17:49:30 -0800  Stage in:4  Submitted:1  Active:6  Stage out:2  Finished successfully:17
>>>Progress:  time: Thu, 19 Jan 2012 17:49:31 -0800  Active:1  Finished successfully:30
>>>Exception in mConcatFit:
>>>Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5, fits.tbl, stat_dir]
>>>Host: localhost
>>>Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>>>- - -
>>>
>>>Caused by: null
>>>Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
>>>Execution failed:
>>>    back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
>>>
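When comparing runs, such as the 0.92.1 and 0.93 attempts, it can help to pull the state counts out of the `Progress:` lines. This is a minimal sketch based only on the line format in the output above. `progress_counts` is a hypothetical helper, not a Swift tool.

```python
import re
from typing import Dict

# "<State>:<count>" pairs, where the state name is letters and spaces
# (e.g. "Stage in", "Finished successfully"). In the date part of the line
# no run of letters is directly followed by ":<digits>", so it is skipped.
_PAIR = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")

def progress_counts(line: str) -> Dict[str, int]:
    """Parse one Swift 'Progress:' line into {state: count}."""
    return {name: int(count) for name, count in _PAIR.findall(line)}
```

For the last progress line before the mConcatFit exception, this gives `{'Active': 1, 'Finished successfully': 30}`.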
>>>
>>>
>>>
>>>________________________________
>>> From: Ketan Maheshwari
>>>To: Emalayan Vairavanathan
>>>Cc: swift user
>>>Sent: Thursday, 19 January 2012 4:49 PM
>>>Subject: Re: [Swift-user] Montage+Swift+Coasters
>>>
>>>
>>>
>>>Emalayan,
>>>
>>>
>>>From your symptoms, it seems you are facing the same issue as I have been. Could you tell us more about the amount of data that needs to be staged to run the Montage stages during which these warnings turn up? How much time elapses between the start of your workflow and the appearance of these messages?
>>>
>>>Also, what version of Swift is this?
>>>
>>>
>>>Regards,
>>>Ketan
>>>
>>>
>>>On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan wrote:
>>>
>>>Dear All,
>>>>
>>>>
>>>>I have a problem running Montage with Coasters (on our local cluster, no batch schedulers). After a few stages the Swift runtime continuously prints the warnings below. Any ideas? Should I increase the heartbeat count?
>>>>
>>>>
>>>>
>>>>Everything works fine when I try to run the same montage-scripts with swift on a single machine.
>>>>
>>>>
>>>>
>>>>Thank you
>>>>Emalayan
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>2012-01-19 15:38:09,207-0800 WARN  Command Command(119, HEARTBEAT): handling reply timeout; sendReqTime=120119-153609.206, sendTime=120119-153609.206, now=120119-153809.207
>>>>2012-01-19 15:38:09,207-0800 INFO  Command Command(119, HEARTBEAT): re-sending
>>>>2012-01-19 15:38:09,209-0800 WARN  Command Command(119, HEARTBEAT)fault was: Reply timeout
>>>>org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>>>>        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>>>>        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>>>>        at java.util.TimerThread.mainLoop(Timer.java:534)
>>>>        at java.util.TimerThread.run(Timer.java:484)
>>>>_______________________________________________
>>>>Swift-user mailing list
>>>>Swift-user at ci.uchicago.edu
>>>>https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>>>
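To see how long a heartbeat went unanswered, subtract the `sendTime` and `now` timestamps in the warning. This is a minimal sketch. The yyMMdd-HHmmss.mmm format is inferred from the log line itself, and `reply_wait_seconds` is a hypothetical helper.

```python
import re
from datetime import datetime

# Matches e.g. "sendTime=120119-153609.206"; the value is read as
# yyMMdd-HHmmss.mmm, an assumption based on the log line itself.
_STAMP = re.compile(r"\b(sendTime|now)=(\d{6}-\d{6}\.\d{3})")
_FMT = "%y%m%d-%H%M%S.%f"  # %f accepts the 3-digit millisecond field

def reply_wait_seconds(log_line: str) -> float:
    """Seconds between sendTime and now in a 'handling reply timeout' warning."""
    stamps = dict(_STAMP.findall(log_line))
    sent = datetime.strptime(stamps["sendTime"], _FMT)
    now = datetime.strptime(stamps["now"], _FMT)
    return (now - sent).total_seconds()
```

For the warning quoted above, this gives 120.001 s. In other words, the heartbeat went unanswered for about two minutes before it was re-sent.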
>>>
>>>
>>>
>>>--
>>>Ketan
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>--
>>Ketan
>>
>>
>>
>>_______________________________________________
>>Swift-user mailing list
>>Swift-user at ci.uchicago.edu
>>https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>
>
>
--
Ketan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SwiftMontage-20120123-1147-9ik4yhc0.log.gz
Type: application/x-gzip
Size: 5050 bytes
Desc: not available
URL:
From ketancmaheshwari at gmail.com Mon Jan 23 13:55:45 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 23 Jan 2012 13:55:45 -0600
Subject: [Swift-user] Montage+Swift+Coasters
In-Reply-To: <1327348246.89718.YahooMailNeo@web39507.mail.mud.yahoo.com>
References: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
<1327024520.62801.YahooMailNeo@web39505.mail.mud.yahoo.com>
<4D008EF1-8072-4A18-8A5A-18D74A2ACDA2@mcs.anl.gov>
<1327346756.49322.YahooMailNeo@web39506.mail.mud.yahoo.com>
<1327348246.89718.YahooMailNeo@web39507.mail.mud.yahoo.com>
Message-ID:
How are you starting the service? Are you starting workers manually? If
yes, could you paste the command lines for both?
On Mon, Jan 23, 2012 at 1:50 PM, Emalayan Vairavanathan <
svemalayan at yahoo.com> wrote:
> Thanks Ketan and Jon. I tried, but it is still giving an error. I have
> attached the log file.
>
> Thank you
> Emalayan
>
> ------------------------------
> *From:* Ketan Maheshwari
> *To:* Emalayan Vairavanathan
> *Cc:* Jonathan Monette ; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 11:36 AM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>
> Likely, /tmp is not readable/writable across the machines. Could you try
> changing workdir to your /home?
>
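One way to test whether /tmp is shared between machines is to drop a marker file into the work directory on the head node and then look for it from a worker node. This is a minimal sketch. `write_marker`, `marker_visible` and the `marker-<token>` file naming are hypothetical. How you run the second function on a worker (ssh, a one-off job) is left open.

```python
import os
import uuid

def write_marker(workdir: str) -> str:
    """Run on the submit/head node: create a uniquely named marker file in workdir."""
    token = uuid.uuid4().hex
    with open(os.path.join(workdir, f"marker-{token}"), "w") as fh:
        fh.write(token)
    return token

def marker_visible(workdir: str, token: str) -> bool:
    """Run on a worker node: True if the head node's marker is visible there."""
    path = os.path.join(workdir, f"marker-{token}")
    try:
        with open(path) as fh:
            return fh.read() == token
    except OSError:
        return False
```

If `marker_visible` returns False on a worker for a directory such as /tmp/swift.workdir, that directory is not shared with the head node.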
> On Mon, Jan 23, 2012 at 1:25 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Jon,
>
> Please find the detail below and let me know if you have any questions
> about my setup.
>
> Thank you
> Emalayan
>
> ==========================================================
> site.xml
>
>
>
> jobmanager="local:local"/>
> passive
>
> 4
> 100000
> 100
> 100
> 100
> 1
> 10
> 25.00
> 10000
> proxy
>
> /tmp/swift.workdir
>
>
>
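The archive has stripped the XML tags from the site.xml above. The element names below are therefore an assumption based on the usual Swift 0.9x site catalog layout: a `<pool handle="...">` containing a `<workdirectory>` element. Under that assumption, a short check can flag pools whose work directory sits under a likely node-local path such as /tmp. `workdirs`, `node_local_workdirs` and the default prefix list are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable

def _local(tag: str) -> str:
    # drop an "{namespace}" prefix, if the file declares a default namespace
    return tag.rsplit("}", 1)[-1]

def workdirs(sites_xml: str) -> Dict[str, str]:
    """Map each pool handle to the text of its workdirectory element."""
    root = ET.fromstring(sites_xml)
    result: Dict[str, str] = {}
    for pool in root.iter():
        if _local(pool.tag) != "pool":
            continue
        for child in pool:
            if _local(child.tag) == "workdirectory" and child.text and child.text.strip():
                result[pool.get("handle", "")] = child.text.strip()
    return result

def node_local_workdirs(sites_xml: str,
                        local_prefixes: Iterable[str] = ("/tmp", "/var/tmp")) -> Dict[str, str]:
    """Return the pools whose workdir lies under one of the (assumed node-local) prefixes."""
    prefixes = tuple(local_prefixes)
    return {
        handle: wd
        for handle, wd in workdirs(sites_xml).items()
        if any(wd == p or wd.startswith(p + "/") for p in prefixes)
    }
```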
> =======================================================
>
> tc
>
> localhost sh /bin/sh null null null
> localhost cat /bin/cat null null null
> localhost echo /bin/echo null null null
> localhost do_merge /home/emalayan/App/forEmalayan/app/modmerge null null null
> localhost mProjExec /home/emalayan/App/Montage_v3.3/bin/mProjExec null null null
> localhost mImgtbl /home/emalayan/App/Montage_v3.3/bin/mImgtbl null null null
> localhost mAdd /home/emalayan/App/Montage_v3.3/bin/mAdd null null null
> localhost mOverlaps /home/emalayan/App/Montage_v3.3/bin/mOverlaps null null null
> localhost mJPEG /home/emalayan/App/Montage_v3.3/bin/mJPEG null null null
> localhost mDiffExec_wrap /home/emalayan/App/Montage_v3.3/bin/mDiffExec null null null
> localhost mFitExec /home/emalayan/App/Montage_v3.3/bin/mFitExec null null null
> localhost mBgModel /home/emalayan/App/Montage_v3.3/bin/mBgModel null null null
> localhost mBgExec /home/emalayan/App/Montage_v3.3/bin/mBgExec null null null
> localhost mConcatFit /home/emalayan/App/Montage_v3.3/bin/mConcatFit null null nul
>
> localhost Background_list /home/emalayan/App/montage-swift/SwiftMontage/apps/Background_list.py null null null
> localhost create_status_table /home/emalayan/App/montage-swift/SwiftMontage/apps/create_status_table.py null null null
> localhost mProjectPP_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProjectPP_wrap.py null null null
> localhost mProject_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProject_wrap.py null null null
> localhost mBackground_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mBackground_wrap.py null null null
> localhost mDiffFit_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mDiffFit_wrap.py null null null
>
> =================================================================
>
> cf
>
> wrapperlog.always.transfer=true
> sitedir.keep=true
> execution.retries=1
> lazy.errors=true
> status.mode=provider
> use.provider.staging=true
> provider.staging.pin.swiftfiles=false
> foreach.max.threads=100
> provenance.log=false
>
> ===================================================================
>
> ------------------------------
> *From:* Jonathan Monette
> *To:* Ketan Maheshwari
> *Cc:* Emalayan Vairavanathan ; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 11:08 AM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
> So I have run the scripts with some of my own test cases and do not see
> it failing. Could you provide your config files? Please provide the tc,
> sites, and config file (if you use a config file).
>
> On Jan 20, 2012, at 9:39 AM, Ketan Maheshwari wrote:
>
> Emalayan,
>
> I would check all the mappers and the resulting paths in the Swift source.
>
> Also try running the failed job something like this:
>
> cd /SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>
> mConcatFit _concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5 fits.tbl stat_dir
>
> Error 520 indicates that the workers are not able to reach the data.
>
> Also check if swift.workdir is writable on the site by the worker nodes.
>
> On Thu, Jan 19, 2012 at 7:55 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Hi Ketan,
>
> This was with swift-0.92.1. Now I have downloaded the latest Swift 0.93
> and am getting totally different error messages with it. I can ask
> Jon about these messages. (These scripts were working well with Swift alone.)
>
> Please let me know if you have any idea.
>
> Regards
> Emalayan
>
>
> ===============================================================================================
> Swift 0.93 swift-r5501 cog-r3350
>
> RunID: 20120119-1749-rjshh1r9
> (input): found 10 files
> Progress: time: Thu, 19 Jan 2012 17:49:20 -0800
> Find: http://localhost:1984
> Find: keepalive(120), reconnect - http://localhost:1984
> Progress: time: Thu, 19 Jan 2012 17:49:22 -0800 Stage in:1 Submitted:9
> Progress: time: Thu, 19 Jan 2012 17:49:25 -0800 Active:9 Stage out:1
> Progress: time: Thu, 19 Jan 2012 17:49:26 -0800 Stage out:3 Finished successfully:7
> Progress: time: Thu, 19 Jan 2012 17:49:28 -0800 Active:1 Finished successfully:10
> Progress: time: Thu, 19 Jan 2012 17:49:29 -0800 Stage in:1 Submitting:11 Submitted:6 Finished successfully:12
> Progress: time: Thu, 19 Jan 2012 17:49:30 -0800 Stage in:4 Submitted:1 Active:6 Stage out:2 Finished successfully:17
> Progress: time: Thu, 19 Jan 2012 17:49:31 -0800 Active:1 Finished successfully:30
> Exception in mConcatFit:
> Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5, fits.tbl, stat_dir]
> Host: localhost
> Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
> - - -
>
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
> Execution failed:
>     back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
>
> ------------------------------
> *From:* Ketan Maheshwari
> *To:* Emalayan Vairavanathan
> *Cc:* swift user
> *Sent:* Thursday, 19 January 2012 4:49 PM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>
> From your symptoms, it seems you are facing the same issue as I have been.
> Could you tell us more about the amount of data that needs to be staged to run
> the Montage stages during which these warnings turn up? How much time
> elapses between the start of your workflow and the appearance of these messages?
>
> Also, what version of Swift is this?
>
> Regards,
> Ketan
>
> On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Dear All,
>
> I have a problem running Montage with Coasters (on our local cluster,
> no batch schedulers). After a few stages the Swift runtime continuously
> prints the warnings below. Any ideas? Should I increase the heartbeat
> count?
>
> Everything works fine when I try to run the same montage-scripts with
> swift on a single machine.
>
> Thank you
> Emalayan
>
>
> 2012-01-19 15:38:09,207-0800 WARN Command Command(119, HEARTBEAT): handling reply timeout; sendReqTime=120119-153609.206, sendTime=120119-153609.206, now=120119-153809.207
> 2012-01-19 15:38:09,207-0800 INFO Command Command(119, HEARTBEAT): re-sending
> 2012-01-19 15:38:09,209-0800 WARN Command Command(119, HEARTBEAT)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>         at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>         at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>         at java.util.TimerThread.mainLoop(Timer.java:534)
>         at java.util.TimerThread.run(Timer.java:484)
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>
>
>
>
> --
> Ketan
>
>
>
>
>
>
>
> --
> Ketan
>
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>
>
>
>
>
>
>
> --
> Ketan
>
>
>
>
>
--
Ketan
From svemalayan at yahoo.com Mon Jan 23 14:25:50 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Mon, 23 Jan 2012 12:25:50 -0800 (PST)
Subject: [Swift-user] Montage+Swift+Coasters
In-Reply-To:
References: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
<1327024520.62801.YahooMailNeo@web39505.mail.mud.yahoo.com>
<4D008EF1-8072-4A18-8A5A-18D74A2ACDA2@mcs.anl.gov>
<1327346756.49322.YahooMailNeo@web39506.mail.mud.yahoo.com>
<1327348246.89718.YahooMailNeo@web39507.mail.mud.yahoo.com>
Message-ID: <1327350350.29451.YahooMailNeo@web39501.mail.mud.yahoo.com>
I am using swift-0.93. I started only the coaster-service manually, using the following command (the workers were started automatically).
coaster-service -port 1984 -localport 35753 -nosec
The application then prints the following output and terminates. (I have attached the log file to this mail. Please discard the previous log file, because the system was not configured properly.)
Please let me know if you need more information.
Thank you
Emalayan
====================================================================================
Swift 0.93 swift-r5501 (swift modified locally) cog-r3350
RunID: 20120123-1219-zj95uaye
 (input): found 10 files
Progress:  time: Mon, 23 Jan 2012 12:19:39 -0800
Find: http://localhost:1984
Find:  keepalive(120), reconnect - http://localhost:1984
Progress:  time: Mon, 23 Jan 2012 12:19:41 -0800  Stage in:1  Submitted:9
Progress:  time: Mon, 23 Jan 2012 12:19:45 -0800  Active:9  Stage out:1
Progress:  time: Mon, 23 Jan 2012 12:19:46 -0800  Active:6  Stage out:2  Finished successfully:2
Progress:  time: Mon, 23 Jan 2012 12:19:47 -0800  Submitted:1  Finished successfully:10
Progress:  time: Mon, 23 Jan 2012 12:19:49 -0800  Active:1  Finished successfully:10
Progress:  time: Mon, 23 Jan 2012 12:19:50 -0800  Submitted:1  Finished successfully:12
Progress:  time: Mon, 23 Jan 2012 12:19:51 -0800  Stage in:12  Submitted:5  Finished successfully:13
Progress:  time: Mon, 23 Jan 2012 12:19:52 -0800  Stage in:1  Submitted:5  Active:9  Stage out:2  Finished successfully:13
Progress:  time: Mon, 23 Jan 2012 12:19:53 -0800  Active:5  Finished successfully:25
Exception in mConcatFit:
Arguments: [_concurrent/status_tbl-bf92dd4d-ecf0-490e-ab93-cf7863688950-5, fits.tbl, stat_dir]
Host: localhost
Directory: SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk
- - -
Caused by: null
Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
Execution failed:
    back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
[emalayan at node090 scripts]$
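It may be worth confirming from a worker node that the ports given to `coaster-service` above (1984 and 35753) are reachable. This is a minimal sketch using a plain TCP connect. `port_open` is hypothetical, and `headnode` is a placeholder for the machine running the service. A successful connect only shows that the port is open, not that the coaster protocol works.

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or unknown host
        return False

# Example, run on a worker node ("headnode" is a placeholder):
# for port in (1984, 35753):
#     print(port, port_open("headnode", port))
```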
________________________________
From: Ketan Maheshwari
To: Emalayan Vairavanathan
Cc: Jonathan Monette ; swift user
Sent: Monday, 23 January 2012 11:55 AM
Subject: Re: [Swift-user] Montage+Swift+Coasters
How are you starting the service? Are you starting workers manually? If yes, could you paste the command lines for both?
On Mon, Jan 23, 2012 at 1:50 PM, Emalayan Vairavanathan wrote:
Thanks Ketan and Jon. I tried, but it is still giving an error. I have attached the log file.
>
>
>Thank you
>Emalayan
>
>
>
>
>________________________________
> From: Ketan Maheshwari
>To: Emalayan Vairavanathan
>Cc: Jonathan Monette ; swift user
>Sent: Monday, 23 January 2012 11:36 AM
>Subject: Re: [Swift-user] Montage+Swift+Coasters
>
>
>
>Emalayan,
>
>
>Likely, /tmp is not readable/writable across the machines. Could you try changing workdir to your /home?
>
>
>On Mon, Jan 23, 2012 at 1:25 PM, Emalayan Vairavanathan wrote:
>
>Jon,
>>
>>
>>Please find the detail below and let me know if you have any questions about my setup.
>>
>>
>>
>>Thank you
>>Emalayan
>>
>>
>>
>>==========================================================
>>
>>site.xml
>>
>>
>>
>>
>>
>>    passive
>>
>>    4
>>    100000
>>    100
>>    100
>>    100
>>    1
>>    10
>>    25.00
>>    10000
>>    proxy
>>
>>    /tmp/swift.workdir
>>
>>
>>
>>
>>
>>=======================================================
>>
>>
>>tc
>>
>>
>>localhost sh /bin/sh null null null
>>localhost cat /bin/cat null null null
>>localhost echo /bin/echo null null null
>>localhost do_merge /home/emalayan/App/forEmalayan/app/modmerge null null null
>>localhost mProjExec /home/emalayan/App/Montage_v3.3/bin/mProjExec null null null
>>localhost mImgtbl /home/emalayan/App/Montage_v3.3/bin/mImgtbl null null null
>>localhost mAdd /home/emalayan/App/Montage_v3.3/bin/mAdd null null null
>>localhost mOverlaps /home/emalayan/App/Montage_v3.3/bin/mOverlaps null null null
>>localhost mJPEG /home/emalayan/App/Montage_v3.3/bin/mJPEG null null null
>>localhost mDiffExec_wrap /home/emalayan/App/Montage_v3.3/bin/mDiffExec null null null
>>localhost mFitExec /home/emalayan/App/Montage_v3.3/bin/mFitExec null null null
>>localhost mBgModel /home/emalayan/App/Montage_v3.3/bin/mBgModel null null null
>>localhost mBgExec /home/emalayan/App/Montage_v3.3/bin/mBgExec null null null
>>localhost mConcatFit /home/emalayan/App/Montage_v3.3/bin/mConcatFit null null nul
>>
>>localhost Background_list /home/emalayan/App/montage-swift/SwiftMontage/apps/Background_list.py null null null
>>localhost create_status_table /home/emalayan/App/montage-swift/SwiftMontage/apps/create_status_table.py null null null
>>localhost mProjectPP_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProjectPP_wrap.py null null null
>>localhost mProject_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProject_wrap.py null null null
>>localhost mBackground_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mBackground_wrap.py null null null
>>localhost mDiffFit_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mDiffFit_wrap.py null null null
>>
>>
>>=================================================================
>>
>>
>>cf
>>
>>
>>wrapperlog.always.transfer=true
>>sitedir.keep=true
>>execution.retries=1
>>lazy.errors=true
>>status.mode=provider
>>use.provider.staging=true
>>provider.staging.pin.swiftfiles=false
>>foreach.max.threads=100
>>provenance.log=false
>>
>>
>>===================================================================
>>
>>
>>
>>
>>________________________________
>> From: Jonathan Monette
>>To: Ketan Maheshwari
>>Cc: Emalayan Vairavanathan ; swift user
>>Sent: Monday, 23 January 2012 11:08 AM
>>Subject: Re: [Swift-user] Montage+Swift+Coasters
>>
>>
>>
>>Emalayan,
>>So I have run the scripts with some of my own test cases and do not see it failing. Could you provide your config files? Please provide the tc, sites, and config file (if you use a config file).
>>
>>
>>On Jan 20, 2012, at 9:39 AM, Ketan Maheshwari wrote:
>>
>>Emalayan,
>>>
>>>
>>>I would check all the mappers and the resulting paths in the Swift source.
>>>
>>>
>>>Also try running the failed job something like this:
>>>
>>>
>>>cd /SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>>>
>>>
>>>mConcatFit _concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5 fits.tbl stat_dir
>>>
>>>
>>>Error 520 indicates that the workers are not able to reach the data.
>>>
>>>
>>>Also check if swift.workdir is writable on the site by the worker nodes.
>>>
>>>
>>>On Thu, Jan 19, 2012 at 7:55 PM, Emalayan Vairavanathan wrote:
>>>
>>>Hi Ketan,
>>>>
>>>>
>>>>This was with swift-0.92.1. Now I have downloaded the latest Swift 0.93 and am getting totally different error messages with it. I can ask Jon about these messages. (These scripts were working well with Swift alone.)
>>>>
>>>>
>>>>
>>>>Please let me know if you have any idea.
>>>>
>>>>
>>>>
>>>>Regards
>>>>Emalayan
>>>>
>>>>
>>>>
>>>>===============================================================================================
>>>>
>>>>Swift 0.93 swift-r5501 cog-r3350
>>>>
>>>>RunID: 20120119-1749-rjshh1r9
>>>> (input): found 10 files
>>>>Progress:  time: Thu, 19 Jan 2012 17:49:20 -0800
>>>>Find: http://localhost:1984
>>>>Find:  keepalive(120), reconnect - http://localhost:1984
>>>>Progress:  time: Thu, 19 Jan 2012 17:49:22 -0800  Stage in:1  Submitted:9
>>>>Progress:  time: Thu, 19 Jan 2012 17:49:25 -0800  Active:9  Stage out:1
>>>>Progress:  time: Thu, 19 Jan 2012 17:49:26 -0800  Stage out:3  Finished successfully:7
>>>>Progress:  time: Thu, 19 Jan 2012 17:49:28 -0800  Active:1  Finished successfully:10
>>>>Progress:  time: Thu, 19 Jan 2012 17:49:29 -0800  Stage in:1  Submitting:11  Submitted:6  Finished successfully:12
>>>>Progress:  time: Thu, 19 Jan 2012 17:49:30 -0800  Stage in:4  Submitted:1  Active:6  Stage out:2  Finished successfully:17
>>>>Progress:  time: Thu, 19 Jan 2012 17:49:31 -0800  Active:1  Finished successfully:30
>>>>Exception in mConcatFit:
>>>>Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5, fits.tbl, stat_dir]
>>>>Host: localhost
>>>>Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>>>>- - -
>>>>
>>>>Caused by: null
>>>>Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
>>>>Execution failed:
>>>>    back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
>>>>
>>>>
>>>>
>>>>
>>>>________________________________
>>>> From: Ketan Maheshwari
>>>>To: Emalayan Vairavanathan
>>>>Cc: swift user
>>>>Sent: Thursday, 19 January 2012 4:49 PM
>>>>Subject: Re: [Swift-user] Montage+Swift+Coasters
>>>>
>>>>
>>>>
>>>>Emalayan,
>>>>
>>>>
>>>>From your symptoms, it seems you are facing the same issue as I have been. Could you tell us more about the amount of data that needs to be staged to run the Montage stages during which these warnings turn up? How much time elapses between the start of your workflow and the appearance of these messages?
>>>>
>>>>Also, what version of Swift is this?
>>>>
>>>>
>>>>Regards,
>>>>Ketan
>>>>
>>>>
>>>>On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan wrote:
>>>>
>>>>Dear All,
>>>>>
>>>>>
>>>>>I have a problem running Montage with Coasters (on our local cluster, no batch schedulers). After a few stages the Swift runtime continuously prints the warnings below. Any ideas? Should I increase the heartbeat count?
>>>>>
>>>>>
>>>>>
>>>>>Everything works fine when I try to run the same montage-scripts with swift on a single machine.
>>>>>
>>>>>
>>>>>
>>>>>Thank you
>>>>>Emalayan
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>2012-01-19 15:38:09,207-0800 WARN  Command Command(119, HEARTBEAT): handling reply timeout; sendReqTime=120119-153609.206, sendTime=120119-153609.206, now=120119-153809.207
>>>>>2012-01-19 15:38:09,207-0800 INFO  Command Command(119, HEARTBEAT): re-sending
>>>>>2012-01-19 15:38:09,209-0800 WARN  Command Command(119, HEARTBEAT)fault was: Reply timeout
>>>>>org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>>>>>        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>>>>>        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>>>>>        at java.util.TimerThread.mainLoop(Timer.java:534)
>>>>>        at java.util.TimerThread.run(Timer.java:484)
>>>>>_______________________________________________
>>>>>Swift-user mailing list
>>>>>Swift-user at ci.uchicago.edu
>>>>>https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>>>>
>>>>
>>>>
>>>>
>>>>--
>>>>Ketan
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>--
>>>Ketan
>>>
>>>
>>>
>>>_______________________________________________
>>>Swift-user mailing list
>>>Swift-user at ci.uchicago.edu
>>>https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>
>>
>>
>
>
>
>--
>Ketan
>
>
>
>
>
--
Ketan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SwiftMontage-20120123-1219-zj95uaye.log.gz
Type: application/x-gzip
Size: 17804 bytes
Desc: not available
URL:
From ketancmaheshwari at gmail.com Mon Jan 23 14:57:45 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 23 Jan 2012 14:57:45 -0600
Subject: [Swift-user] Montage+Swift+Coasters
In-Reply-To: <1327350350.29451.YahooMailNeo@web39501.mail.mud.yahoo.com>
References: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
<1327024520.62801.YahooMailNeo@web39505.mail.mud.yahoo.com>
<4D008EF1-8072-4A18-8A5A-18D74A2ACDA2@mcs.anl.gov>
<1327346756.49322.YahooMailNeo@web39506.mail.mud.yahoo.com>
<1327348246.89718.YahooMailNeo@web39507.mail.mud.yahoo.com>
<1327350350.29451.YahooMailNeo@web39501.mail.mud.yahoo.com>
Message-ID:
Emalayan, could you also send your Swift source?
Have you tried running mConcatFit from within the
SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk directory?
On Mon, Jan 23, 2012 at 2:25 PM, Emalayan Vairavanathan <
svemalayan at yahoo.com> wrote:
> I am using swift-0.93. I started only the coaster-service manually, using the
> following command (the workers were started automatically).
>
> coaster-service -port 1984 -localport 35753 -nosec
>
> The application then prints the following output and terminates. (I have attached
> the log file to this mail. Please discard the previous log file, because
> the system was not configured properly.)
>
> Please let me know if you need more information.
>
> Thank you
> Emalayan
>
>
> ====================================================================================
> Swift 0.93 swift-r5501 (swift modified locally) cog-r3350
>
> RunID: 20120123-1219-zj95uaye
> (input): found 10 files
> Progress: time: Mon, 23 Jan 2012 12:19:39 -0800
>
> Find: http://localhost:1984
> Find: keepalive(120), reconnect - http://localhost:1984
> Progress: time: Mon, 23 Jan 2012 12:19:41 -0800 Stage in:1 Submitted:9
> Progress: time: Mon, 23 Jan 2012 12:19:45 -0800 Active:9 Stage out:1
> Progress: time: Mon, 23 Jan 2012 12:19:46 -0800 Active:6 Stage out:2 Finished successfully:2
> Progress: time: Mon, 23 Jan 2012 12:19:47 -0800 Submitted:1 Finished successfully:10
> Progress: time: Mon, 23 Jan 2012 12:19:49 -0800 Active:1 Finished successfully:10
> Progress: time: Mon, 23 Jan 2012 12:19:50 -0800 Submitted:1 Finished successfully:12
> Progress: time: Mon, 23 Jan 2012 12:19:51 -0800 Stage in:12 Submitted:5 Finished successfully:13
> Progress: time: Mon, 23 Jan 2012 12:19:52 -0800 Stage in:1 Submitted:5 Active:9 Stage out:2 Finished successfully:13
> Progress: time: Mon, 23 Jan 2012 12:19:53 -0800 Active:5 Finished successfully:25
> Exception in mConcatFit:
> Arguments: [_concurrent/status_tbl-bf92dd4d-ecf0-490e-ab93-cf7863688950-5, fits.tbl, stat_dir]
> Host: localhost
> Directory: SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk
> - - -
>
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
> Execution failed:
>     back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
> [emalayan at node090 scripts]$
>
>
> ------------------------------
> *From:* Ketan Maheshwari
> *To:* Emalayan Vairavanathan
> *Cc:* Jonathan Monette ; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 11:55 AM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> How are you starting the service? Are you starting workers manually? If
> yes, could you paste the command lines for both?
>
> On Mon, Jan 23, 2012 at 1:50 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Thanks Ketan and Jon. I tried, but it is still giving an error. I have
> attached the log file.
>
> Thank you
> Emalayan
>
> ------------------------------
> *From:* Ketan Maheshwari
> *To:* Emalayan Vairavanathan
> *Cc:* Jonathan Monette ; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 11:36 AM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>
> Likely, /tmp is not readable/writable across the machines. Could you try
> changing workdir to your /home?
>
> On Mon, Jan 23, 2012 at 1:25 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Jon,
>
> Please find the detail below and let me know if you have any questions
> about my setup.
>
> Thank you
> Emalayan
>
> ==========================================================
> site.xml
>
>
>
> jobmanager="local:local"/>
> passive
>
> 4
> 100000
> 100
> 100
> 100
> 1
> 10
> 25.00
> 10000
> proxy
>
> /tmp/swift.workdir
>
>
>
> =======================================================
>
> tc
>
> localhost sh /bin/sh null null null
> localhost cat /bin/cat null null null
> localhost echo /bin/echo null null null
> localhost do_merge /home/emalayan/App/forEmalayan/app/modmerge null null null
> localhost mProjExec /home/emalayan/App/Montage_v3.3/bin/mProjExec null null null
> localhost mImgtbl /home/emalayan/App/Montage_v3.3/bin/mImgtbl null null null
> localhost mAdd /home/emalayan/App/Montage_v3.3/bin/mAdd null null null
> localhost mOverlaps /home/emalayan/App/Montage_v3.3/bin/mOverlaps null null null
> localhost mJPEG /home/emalayan/App/Montage_v3.3/bin/mJPEG null null null
> localhost mDiffExec_wrap /home/emalayan/App/Montage_v3.3/bin/mDiffExec null null null
> localhost mFitExec /home/emalayan/App/Montage_v3.3/bin/mFitExec null null null
> localhost mBgModel /home/emalayan/App/Montage_v3.3/bin/mBgModel null null null
> localhost mBgExec /home/emalayan/App/Montage_v3.3/bin/mBgExec null null null
> localhost mConcatFit /home/emalayan/App/Montage_v3.3/bin/mConcatFit null null nul
>
> localhost Background_list /home/emalayan/App/montage-swift/SwiftMontage/apps/Background_list.py null null null
> localhost create_status_table /home/emalayan/App/montage-swift/SwiftMontage/apps/create_status_table.py null null null
> localhost mProjectPP_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProjectPP_wrap.py null null null
> localhost mProject_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProject_wrap.py null null null
> localhost mBackground_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mBackground_wrap.py null null null
> localhost mDiffFit_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mDiffFit_wrap.py null null null
>
> =================================================================
>
> cf
>
> wrapperlog.always.transfer=true
> sitedir.keep=true
> execution.retries=1
> lazy.errors=true
> status.mode=provider
> use.provider.staging=true
> provider.staging.pin.swiftfiles=false
> foreach.max.threads=100
> provenance.log=false
>
> ===================================================================
>
> ------------------------------
> *From:* Jonathan Monette
> *To:* Ketan Maheshwari
> *Cc:* Emalayan Vairavanathan ; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 11:08 AM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
> So I have run the scripts with some of my own test cases and do not see
> them failing. Could you provide your config files? Please provide the tc,
> sites, and config file (if you use a config file).
>
> On Jan 20, 2012, at 9:39 AM, Ketan Maheshwari wrote:
>
> Emalayan,
>
> I would check all the mappers and the resulting paths in the Swift source.
>
> Also try running the failed job something like this:
>
> cd /SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>
> mConcatFit _concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5 fits.tbl stat_dir
>
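The manual re-run suggested above can be scripted so that the exit code and both output streams are captured in one place. A minimal sketch, assuming the job directory still exists under the work directory and that the Montage binary is on the PATH (or given by full path); `rerun_job` is a hypothetical helper:

```python
import os
import subprocess


def rerun_job(job_dir, argv):
    """Re-run one job's command inside its Swift job directory.

    Returns (exit_code, stdout, stderr). Raises FileNotFoundError if the
    job directory is missing (itself a useful diagnostic) or if the
    executable in argv[0] cannot be found.
    """
    if not os.path.isdir(job_dir):
        raise FileNotFoundError(f"job directory not found: {job_dir}")
    result = subprocess.run(argv, cwd=job_dir, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Example with the names from this thread (prefix the job directory with your workdir):
# code, out, err = rerun_job(
#     "SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk",
#     ["mConcatFit", "_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5",
#      "fits.tbl", "stat_dir"])
```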
> Error 520 indicates that the workers are not able to reach the data.
>
> Also check if swift.workdir is writable on the site by the worker nodes.
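Writability has to be checked from the worker nodes themselves, not from the submit host. A minimal sketch to run on each node (for example over ssh), assuming the workdir path is passed on the command line; note that it creates the directory if it does not exist yet. `workdir_is_writable` is a hypothetical helper:

```python
import os
import socket
import sys
import tempfile


def workdir_is_writable(path):
    """Return True if this host can create the directory and write a file in it."""
    try:
        os.makedirs(path, exist_ok=True)  # creates the directory if it is missing
        # The probe file is deleted automatically when the 'with' block exits.
        with tempfile.NamedTemporaryFile(dir=path, prefix="probe-") as probe:
            probe.write(b"ok")
            probe.flush()
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # Pass the workdir from sites.xml, e.g. /tmp/swift.workdir; defaults to "."
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    status = "writable" if workdir_is_writable(target) else "NOT writable"
    print(f"{socket.gethostname()}: {target} is {status}")
```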
>
> On Thu, Jan 19, 2012 at 7:55 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Hi Ketan,
>
> This was with swift-0.92.1. Now I have downloaded the latest Swift 0.93
> and am getting totally different error messages. I can ask
> Jon about these messages. (These scripts were working well with plain Swift.)
>
> Please let me know if you have any idea.
>
> Regards
> Emalayan
>
>
> ===============================================================================================
> Swift 0.93 swift-r5501 cog-r3350
>
> RunID: 20120119-1749-rjshh1r9
> (input): found 10 files
> Progress: time: Thu, 19 Jan 2012 17:49:20 -0800
> Find: http://localhost:1984
> Find: keepalive(120), reconnect - http://localhost:1984
> Progress: time: Thu, 19 Jan 2012 17:49:22 -0800 Stage in:1 Submitted:9
> Progress: time: Thu, 19 Jan 2012 17:49:25 -0800 Active:9 Stage out:1
> Progress: time: Thu, 19 Jan 2012 17:49:26 -0800 Stage out:3 Finished successfully:7
> Progress: time: Thu, 19 Jan 2012 17:49:28 -0800 Active:1 Finished successfully:10
> Progress: time: Thu, 19 Jan 2012 17:49:29 -0800 Stage in:1 Submitting:11 Submitted:6 Finished successfully:12
> Progress: time: Thu, 19 Jan 2012 17:49:30 -0800 Stage in:4 Submitted:1 Active:6 Stage out:2 Finished successfully:17
> Progress: time: Thu, 19 Jan 2012 17:49:31 -0800 Active:1 Finished successfully:30
> Exception in mConcatFit:
> Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5, fits.tbl, stat_dir]
> Host: localhost
> Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
> - - -
>
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
> Execution failed:
> back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
>
> ------------------------------
> *From:* Ketan Maheshwari
> *To:* Emalayan Vairavanathan
> *Cc:* swift user
> *Sent:* Thursday, 19 January 2012 4:49 PM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>
> From your symptoms, it seems you are facing the same issue as I've been.
> Could you tell me more about the amount of data that needs to be staged to run
> the Montage stages during which these warnings turn up? How much time
> elapses after the start of your workflow before you see these messages?
>
> Also, what version of Swift is this?
>
> Regards,
> Ketan
>
> On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Dear All,
>
> I have a problem running Montage with Coasters (on our local cluster,
> with no batch scheduler). After a few stages, the Swift runtime continuously
> prints the warnings below. Any ideas? Should I increase the heartbeat
> count?
>
> Everything works fine when I try to run the same montage-scripts with
> swift on a single machine.
>
> Thank you
> Emalayan
>
>
> 2012-01-19 15:38:09,207-0800 WARN Command Command(119, HEARTBEAT): handling reply timeout; sendReqTime=120119-153609.206, sendTime=120119-153609.206, now=120119-153809.207
> 2012-01-19 15:38:09,207-0800 INFO Command Command(119, HEARTBEAT): re-sending
> 2012-01-19 15:38:09,209-0800 WARN Command Command(119, HEARTBEAT)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>         at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>         at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>         at java.util.TimerThread.mainLoop(Timer.java:534)
>         at java.util.TimerThread.run(Timer.java:484)
>
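The sendTime and now fields in the first WARN line show how long the heartbeat went unanswered before the timeout fired. A short Python sketch, assuming the values follow a yymmdd-HHMMSS.mmm layout, which is what 120119-153609.206 and 120119-153809.207 appear to use; `reply_wait_seconds` is a hypothetical helper:

```python
import re
from datetime import datetime

# Assumed layout of the sendTime/now fields: yymmdd-HHMMSS.mmm
STAMP_FORMAT = "%y%m%d-%H%M%S.%f"


def reply_wait_seconds(log_line):
    """Seconds between sendTime and now in a 'handling reply timeout' log line."""
    match = re.search(r"sendTime=([\d.-]+),\s*now=([\d.-]+)", log_line)
    if match is None:
        raise ValueError("no sendTime/now pair found")
    sent, now = (datetime.strptime(s, STAMP_FORMAT) for s in match.groups())
    return (now - sent).total_seconds()


line = ("2012-01-19 15:38:09,207-0800 WARN Command Command(119, HEARTBEAT): "
        "handling reply timeout; sendReqTime=120119-153609.206, "
        "sendTime=120119-153609.206, now=120119-153809.207")
print(reply_wait_seconds(line))  # 120.001
```

The wait is two minutes. The same number appears in the keepalive(120) line of the run output quoted in this thread, although the thread does not confirm that both come from the same setting.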
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>
>
>
>
> --
> Ketan
>
> --
> Ketan
>
> --
> Ketan
>
> --
> Ketan
>
--
Ketan
From hategan at mcs.anl.gov Mon Jan 23 15:13:58 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 23 Jan 2012 13:13:58 -0800
Subject: [Swift-user] GSISSH in Swift [was Fwd: [Swift-devel]
Documentation of sites.xml]
In-Reply-To: <33E27556-DD8E-407A-9610-94F8DA1AAD70@mcs.anl.gov>
References: <903303148.134563.1326323490348.JavaMail.root@zimbra.anl.gov>
<1326483206.31692.2.camel@blabla>
<980E744D-AA8E-4014-90B1-701A7D03F421@mcs.anl.gov>
<1327197887.7405.2.camel@blabla>
<33E27556-DD8E-407A-9610-94F8DA1AAD70@mcs.anl.gov>
Message-ID: <1327353238.24069.3.camel@blabla>
On Sat, 2012-01-21 at 20:20 -0600, Jonathan Monette wrote:
> So we use the etc./provider-sscl.properties file instead of
> an .ssh/config file?
No. That's strictly for the ssh executable.
> I ask this because it looks like the code that allows you to specify
> a username and private key file in the sites file is still there. I
> know from past testing that at least specifying a username in the
> sites file was not being honored.
>
Right. Not when starting coasters. Jobs run through the ssh-cl provider
alone should honor those.
From jonmon at mcs.anl.gov Mon Jan 23 15:48:22 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 23 Jan 2012 15:48:22 -0600
Subject: [Swift-user] GSISSH in Swift [was Fwd: [Swift-devel]
Documentation of sites.xml]
In-Reply-To: <1327353238.24069.3.camel@blabla>
References: <903303148.134563.1326323490348.JavaMail.root@zimbra.anl.gov>
<1326483206.31692.2.camel@blabla>
<980E744D-AA8E-4014-90B1-701A7D03F421@mcs.anl.gov>
<1327197887.7405.2.camel@blabla>
<33E27556-DD8E-407A-9610-94F8DA1AAD70@mcs.anl.gov>
<1327353238.24069.3.camel@blabla>
Message-ID:
On Jan 23, 2012, at 3:13 PM, Mihael Hategan wrote:
> On Sat, 2012-01-21 at 20:20 -0600, Jonathan Monette wrote:
>> So we use the etc./provider-sscl.properties file instead of
>> an .ssh/config file?
>
> No. That's strictly for the ssh executable.
>
>> I ask this because it looks like the code that allows you to specify
>> a username and private key file in the sites file is still there. I
>> know from past testing that at least specifying a username in the
>> sites file was not being honored.
>>
>
> Right. Not when starting coasters. Jobs run through the ssh-cl provider
> alone should honor those.
>
So even if ssh-cl is going to start coasters, wouldn't you still want to honor those variables?
What if the machine on which I am using sshcl to start coasters requires a different username than the default one? I know you can use a .ssh/config file for that, but then why even have the option to specify it in Swift?
>
From ketancmaheshwari at gmail.com Mon Jan 23 16:28:46 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 23 Jan 2012 16:28:46 -0600
Subject: [Swift-user] Montage+Swift+Coasters
In-Reply-To: <1327350350.29451.YahooMailNeo@web39501.mail.mud.yahoo.com>
References: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
<1327024520.62801.YahooMailNeo@web39505.mail.mud.yahoo.com>
<4D008EF1-8072-4A18-8A5A-18D74A2ACDA2@mcs.anl.gov>
<1327346756.49322.YahooMailNeo@web39506.mail.mud.yahoo.com>
<1327348246.89718.YahooMailNeo@web39507.mail.mud.yahoo.com>
<1327350350.29451.YahooMailNeo@web39501.mail.mud.yahoo.com>
Message-ID:
On Mon, Jan 23, 2012 at 2:25 PM, Emalayan Vairavanathan <
svemalayan at yahoo.com> wrote:
> I am using swift-0.93. I started only the coaster-service manually, using the
> following command (workers were started automatically).
>
Are you aware that workers will start automatically *only* on the localhost
where the service is running and not on the remote nodes?
>
> coaster-service -port 1984 -localport 35753 -nosec
>
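Workers started by hand on other nodes have to connect back to this service, so a quick TCP check from each such node can rule out firewall or host-name problems. A minimal sketch; which of the two ports (1984 from -port, 35753 from -localport) a worker actually contacts is an assumption to confirm in the coaster documentation for your Swift version, and `port_reachable` is a hypothetical helper:

```python
import socket
import sys


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timed out, unknown host, ...
        return False


if __name__ == "__main__":
    # Pass the host running coaster-service (e.g. node090); defaults to localhost.
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    for port in (1984, 35753):  # the -port and -localport values used above
        state = "reachable" if port_reachable(host, port) else "NOT reachable"
        print(f"{host}:{port} {state}")
```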
> Then the application prints the following output and terminates. (I have attached
> the log file to this mail. Please discard the previous log file because the
> system was not configured properly.)
>
> Please let me know if you need more information.
>
> Thank you
> Emalayan
>
>
> ====================================================================================
> Swift 0.93 swift-r5501 (swift modified locally) cog-r3350
>
> RunID: 20120123-1219-zj95uaye
> (input): found 10 files
> Progress: time: Mon, 23 Jan 2012 12:19:39 -0800
>
> Find: http://localhost:1984
> Find: keepalive(120), reconnect - http://localhost:1984
> Progress: time: Mon, 23 Jan 2012 12:19:41 -0800 Stage in:1 Submitted:9
> Progress: time: Mon, 23 Jan 2012 12:19:45 -0800 Active:9 Stage out:1
> Progress: time: Mon, 23 Jan 2012 12:19:46 -0800 Active:6 Stage out:2 Finished successfully:2
> Progress: time: Mon, 23 Jan 2012 12:19:47 -0800 Submitted:1 Finished successfully:10
> Progress: time: Mon, 23 Jan 2012 12:19:49 -0800 Active:1 Finished successfully:10
> Progress: time: Mon, 23 Jan 2012 12:19:50 -0800 Submitted:1 Finished successfully:12
> Progress: time: Mon, 23 Jan 2012 12:19:51 -0800 Stage in:12 Submitted:5 Finished successfully:13
> Progress: time: Mon, 23 Jan 2012 12:19:52 -0800 Stage in:1 Submitted:5 Active:9 Stage out:2 Finished successfully:13
> Progress: time: Mon, 23 Jan 2012 12:19:53 -0800 Active:5 Finished successfully:25
> Exception in mConcatFit:
> Arguments: [_concurrent/status_tbl-bf92dd4d-ecf0-490e-ab93-cf7863688950-5, fits.tbl, stat_dir]
> Host: localhost
> Directory: SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk
>
> - - -
>
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
> Execution failed:
> back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
> [emalayan at node090 scripts]$
>
>
> ------------------------------
> *From:* Ketan Maheshwari
> *To:* Emalayan Vairavanathan
> *Cc:* Jonathan Monette ; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 11:55 AM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> How are you starting the service? Are you starting workers manually? If
> yes, could you paste the command lines for both?
>
> On Mon, Jan 23, 2012 at 1:50 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Thanks Ketan and Jon. I tried, but it still gives an error. I have
> attached the log file.
>
> Thank you
> Emalayan
>
> ------------------------------
> *From:* Ketan Maheshwari
> *To:* Emalayan Vairavanathan
> *Cc:* Jonathan Monette ; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 11:36 AM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>
> Likely, /tmp is not readable/writable across the machines. Could you try
> changing the workdir to your /home directory?
>
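One way to test the guess that /tmp is not shared is to ask each node which filesystem holds the work directory: a node-local type such as tmpfs or ext4 means every node sees its own private copy of /tmp/swift.workdir. A minimal Linux-only sketch that parses the /proc/mounts layout; the set of shared filesystem types is an illustrative assumption, not an exhaustive list, and both functions are hypothetical helpers:

```python
import os

# Illustrative, not exhaustive: filesystem types that are usually shared between nodes.
SHARED_FS_TYPES = {"nfs", "nfs4", "lustre", "gpfs", "cifs", "glusterfs", "ceph"}


def mount_for(path, mounts_text):
    """Return (mount_point, fs_type) of the mount that contains path.

    mounts_text uses the /proc/mounts layout: device, mount point, type, ...
    Symlinks are not resolved and octal escapes (used for spaces) are not decoded.
    """
    path = os.path.abspath(path)
    best_point, best_type = "", ""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        point, fs_type = fields[1], fields[2]
        inside = path == point or path.startswith(point.rstrip("/") + "/")
        # Longest matching mount point wins; on ties the later entry is mounted on top.
        if inside and len(point) >= len(best_point):
            best_point, best_type = point, fs_type
    return best_point, best_type


def looks_shared(path):
    """True if path lives on one of SHARED_FS_TYPES according to /proc/mounts."""
    with open("/proc/mounts") as f:
        _, fs_type = mount_for(path, f.read())
    return fs_type in SHARED_FS_TYPES
```

If looks_shared("/tmp/swift.workdir") returns False on the worker nodes while a directory under /home reports a network type, that would be consistent with moving the workdir under /home as suggested.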
> On Mon, Jan 23, 2012 at 1:25 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Jon,
>
> Please find the detail below and let me know if you have any questions
> about my setup.
>
> Thank you
> Emalayan
>
> ==========================================================
> site.xml
>
>
>
> jobmanager="local:local"/>
> passive
>
> 4
> 100000
> 100
> 100
> 100
> 1
> 10
> 25.00
> 10000
> proxy
>
> /tmp/swift.workdir
>
>
>
> =======================================================
>
> tc
>
> localhost sh /bin/sh null null null
> localhost cat /bin/cat null null null
> localhost echo /bin/echo null null null
> localhost do_merge /home/emalayan/App/forEmalayan/app/modmerge null null null
> localhost mProjExec /home/emalayan/App/Montage_v3.3/bin/mProjExec null null null
> localhost mImgtbl /home/emalayan/App/Montage_v3.3/bin/mImgtbl null null null
> localhost mAdd /home/emalayan/App/Montage_v3.3/bin/mAdd null null null
> localhost mOverlaps /home/emalayan/App/Montage_v3.3/bin/mOverlaps null null null
> localhost mJPEG /home/emalayan/App/Montage_v3.3/bin/mJPEG null null null
> localhost mDiffExec_wrap /home/emalayan/App/Montage_v3.3/bin/mDiffExec null null null
> localhost mFitExec /home/emalayan/App/Montage_v3.3/bin/mFitExec null null null
> localhost mBgModel /home/emalayan/App/Montage_v3.3/bin/mBgModel null null null
> localhost mBgExec /home/emalayan/App/Montage_v3.3/bin/mBgExec null null null
> localhost mConcatFit /home/emalayan/App/Montage_v3.3/bin/mConcatFit null null nul
>
> localhost Background_list /home/emalayan/App/montage-swift/SwiftMontage/apps/Background_list.py null null null
> localhost create_status_table /home/emalayan/App/montage-swift/SwiftMontage/apps/create_status_table.py null null null
> localhost mProjectPP_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProjectPP_wrap.py null null null
> localhost mProject_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProject_wrap.py null null null
> localhost mBackground_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mBackground_wrap.py null null null
> localhost mDiffFit_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mDiffFit_wrap.py null null null
>
> =================================================================
>
> cf
>
> wrapperlog.always.transfer=true
> sitedir.keep=true
> execution.retries=1
> lazy.errors=true
> status.mode=provider
> use.provider.staging=true
> provider.staging.pin.swiftfiles=false
> foreach.max.threads=100
> provenance.log=false
>
> ===================================================================
>
--
Ketan
From svemalayan at yahoo.com Mon Jan 23 16:52:38 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Mon, 23 Jan 2012 14:52:38 -0800 (PST)
Subject: [Swift-user] Montage+Swift+Coasters
In-Reply-To:
References: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
<1327024520.62801.YahooMailNeo@web39505.mail.mud.yahoo.com>
<4D008EF1-8072-4A18-8A5A-18D74A2ACDA2@mcs.anl.gov>
<1327346756.49322.YahooMailNeo@web39506.mail.mud.yahoo.com>
<1327348246.89718.YahooMailNeo@web39507.mail.mud.yahoo.com>
<1327350350.29451.YahooMailNeo@web39501.mail.mud.yahoo.com>
Message-ID: <1327359158.12794.YahooMailNeo@web39505.mail.mud.yahoo.com>
Hi Ketan,
Please find my answers below.
[Ketan] Emalayan, could you also send your Swift source?
[Emalayan] Did you ask for the Montage Swift scripts or the swift-0.93 source code?
[Ketan] Have you tried running mConcatFit from within the SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk directory?
[Emalayan] No such directory was created.
[Ketan] Are you aware that workers will start automatically *only* on the
localhost where the service is running and not on the remote nodes?
[Emalayan] Yes, I am aware of this. I ran both the coaster-service and the application scripts on the same node, but I would also like to know how to set up workers on other nodes.
Thank you
Emalayan
________________________________
From: Ketan Maheshwari
To: Emalayan Vairavanathan
Cc: Jonathan Monette ; swift user
Sent: Monday, 23 January 2012 12:57 PM
Subject: Re: [Swift-user] Montage+Swift+Coasters
Emalayan, could you also send your Swift source?
Have you tried running mConcatFit from within the SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk directory?
On Mon, Jan 23, 2012 at 2:25 PM, Emalayan Vairavanathan wrote:
>I am using swift-0.93. I started only the coaster-service manually, using the following command (workers were started automatically).
>
>
>coaster-service -port 1984 -localport 35753 -nosec
>
>
>Then the application prints the following output and terminates. (I have attached the log file to this mail. Please discard the previous log file because the system was not configured properly.)
>
>
>Please let me know if you need more information.
>
>
>Thank you
>Emalayan
>
>
>
>====================================================================================
>
>Swift 0.93 swift-r5501 (swift modified locally) cog-r3350
>
>RunID: 20120123-1219-zj95uaye
>(input): found 10 files
>Progress: time: Mon, 23 Jan 2012 12:19:39 -0800
>
>Find: http://localhost:1984
>Find: keepalive(120), reconnect - http://localhost:1984
>Progress: time: Mon, 23 Jan 2012 12:19:41 -0800 Stage in:1 Submitted:9
>Progress: time: Mon, 23 Jan 2012 12:19:45 -0800 Active:9 Stage out:1
>Progress: time: Mon, 23 Jan 2012 12:19:46 -0800 Active:6 Stage out:2 Finished successfully:2
>Progress: time: Mon, 23 Jan 2012 12:19:47 -0800 Submitted:1 Finished successfully:10
>Progress: time: Mon, 23 Jan 2012 12:19:49 -0800 Active:1 Finished successfully:10
>Progress: time: Mon, 23 Jan 2012 12:19:50 -0800 Submitted:1 Finished successfully:12
>Progress: time: Mon, 23 Jan 2012 12:19:51 -0800 Stage in:12 Submitted:5 Finished successfully:13
>Progress: time: Mon, 23 Jan 2012 12:19:52 -0800 Stage in:1 Submitted:5 Active:9 Stage out:2 Finished successfully:13
>Progress: time: Mon, 23 Jan 2012 12:19:53 -0800 Active:5 Finished successfully:25
>Exception in mConcatFit:
>Arguments: [_concurrent/status_tbl-bf92dd4d-ecf0-490e-ab93-cf7863688950-5, fits.tbl, stat_dir]
>Host: localhost
>Directory: SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk
>
>- - -
>
>Caused by: null
>Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
>Execution failed:
>    back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
>[emalayan at node090 scripts]$
>
>
>
>
>
>________________________________
> From: Ketan Maheshwari
>To: Emalayan Vairavanathan
>Cc: Jonathan Monette ; swift user
>Sent: Monday, 23 January 2012 11:55 AM
>Subject: Re: [Swift-user] Montage+Swift+Coasters
>
>
>
>How are you starting the service? Are you starting workers manually? if yes, could you paste commandlines for both?
>
>
>On Mon, Jan 23, 2012 at 1:50 PM, Emalayan Vairavanathan wrote:
>
>Thanks Ketan and Jon. I tried but it is still giving error. I have attached the log file.
>>
>>
>>Thank you
>>Emalayan
>>
>>
>>
>>
>>________________________________
>> From: Ketan Maheshwari
>>To: Emalayan Vairavanathan
>>Cc: Jonathan Monette ; swift user
>>Sent: Monday, 23 January 2012 11:36 AM
>>Subject: Re: [Swift-user] Montage+Swift+Coasters
>>
>>
>>
>>Emalayan,
>>
>>
>>Likely, /tmp is not readable/writable across the machines. Could you try changing workdir to your /home
>>
>>
>>On Mon, Jan 23, 2012 at 1:25 PM, Emalayan Vairavanathan wrote:
>>
>>Jon,
>>>
>>>
>>>Please find the detail below and let me know if you have any questions about my setup.
>>>
>>>
>>>
>>>Thank you
>>>Emalayan
>>>
>>>
>>>
>>>==========================================================
>>>
>>>site.xml
>>>
>>>
>>>
>>>
>>>???
>>>??? passive
>>>
>>>??? 4
>>>??? 100000
>>>??? 100
>>>??? 100
>>>??? 100
>>>??? 1
>>>??? 10
>>>??? 25.00
>>>??? 10000
>>>??? proxy
>>>???
>>>??? /tmp/swift.workdir
>>>?
>>>
>>>
>>>
>>>
>>>=======================================================
>>>
>>>
>>>tc
>>>
>>>
>>>localhost sh /bin/sh null null null
>>>localhost cat /bin/cat null null null
>>>localhost echo /bin/echo null null null
>>>localhost do_merge /home/emalayan/App/forEmalayan/app/modmerge null null null
>>>localhost mProjExec /home/emalayan/App/Montage_v3.3/bin/mProjExec null null null
>>>localhost mImgtbl /home/emalayan/App/Montage_v3.3/bin/mImgtbl null null null
>>>localhost mAdd /home/emalayan/App/Montage_v3.3/bin/mAdd null null null
>>>localhost mOverlaps /home/emalayan/App/Montage_v3.3/bin/mOverlaps null null null
>>>localhost mJPEG /home/emalayan/App/Montage_v3.3/bin/mJPEG null null null
>>>localhost mDiffExec_wrap
/home/emalayan/App/Montage_v3.3/bin/mDiffExec null null null
>>>localhost mFitExec /home/emalayan/App/Montage_v3.3/bin/mFitExec null null null
>>>localhost mBgModel /home/emalayan/App/Montage_v3.3/bin/mBgModel null null null
>>>localhost mBgExec /home/emalayan/App/Montage_v3.3/bin/mBgExec null null null
>>>localhost mConcatFit /home/emalayan/App/Montage_v3.3/bin/mConcatFit null null nul
>>>
>>>localhost Background_list /home/emalayan/App/montage-swift/SwiftMontage/apps/Background_list.py null null null
>>>localhost create_status_table /home/emalayan/App/montage-swift/SwiftMontage/apps/create_status_table.py null null null
>>>localhost mProjectPP_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProjectPP_wrap.py null null null
>>>localhost mProject_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProject_wrap.py null null null
>>>localhost mBackground_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mBackground_wrap.py null null
null
>>>localhost mDiffFit_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mDiffFit_wrap.py null null null
>>>
>>>
>>>=================================================================
>>>
>>>
>>>cf
>>>
>>>
>>>wrapperlog.always.transfer=true
>>>sitedir.keep=true
>>>execution.retries=1
>>>lazy.errors=true
>>>status.mode=provider
>>>use.provider.staging=true
>>>provider.staging.pin.swiftfiles=false
>>>foreach.max.threads=100
>>>provenance.log=false
>>>
>>>
>>>===================================================================
>>>
>>>
>>>
>>>
>>>________________________________
>>> From: Jonathan Monette
>>>To: Ketan Maheshwari
>>>Cc: Emalayan Vairavanathan ; swift user
>>>Sent: Monday, 23 January 2012 11:08 AM
>>>Subject: Re: [Swift-user] Montage+Swift+Coasters
>>>
>>>
>>>
>>>Emalayan,
>>>? ?So I have ran the scripts with some of my own test cases and do not see it failing. ?Could you provide your config files? ?Please provide the tc, sites, and config file(if you use a config file).
>>>
>>>
>>>On Jan 20, 2012, at 9:39 AM, Ketan Maheshwari wrote:
>>>
>>>Emalayan,
>>>>
>>>>
>>>>I would check all the mappers and the resulting paths in the Swift source.?
>>>>
>>>>
>>>>Also try running the failed job something like this:?
>>>>
>>>>
>>>>cd /SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>>>>
>>>>
>>>>mConcatFit?_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5 fits.tbl stat_dir
>>>>
>>>>
>>>>error 520 indicates workers are not able to reach the data.
>>>>
>>>>
>>>>Also check if swift.workdir is writable on the site by the worker nodes.
>>>>
>>>>
>>>>On Thu, Jan 19, 2012 at 7:55 PM, Emalayan Vairavanathan wrote:
>>>>
>>>>Hi Ketan,
>>>>>
>>>>>
>>>>>This was with swift-0.92.1.Now I have downloaded the latest swift 0.93 and getting totally different error messages with swift 0.93. I can ask Jon about these messages. (These scripts was working well with only Swift)
>>>>>
>>>>>
>>>>>
>>>>>Please let me know if you have any idea.
>>>>>
>>>>>
>>>>>
>>>>>Regards
>>>>>Emalayan
>>>>>
>>>>>
>>>>>
>>>>>===============================================================================================
>>>>>
>>>>>Swift 0.93 swift-r5501 cog-r3350
>>>>>
>>>>>RunID: 20120119-1749-rjshh1r9
>>>>>?(input): found 10 files
>>>>>Progress:? time: Thu, 19 Jan 2012 17:49:20 -0800
>>>>>Find: http://localhost:1984
>>>>>Find:? keepalive(120), reconnect - http://localhost:1984
>>>>>Progress:? time: Thu, 19 Jan 2012 17:49:22 -0800? Stage in:1? Submitted:9
>>>>>Progress:? time: Thu, 19 Jan 2012 17:49:25 -0800? Active:9? Stage out:1
>>>>>Progress:? time: Thu, 19 Jan 2012 17:49:26 -0800? Stage out:3? Finished successfully:7
>>>>>Progress:? time: Thu, 19 Jan 2012 17:49:28 -0800? Active:1? Finished successfully:10
>>>>>Progress:? time: Thu, 19 Jan 2012 17:49:29 -0800? Stage in:1? Submitting:11? Submitted:6? Finished successfully:12
>>>>>Progress:? time: Thu, 19 Jan 2012 17:49:30 -0800? Stage in:4? Submitted:1? Active:6? Stage out:2? Finished successfully:17
>>>>>Progress:? time: Thu, 19 Jan 2012 17:49:31 -0800? Active:1? Finished successfully:30
>>>>>Exception in mConcatFit:
>>>>>Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5, fits.tbl, stat_dir]
>>>>>Host: localhost
>>>>>Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>>>>>- - -
>>>>>
>>>>>Caused by: null
>>>>>Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
>>>>>Execution failed:
>>>>>??? back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>________________________________
>>>>> From: Ketan Maheshwari
>>>>>To: Emalayan Vairavanathan
>>>>>Cc: swift user
>>>>>Sent: Thursday, 19 January 2012 4:49 PM
>>>>>Subject: Re: [Swift-user] Montage+Swift+Coasters
>>>>>
>>>>>
>>>>>
>>>>>Emalayan,
>>>>>
>>>>>
>>>>>From your symptoms, it seems you are facing the same issue as I've been. Could you tell more about the amount of data that needs to be staged to run the Montage stages during which these warnings turn up? How much time elapses since the start of your workflow after which you see these messages?
>>>>>
>>>>>Also, what version of Swift is this?
>>>>>
>>>>>
>>>>>Regards,
>>>>>Ketan
>>>>>
>>>>>
>>>>>On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan wrote:
>>>>>
>>>>>Dear All,
>>>>>>
>>>>>>
>>>>>>I have a problem in running Montage with Coasters (in our local cluster - no batch schedulers). After few stages the swift run-time continuously prints the warnings below. Any ideas ? Should I increase the heartbeat count ?
>>>>>>
>>>>>>
>>>>>>
>>>>>>Everything works fine when I try to run the same montage-scripts with swift on a single machine.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Thank you
>>>>>>Emalayan
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>2012-01-19 15:38:09,207-0800 WARN Command Command(119, HEARTBEAT): handling reply timeout; sendReqTime=120119-153609.206, sendTime=120119-153609.206, now=120119-153809.207
>>>>>>2012-01-19 15:38:09,207-0800 INFO Command Command(119, HEARTBEAT): re-sending
>>>>>>2012-01-19 15:38:09,209-0800 WARN Command Command(119, HEARTBEAT)fault was: Reply timeout
>>>>>>org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>>>>>>        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>>>>>>        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>>>>>>        at java.util.TimerThread.mainLoop(Timer.java:534)
>>>>>>        at java.util.TimerThread.run(Timer.java:484)
>>>>>>_______________________________________________
>>>>>>Swift-user mailing list
>>>>>>Swift-user at ci.uchicago.edu
>>>>>>https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>--
>>>>>Ketan
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>--
>>>>Ketan
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>--
>>Ketan
>>
>>
>>
>>
>>
>
>
>
>--
>Ketan
>
>
>
>
>
--
Ketan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From ketancmaheshwari at gmail.com Mon Jan 23 20:38:38 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 23 Jan 2012 20:38:38 -0600
Subject: [Swift-user] Montage+Swift+Coasters
In-Reply-To: <1327359158.12794.YahooMailNeo@web39505.mail.mud.yahoo.com>
References: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
<1327024520.62801.YahooMailNeo@web39505.mail.mud.yahoo.com>
<4D008EF1-8072-4A18-8A5A-18D74A2ACDA2@mcs.anl.gov>
<1327346756.49322.YahooMailNeo@web39506.mail.mud.yahoo.com>
<1327348246.89718.YahooMailNeo@web39507.mail.mud.yahoo.com>
<1327350350.29451.YahooMailNeo@web39501.mail.mud.yahoo.com>
<1327359158.12794.YahooMailNeo@web39505.mail.mud.yahoo.com>
Message-ID:
On Mon, Jan 23, 2012 at 4:52 PM, Emalayan Vairavanathan <
svemalayan at yahoo.com> wrote:
> Hi Ketan,
>
> Please find my answers below.
>
> [Ketan] Emalayan, could you also send your Swift source?
> [Emalayan] Did you ask for the Montage Swift scripts or the swift-0.93
> source code?
>
The Montage script.
>
>
> [Ketan] Have you tried running mConcatFit from within the
> SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk directory?
> [Emalayan] No such directory was created.
>
It should be in your workdir.
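A minimal sketch for locating a kept job directory, assuming the layout shown in the error message (<workdir>/<run directory>/jobs/<x>/<job>-<id>) and that sitedir.keep=true is set, as in the cf file in this thread. The function name find_job_dir is hypothetical:

```sh
#!/bin/sh
# find_job_dir WORKDIR RUNID JOBNAME
# Lists kept job directories for JOBNAME under WORKDIR whose path contains
# RUNID, e.g. .../SwiftMontage-RUNID/jobs/4/mConcatFit-4o2fb2mk
find_job_dir() {
  # -path matches against the whole path ('/' included); -name matches
  # only the last path component.
  find "$1" -type d -path "*$2*/jobs/*" -name "$3-*" 2>/dev/null
}
```

For example, find_job_dir /tmp/swift.workdir 20120123-1219-zj95uaye mConcatFit searches the workdirectory from the site.xml in this thread. Note that it only sees directories on the machine where it runs.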
>
>
> [Ketan] Are you aware that workers will start automatically *only* on the
> localhost where the service is running and not on the remote nodes?
> [Emalayan] Yes, I am aware of this. I ran both the coaster-service and the
> application scripts on the same node. But I would like to know about
> setting up workers on other nodes too.
>
You may run worker.pl manually, or better, put it in a for loop in a simple
shell script to run multiple workers. The command line is something like:
worker.pl label /path/to/log
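Such a loop could look roughly like the sketch below. It assumes worker.pl is on PATH and accepts the argument form given above ("worker.pl label /path/to/log"); since that is only "something like" the real command line, check worker.pl's own usage message first. Any extra leading arguments your version expects (for example the service URL, if it needs one) can be passed through as ARGS. launch_workers and the worker<N> labels are hypothetical names:

```sh
#!/bin/sh
# launch_workers N LOGDIR [WORKER [ARGS...]]
# Starts N workers in the background, each invoked as
#   WORKER ARGS... worker<i> LOGDIR
# and waits until all of them have exited.
launch_workers() {
  lw_n=$1
  lw_logdir=$2
  shift 2
  lw_worker=${1:-worker.pl}        # assumption: worker.pl is on PATH
  if [ $# -gt 0 ]; then shift; fi  # anything left over is passed as ARGS
  mkdir -p "$lw_logdir" || return 1
  lw_i=1
  while [ "$lw_i" -le "$lw_n" ]; do
    # One background process per worker; its console output is kept
    # next to the worker's log directory.
    "$lw_worker" "$@" "worker$lw_i" "$lw_logdir" > "$lw_logdir/worker$lw_i.out" 2>&1 &
    lw_i=$((lw_i + 1))
  done
  wait  # returns once every background worker has exited
}
```

Run it on each node that should host workers; launch_workers 4 "$HOME/workerlogs" would start four workers labelled worker1 to worker4.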
>
> Thank you
> Emalayan
>
> ------------------------------
> *From:* Ketan Maheshwari
> *To:* Emalayan Vairavanathan
> *Cc:* Jonathan Monette ; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 12:57 PM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan, could you also send your Swift source?
>
> Have you tried running mConcatFit from within the
> SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk directory?
>
> On Mon, Jan 23, 2012 at 2:25 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> I am using swift-0.93. I started only the coaster-service manually, using the
> following command (workers were started automatically).
>
> coaster-service -port 1984 -localport 35753 -nosec
>
> The application then prints the following output and terminates. (I have
> attached the log file to this mail. Please discard the previous log file
> because the system was not configured properly.)
>
> Please let me know if you need more information.
>
> Thank you
> Emalayan
>
>
> ====================================================================================
> Swift 0.93 swift-r5501 (swift modified locally) cog-r3350
>
> RunID: 20120123-1219-zj95uaye
> (input): found 10 files
> Progress: time: Mon, 23 Jan 2012 12:19:39 -0800
>
> Find: http://localhost:1984
> Find: keepalive(120), reconnect - http://localhost:1984
> Progress: time: Mon, 23 Jan 2012 12:19:41 -0800 Stage in:1 Submitted:9
> Progress: time: Mon, 23 Jan 2012 12:19:45 -0800 Active:9 Stage out:1
> Progress: time: Mon, 23 Jan 2012 12:19:46 -0800 Active:6 Stage out:2 Finished successfully:2
> Progress: time: Mon, 23 Jan 2012 12:19:47 -0800 Submitted:1 Finished successfully:10
> Progress: time: Mon, 23 Jan 2012 12:19:49 -0800 Active:1 Finished successfully:10
> Progress: time: Mon, 23 Jan 2012 12:19:50 -0800 Submitted:1 Finished successfully:12
> Progress: time: Mon, 23 Jan 2012 12:19:51 -0800 Stage in:12 Submitted:5 Finished successfully:13
> Progress: time: Mon, 23 Jan 2012 12:19:52 -0800 Stage in:1 Submitted:5 Active:9 Stage out:2 Finished successfully:13
> Progress: time: Mon, 23 Jan 2012 12:19:53 -0800 Active:5 Finished successfully:25
> Exception in mConcatFit:
> Arguments: [_concurrent/status_tbl-bf92dd4d-ecf0-490e-ab93-cf7863688950-5, fits.tbl, stat_dir]
> Host: localhost
> Directory: SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk
>
> - - -
>
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
> Execution failed:
> back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
> [emalayan at node090 scripts]$
> [emalayan at node090 scripts]$
>
>
> ------------------------------
> *From:* Ketan Maheshwari
> *To:* Emalayan Vairavanathan
> *Cc:* Jonathan Monette ; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 11:55 AM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> How are you starting the service? Are you starting workers manually? If
> yes, could you paste the command lines for both?
>
> On Mon, Jan 23, 2012 at 1:50 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Thanks Ketan and Jon. I tried, but it is still giving an error. I have
> attached the log file.
>
> Thank you
> Emalayan
>
> ------------------------------
> *From:* Ketan Maheshwari
> *To:* Emalayan Vairavanathan
> *Cc:* Jonathan Monette ; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 11:36 AM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>
> Most likely, /tmp is not readable/writable across the machines. Could you
> try changing the workdir to your /home directory?
>
> On Mon, Jan 23, 2012 at 1:25 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Jon,
>
> Please find the details below and let me know if you have any questions
> about my setup.
>
> Thank you
> Emalayan
>
> ==========================================================
> site.xml
>
>
>
> jobmanager="local:local"/>
> passive
>
> 4
> 100000
> 100
> 100
> 100
> 1
> 10
> 25.00
> 10000
> proxy
>
> /tmp/swift.workdir
>
>
>
> =======================================================
>
> tc
>
> localhost sh /bin/sh null null null
> localhost cat /bin/cat null null null
> localhost echo /bin/echo null null null
> localhost do_merge /home/emalayan/App/forEmalayan/app/modmerge null null null
> localhost mProjExec /home/emalayan/App/Montage_v3.3/bin/mProjExec null null null
> localhost mImgtbl /home/emalayan/App/Montage_v3.3/bin/mImgtbl null null null
> localhost mAdd /home/emalayan/App/Montage_v3.3/bin/mAdd null null null
> localhost mOverlaps /home/emalayan/App/Montage_v3.3/bin/mOverlaps null null null
> localhost mJPEG /home/emalayan/App/Montage_v3.3/bin/mJPEG null null null
> localhost mDiffExec_wrap /home/emalayan/App/Montage_v3.3/bin/mDiffExec null null null
> localhost mFitExec /home/emalayan/App/Montage_v3.3/bin/mFitExec null null null
> localhost mBgModel /home/emalayan/App/Montage_v3.3/bin/mBgModel null null null
> localhost mBgExec /home/emalayan/App/Montage_v3.3/bin/mBgExec null null null
> localhost mConcatFit /home/emalayan/App/Montage_v3.3/bin/mConcatFit null null nul
>
> localhost Background_list /home/emalayan/App/montage-swift/SwiftMontage/apps/Background_list.py null null null
> localhost create_status_table /home/emalayan/App/montage-swift/SwiftMontage/apps/create_status_table.py null null null
> localhost mProjectPP_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProjectPP_wrap.py null null null
> localhost mProject_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProject_wrap.py null null null
> localhost mBackground_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mBackground_wrap.py null null null
> localhost mDiffFit_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mDiffFit_wrap.py null null null
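A quick sanity check for a tc file like this one is to confirm that every listed executable exists and is executable on the node where the jobs run. A minimal sketch, assuming the whitespace-separated layout shown here (site, transformation name, path, then the remaining columns) with one entry per line; check_tc_paths is a hypothetical name:

```sh
#!/bin/sh
# check_tc_paths TCFILE
# Reports every tc entry whose path is missing or not executable on this
# node. Returns 1 if any entry fails, 0 otherwise.
check_tc_paths() {
  tc_rc=0
  # tc_rest absorbs the remaining columns (type, platform, profile).
  while read -r tc_site tc_name tc_path tc_rest; do
    # Skip blank lines and comment lines.
    [ -z "$tc_site" ] && continue
    case $tc_site in \#*) continue ;; esac
    if [ ! -x "$tc_path" ]; then
      echo "missing or not executable: $tc_name -> $tc_path"
      tc_rc=1
    fi
  done < "$1"
  return "$tc_rc"
}
```

Because the paths are resolved where the job executes, the check is most informative when run on a worker node rather than only on the submit host.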
>
> =================================================================
>
> cf
>
> wrapperlog.always.transfer=true
> sitedir.keep=true
> execution.retries=1
> lazy.errors=true
> status.mode=provider
> use.provider.staging=true
> provider.staging.pin.swiftfiles=false
> foreach.max.threads=100
> provenance.log=false
>
> ===================================================================
>
> ------------------------------
> *From:* Jonathan Monette
> *To:* Ketan Maheshwari
> *Cc:* Emalayan Vairavanathan ; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 11:08 AM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
> So I have run the scripts with some of my own test cases and do not see
> them failing. Could you provide your config files? Please provide the tc,
> sites, and config file (if you use a config file).
>
> On Jan 20, 2012, at 9:39 AM, Ketan Maheshwari wrote:
>
> Emalayan,
>
> I would check all the mappers and the resulting paths in the Swift source.
>
> Also, try running the failed job manually, something like this:
>
> cd /SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>
> mConcatFit _concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5 fits.tbl stat_dir
>
> Error 520 indicates that the workers are not able to reach the data.
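A minimal sketch of this manual rerun, assuming the job directory was kept (sitedir.keep=true). It runs a command inside a given directory and reports the exit code, so the result can be compared with the 520 that Swift reports. rerun_in_dir is a hypothetical name:

```sh
#!/bin/sh
# rerun_in_dir DIR COMMAND [ARGS...]
# Runs COMMAND inside DIR (in a subshell, so the caller's working
# directory is unchanged), prints its exit code and returns that code.
rerun_in_dir() {
  rid_dir=$1
  shift
  ( cd "$rid_dir" && "$@" )
  rid_rc=$?
  echo "exit code: $rid_rc"
  return "$rid_rc"
}
```

For this job it would be called as rerun_in_dir "$WORKDIR/SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk" mConcatFit _concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5 fits.tbl stat_dir, where WORKDIR is a placeholder for the site's work directory. If the command succeeds by hand, the application itself is probably fine and the problem is more likely in getting the data to the worker.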
>
> Also, check whether swift.workdir on the site is writable by the worker nodes.
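One way to check this is to run a small probe on each worker node. A minimal sketch, assuming a POSIX shell on the node; check_workdir is a hypothetical name, and /tmp/swift.workdir is simply the workdirectory from the site.xml in this thread:

```sh
#!/bin/sh
# check_workdir DIR
# Creates DIR if needed, then writes, reads back and removes a probe file.
# Returns 0 only if every step succeeds on the node where it runs.
check_workdir() {
  cw_dir=$1
  mkdir -p "$cw_dir" || return 1
  # Probe name uses $HOSTNAME (when the shell sets it) and the PID, so
  # probes from several nodes sharing one directory do not collide.
  cw_probe="$cw_dir/.probe.${HOSTNAME:-host}.$$"
  echo ok > "$cw_probe" || return 1
  if [ "$(cat "$cw_probe")" != ok ]; then
    rm -f "$cw_probe"
    return 1
  fi
  rm -f "$cw_probe"
}
```

Passing on a node only shows the directory is writable there. Since /tmp is usually local to each machine, files written to it on one node are generally not visible on the others.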
>
> On Thu, Jan 19, 2012 at 7:55 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Hi Ketan,
>
> This was with swift-0.92.1. I have now downloaded the latest Swift 0.93
> and am getting totally different error messages with it. I can ask
> Jon about these messages. (These scripts were working well with Swift alone.)
>
> Please let me know if you have any idea.
>
> Regards
> Emalayan
>
>
> ===============================================================================================
> Swift 0.93 swift-r5501 cog-r3350
>
> RunID: 20120119-1749-rjshh1r9
> (input): found 10 files
> Progress: time: Thu, 19 Jan 2012 17:49:20 -0800
> Find: http://localhost:1984
> Find: keepalive(120), reconnect - http://localhost:1984
> Progress: time: Thu, 19 Jan 2012 17:49:22 -0800 Stage in:1 Submitted:9
> Progress: time: Thu, 19 Jan 2012 17:49:25 -0800 Active:9 Stage out:1
> Progress: time: Thu, 19 Jan 2012 17:49:26 -0800 Stage out:3 Finished successfully:7
> Progress: time: Thu, 19 Jan 2012 17:49:28 -0800 Active:1 Finished successfully:10
> Progress: time: Thu, 19 Jan 2012 17:49:29 -0800 Stage in:1 Submitting:11 Submitted:6 Finished successfully:12
> Progress: time: Thu, 19 Jan 2012 17:49:30 -0800 Stage in:4 Submitted:1 Active:6 Stage out:2 Finished successfully:17
> Progress: time: Thu, 19 Jan 2012 17:49:31 -0800 Active:1 Finished successfully:30
> Exception in mConcatFit:
> Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5, fits.tbl, stat_dir]
> Host: localhost
> Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
> - - -
>
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
> Execution failed:
> back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
>
> ------------------------------
> *From:* Ketan Maheshwari
> *To:* Emalayan Vairavanathan
> *Cc:* swift user
> *Sent:* Thursday, 19 January 2012 4:49 PM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>
> From your symptoms, it seems you are facing the same issue as I've been.
> Could you tell me more about the amount of data that needs to be staged to
> run the Montage stages during which these warnings turn up? How much time
> elapses between the start of your workflow and the appearance of these messages?
>
> Also, what version of Swift is this?
>
> Regards,
> Ketan
>
> On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Dear All,
>
> I have a problem running Montage with Coasters (in our local cluster
> - no batch schedulers). After a few stages the Swift runtime continuously
> prints the warnings below. Any ideas? Should I increase the heartbeat
> count?
>
> Everything works fine when I try to run the same montage-scripts with
> swift on a single machine.
>
> Thank you
> Emalayan
>
>
> 2012-01-19 15:38:09,207-0800 WARN Command Command(119, HEARTBEAT): handling reply timeout; sendReqTime=120119-153609.206, sendTime=120119-153609.206, now=120119-153809.207
> 2012-01-19 15:38:09,207-0800 INFO Command Command(119, HEARTBEAT): re-sending
> 2012-01-19 15:38:09,209-0800 WARN Command Command(119, HEARTBEAT)fault was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>         at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>         at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>         at java.util.TimerThread.mainLoop(Timer.java:534)
>         at java.util.TimerThread.run(Timer.java:484)
>
>
>
>
>
> --
> Ketan
>
>
>
>
>
>
>
> --
> Ketan
>
>
>
>
>
>
>
>
>
> --
> Ketan
>
>
>
>
>
>
>
> --
> Ketan
>
>
>
>
>
>
>
> --
> Ketan
>
>
>
>
>
--
Ketan
From svemalayan at yahoo.com Mon Jan 23 21:21:02 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Mon, 23 Jan 2012 19:21:02 -0800 (PST)
Subject: [Swift-user] Montage+Swift+Coasters
In-Reply-To:
References: <1327017075.46246.YahooMailNeo@web39501.mail.mud.yahoo.com>
<1327024520.62801.YahooMailNeo@web39505.mail.mud.yahoo.com>