From benc at hawaga.org.uk Fri Aug 1 02:12:25 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Aug 2008 07:12:25 +0000 (GMT)
Subject: [Swift-user] Re: swift script calling procedure
In-Reply-To: <489202BE.8090606@mcs.anl.gov>
References: <4891F94F.6090004@uchicago.edu> <489202BE.8090606@mcs.anl.gov>
Message-ID:

On Thu, 31 Jul 2008, Michael Wilde wrote:

> Perhaps the code he tried was wrong or never worked, or perhaps the case of
> the function or the case checking rules changed between the time this last
> worked and now.

The checking rules are definitely stronger than they were before - this is likely from Milena's checks.

Whether Swift should be case sensitive on identifiers seems poorly defined; karajan will take either case, but the XML-related history would suggest that case should be significant (specifically a QName is case sensitive).

I'm not terribly fussed either way (though I'd probably go for case sensitivity to be more C/Java-like) but I think it would be good to be defined more than as accidents of the compile-time and runtime layer.

--

From wilde at mcs.anl.gov Fri Aug 1 06:37:31 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 01 Aug 2008 06:37:31 -0500
Subject: [Swift-user] Re: swift script calling procedure
In-Reply-To:
References: <4891F94F.6090004@uchicago.edu> <489202BE.8090606@mcs.anl.gov>
Message-ID: <4892F57B.2040406@mcs.anl.gov>

I vote in favor of case sensitivity.

On 8/1/08 2:12 AM, Ben Clifford wrote:
> On Thu, 31 Jul 2008, Michael Wilde wrote:
>
>> Perhaps the code he tried was wrong or never worked, or perhaps the case of
>> the function or the case checking rules changed between the time this last
>> worked and now.
>
> The checking rules are definitely stronger than they were before - this is
> likely from Milena's checks.
>
> Whether Swift should be case sensitive on identifiers seems poorly
> defined; karajan will take either case, but the XML-related history would
> suggest that case should be significant (specifically a QName is case
> sensitive).
>
> I'm not terribly fussed either way (though I'd probably go for case
> sensitivity to be more C/Java-like) but I think it would be good to be
> defined more than as accidents of the compile-time and runtime layer.
>

From zhaozhang at uchicago.edu Fri Aug 1 09:37:56 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 01 Aug 2008 09:37:56 -0500
Subject: [Swift-user] problem starting swift
Message-ID: <48931FC4.4050902@uchicago.edu>

Hi

I started swift to run 15352 jobs, then swift failed to start with this message:

Execution failed:
java.util.ConcurrentModificationException

The log file is at
http://www.ci.uchicago.edu/~zzhang/dock2-20080801-0923-menbb7zg.log

So I started the first 12000 tasks; that is OK for now: jobs are going through, and I saw some return successfully.

best wishes
zhangzhao

From benc at hawaga.org.uk Fri Aug 1 09:42:13 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Aug 2008 14:42:13 +0000 (GMT)
Subject: [Swift-user] problem starting swift
In-Reply-To: <48931FC4.4050902@uchicago.edu>
References: <48931FC4.4050902@uchicago.edu>
Message-ID:

On Fri, 1 Aug 2008, Zhao Zhang wrote:

> I started swift to run 15352 jobs, then swift failed to start with this
> message
>
> Execution failed:
> java.util.ConcurrentModificationException

ok, I see what causes that.

--

From zhaozhang at uchicago.edu Fri Aug 1 10:29:36 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 01 Aug 2008 10:29:36 -0500
Subject: [Swift-user] problem starting swift
In-Reply-To:
References: <48931FC4.4050902@uchicago.edu>
Message-ID: <48932BE0.6010206@uchicago.edu>

Hi, Ben

Could you tell me more details?
Thanks zhao Ben Clifford wrote: > On Fri, 1 Aug 2008, Zhao Zhang wrote: > > >> I started swift to run 15352 jobs, then swift failed to start with this >> message >> >> Execution failed: >> java.util.ConcurrentModificationException >> > > ok, I see what causes that. > > From benc at hawaga.org.uk Fri Aug 1 10:25:36 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 1 Aug 2008 15:25:36 +0000 (GMT) Subject: [Swift-user] problem starting swift In-Reply-To: References: <48931FC4.4050902@uchicago.edu> Message-ID: try swift r2168 From benc at hawaga.org.uk Fri Aug 1 10:38:52 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 1 Aug 2008 15:38:52 +0000 (GMT) Subject: [Swift-user] problem starting swift In-Reply-To: <48932BE0.6010206@uchicago.edu> References: <48931FC4.4050902@uchicago.edu> <48932BE0.6010206@uchicago.edu> Message-ID: On Fri, 1 Aug 2008, Zhao Zhang wrote: > could you tell me more in details? There is a map of job statuses maintained for the progress status display. Every time the progress line is displayed all of these statuses are counted. If any status is changed while that count is happening, then the exception you see is raised. -- From fedorov at cs.wm.edu Fri Aug 1 13:31:51 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Fri, 1 Aug 2008 14:31:51 -0400 Subject: [Swift-user] Swift scheduler Message-ID: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com> Hi, I have some general questions about the scheduling policy Swift is using. For example, suppose I have an application, which has multiple mappings to different remote sites. How is the submission site going to be selected? In case I have long queueing delays on the selected site, can Swift detect that, and submit job to a different site? Can any of the developers point me to the specific part of the source that is responsible for scheduling, so that I could try to figure this out myself? Thanks! 
--
Andrey Fedorov

Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov

From zhaozhang at uchicago.edu Fri Aug 1 13:37:45 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 01 Aug 2008 13:37:45 -0500
Subject: [Swift-user] problem starting swift
In-Reply-To:
References: <48931FC4.4050902@uchicago.edu> <48932BE0.6010206@uchicago.edu>
Message-ID: <489357F9.9040707@uchicago.edu>

Thanks, Ben

I rebuilt swift and ran a small-scale test; it works OK. I am ready for a larger test, but still waiting for resources. I will post the results as soon as I get them.

zhao

Ben Clifford wrote:
> On Fri, 1 Aug 2008, Zhao Zhang wrote:
>
>
>> could you tell me more in details?
>>
>
> There is a map of job statuses maintained for the progress status display.
> Every time the progress line is displayed all of these statuses are
> counted. If any status is changed while that count is happening, then the
> exception you see is raised.
>
>

From benc at hawaga.org.uk Fri Aug 1 13:39:46 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Aug 2008 18:39:46 +0000 (GMT)
Subject: [Swift-user] Swift scheduler
In-Reply-To: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com>
References: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com>
Message-ID:

On Fri, 1 Aug 2008, Andriy Fedorov wrote:

> For example, suppose I have an application, which has multiple
> mappings to different remote sites. How is the submission site going
> to be selected?

Each site has a score which reflects how many jobs will be sent to that site at once. The score goes up as the site is used successfully and goes down as there are problems with the site. When it's time to submit a job, one of the sites which has free space (score - actual load) will get the job.

> In case I have long queueing delays on the selected site, can Swift
> detect that, and submit job to a different site?

yes.
In recent trunk code there is a feature called 'replication' whereby jobs will be submitted up to two (three?) more times if they take more than three times the average time for jobs. Look in swift.properties for the three replication.* properties. In the past we've discussed more complicated selection algorithms than 3 * mean.

> Can any of the developers point me to the specific part of the source
> that is responsible for scheduling, so that I could try to figure this
> out myself?

Start here:
cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java.

--

From hategan at mcs.anl.gov Fri Aug 1 13:46:29 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 01 Aug 2008 13:46:29 -0500
Subject: [Swift-user] Swift scheduler
In-Reply-To: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com>
References: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com>
Message-ID: <1217616389.22481.3.camel@localhost>

On Fri, 2008-08-01 at 14:31 -0400, Andriy Fedorov wrote:
> Hi,
>
> I have some general questions about the scheduling policy Swift is using.
>
> For example, suppose I have an application, which has multiple
> mappings to different remote sites. How is the submission site going
> to be selected?

In principle a score is kept for each site. The score varies based on the number of successful submissions to that site and (negatively) with current load. Sites are picked using a weighted random out of the pool of sites (the weights being the scores).

> In case I have long queueing delays on the selected
> site, can Swift detect that, and submit job to a different site?

There's something called replication which can be enabled in swift.properties to do that.

>
> Can any of the developers point me to the specific part of the source
> that is responsible for scheduling, so that I could try to figure this
> out myself?
Things start around here in principle: http://cogkit.svn.sourceforge.net/viewvc/cogkit/trunk/current/src/cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java?view=log > > Thanks! > > -- > Andrey Fedorov > > Center for Real-Time Computing > College of William and Mary > http://www.cs.wm.edu/~fedorov > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From iraicu at cs.uchicago.edu Fri Aug 1 13:19:42 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 01 Aug 2008 13:19:42 -0500 Subject: [Swift-user] CFP: Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08) co-located with IEEE/ACM SC08 Message-ID: <489353BE.8000205@cs.uchicago.edu> Dear all, This is our final CFP for MTAGS08. Note that the submission guidelines have changed. The relevant change is: *A 250 word abstract (PDF format) must be submitted online at https://cmt.research.microsoft.com/MTAGS2008/ before the deadline of August 15th, 2008 at 11:59PM PST; the final 6/10 page papers in PDF format will be due on September 6th, 2008 at 11:59PM PST.* We look forward to a successful workshop! 
Cheers,
Ioan Raicu
http://dsl.cs.uchicago.edu/MTAGS08/

================================================================================
Call for Papers
--------------------------------------------------------------------------------
The 1st IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS)
http://dsl.cs.uchicago.edu/MTAGS08/
--------------------------------------------------------------------------------
November 17, 2008
Austin, Texas, USA
Co-located with IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC08)
================================================================================

The 1st workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) will provide the scientific community a dedicated forum for presenting new research, development, and deployment efforts of loosely coupled large scale applications on large scale clusters, Grids, and/or Supercomputers. Many-task computing, the theme of the workshop, encompasses loosely coupled applications, which are generally composed of many tasks (both independent and dependent tasks) to achieve some larger application goal. We welcome paper submissions on all topics related to MTC on large scale systems. Papers will be peer-reviewed, and accepted papers will be published by IEEE/ACM through the SC08 proceedings (pending approval). For more information, please visit http://dsl.cs.uchicago.edu/MTAGS08/.

Scope
--------------------------------------------------------------------------------
This workshop will focus on the ability to manage and execute large scale applications on today's largest clusters, Grids, and Supercomputers. Clusters with 50K+ processor cores (e.g. TACC Sun Constellation System - Ranger), Grids (e.g. TeraGrid) with a dozen sites and 100K+ processors, and supercomputers with 160K processors (e.g. IBM BlueGene/P) are beginning to come online.
Large clusters and supercomputers have traditionally been high performance computing (HPC) systems, as they are efficient at executing tightly coupled parallel jobs within a particular machine with low-latency interconnects; the applications typically use message passing interface (MPI) to achieve the needed inter-process communication. On the other hand, Grids have been the preferred platform for more loosely coupled applications that tend to be managed and executed through workflow systems. In contrast to HPC (tightly coupled applications), these loosely coupled applications make up a new class of applications that we call Many-Task Computing (MTC). MTC systems generally involve the execution of independent, sequential jobs that can be individually scheduled on many different computing resources across multiple administrative boundaries. MTC systems typically achieve this using various grid computing technologies and techniques, and often use files for inter-process communication as an alternative to MPI. MTC is reminiscent of High Throughput Computing (HTC); however, MTC differs from HTC in its emphasis on using many computing resources over short periods of time to accomplish many computational tasks, where the primary metrics are measured in seconds (e.g. FLOPS, tasks/sec, MB/s I/O rates). HTC, on the other hand, requires large amounts of computing over longer times (months and years, rather than hours and days), and is generally measured in operations per month.

Today's existing HPC systems are a viable platform to host MTC applications. However, some challenges arise in large scale applications when run on large scale systems, which can hamper the efficiency and utilization of these large scale systems.
These challenges vary from local resource manager scalability and granularity, efficient utilization of the raw hardware, shared file system contention and scalability, reliability at scale, application scalability, and understanding the limitations of the HPC systems in order to identify good candidate MTC applications. For more information, please visit http://dsl.cs.uchicago.edu/MTAGS08/. Topics -------------------------------------------------------------------------------- MTAGS 2008 topics of interest include, but are not limited to: * Compute Resource Management in large scale clusters, large Grids, and Supercomputers o Scheduling o Job execution frameworks o Local resource manager extensions o Performance evaluation of resource managers in use on large scale systems o Challenges in running many-task workloads on HPC systems * Data Management in large scale Grid and Supercomputer environments: o Data-Aware Scheduling o Shared File System performance and scalability in large deployments o Distributed file systems o Data caching frameworks and techniques * Large-Scale Workflow Systems o Workflow system performance and scalability analysis o Scalability of workflow systems o Workflow infrastructure and e-Science middleware o Programming Paradigms and Models * Large-Scale Many-Task Applications o Large-scale many-task applications o Large-scale many-task data-intensive applications o Large-scale high throughput computing (HTC) applications o Quasi-supercomputing applications, deployments, and experiences Paper Submission and Publication -------------------------------------------------------------------------------- Authors are invited to submit papers with unpublished, original work of not more than 6/10 pages (6 pages for short papers, and 10 pages for standard papers) of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per IEEE 8.5 x 11 manuscript guidelines (ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct.pdf or 
ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct.doc). A 250 word abstract (PDF format) must be submitted online at https://cmt.research.microsoft.com/MTAGS2008/ before the deadline of August 15th, 2008 at 11:59PM PST; the final 6/10 page papers in PDF format will be due on September 6th, 2008 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the IEEE digital library. Notifications of the paper decisions will be sent out by October 1st. Selected excellent work may be eligible for additional post-conference publication as journal articles or book chapters. Submission implies the willingness of at least one of the authors to register and present the paper. For more information, please visit http://dsl.cs.uchicago.edu/MTAGS08/. Important Dates -------------------------------------------------------------------------------- * Abstract Due: August 15th, 2008 * Papers Due: September 6th, 2008 * Notification of Acceptance: October 1st, 2008 * Camera Ready Papers Due: October 15th, 2008 * Workshop Date: November 17th, 2008 Committee Members -------------------------------------------------------------------------------- Workshop Chairs * Yong Zhao, Microsoft * Ian Foster, University of Chicago & Argonne National Laboratory * Ioan Raicu, University of Chicago Technical Committee * David Abramson, Monash University, Australia * Dan Ardelean, Google, USA * Pete Beckman, Argonne National Laboratory, USA * Peter Dinda, Northwestern University, USA * Ian Foster, University of Chicago & Argonne National Laboratory, USA * Alan Gara, IBM, USA * Bob Grossman, University of Illinois at Chicago, USA * Indranil Gupta, University of Illinois at Urbana Champaign, USA * Alexandru Iosup, Delft University of Technology, Netherlands * Tevfik Kosar, Louisiana State University, USA * Chuang Liu, Ask.com, USA * Shiyong Lu, Wayne State University, USA * Reagan Moore, University of California at San Diego, USA * 
Steven Newhouse, Microsoft, USA
* Cristina Nita-Rotaru, Purdue University, USA
* Marlon Pierce, Indiana University, USA
* Ioan Raicu, University of Chicago, USA
* Dan Reed, Microsoft, USA
* Matei Ripeanu, University of British Columbia, Canada
* Rick Stevens, University of Chicago & Argonne National Laboratory, USA
* Xian-He Sun, Illinois Institute of Technology, USA
* Alex Szalay, The Johns Hopkins University, USA
* Douglas Thain, University of Notre Dame, USA
* Greg Thain, University of Wisconsin, USA
* Mike Wilde, University of Chicago & Argonne National Laboratory, USA
* Matthew Woitaszek, The University Corporation for Atmospheric Research, USA
* Lingyun Yang, Yahoo Search, USA
* Sherali Zeadally, University of the District of Columbia, USA
* Yong Zhao, Microsoft, USA

--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

From zhaozhang at uchicago.edu Fri Aug 1 13:49:25 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 01 Aug 2008 13:49:25 -0500
Subject: [Swift-user] is there a way to send 256 tasks at a time to one site
Message-ID: <48935AB5.2080807@uchicago.edu>

Hi,

For the purpose of efficiency of swift on BGP, is there a way for us to send 256 tasks at a time to one site?
Thanks zhao From fedorov at cs.wm.edu Fri Aug 1 13:53:50 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Fri, 1 Aug 2008 14:53:50 -0400 Subject: [Swift-user] Swift scheduler In-Reply-To: <1217616389.22481.3.camel@localhost> References: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com> <1217616389.22481.3.camel@localhost> Message-ID: <82f536810808011153t4f2fbd60pccff7c912b997b36@mail.gmail.com> Ben, Michael -- thank you for your quick and complete answers! On Fri, Aug 1, 2008 at 2:46 PM, Mihael Hategan wrote: > On Fri, 2008-08-01 at 14:31 -0400, Andriy Fedorov wrote: >> Hi, >> >> I have some general questions about the scheduling policy Swift is using. >> >> For example, suppose I have an application, which has multiple >> mappings to different remote sites. How is the submission site going >> to be selected? > > In principle a score is kept for each site. The score varies based on > the number of successful submissions to that site and (negatively) with > current load. Sites are picked using a weighted random out of the pool > of sites (the weights being the scores). > >> In case I have long queueing delays on the selected >> site, can Swift detect that, and submit job to a different site? > > There's something called replication which can be enabled in > swift.properties to do that. > >> >> Can any of the developers point me to the specific part of the source >> that is responsible for scheduling, so that I could try to figure this >> out myself? > > Things start around here in principle: > http://cogkit.svn.sourceforge.net/viewvc/cogkit/trunk/current/src/cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java?view=log > >> >> Thanks! 
>>
>> --
>> Andrey Fedorov
>>
>> Center for Real-Time Computing
>> College of William and Mary
>> http://www.cs.wm.edu/~fedorov
>
>

From benc at hawaga.org.uk Fri Aug 1 13:54:09 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Aug 2008 18:54:09 +0000 (GMT)
Subject: [Swift-user] is there a way to send 256 tasks at a time to one site
In-Reply-To: <48935AB5.2080807@uchicago.edu>
References: <48935AB5.2080807@uchicago.edu>
Message-ID:

On Fri, 1 Aug 2008, Zhao Zhang wrote:

> For the purpose of efficiency of swift on BGP, is there a way for us to send
> 256 tasks at a time to one site?

Do you mean in one single job submission call? Swift will send the tasks separately to the provider layer (which probably in your case is provider-deef), so no. How the provider-layer gets tasks to the execution site is up to the provider - so potentially yes there. It might involve changing the falkon wire protocol though.

--

From iraicu at cs.uchicago.edu Fri Aug 1 13:56:59 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 01 Aug 2008 13:56:59 -0500
Subject: [Swift-user] is there a way to send 256 tasks at a time to one site
In-Reply-To: <48935AB5.2080807@uchicago.edu>
References: <48935AB5.2080807@uchicago.edu>
Message-ID: <48935C7B.8060807@cs.uchicago.edu>

Zhao,

The Falkon provider has a queue in which tasks from Karajan pile up, and once every so often (on some polling interval), the provider will submit tasks to Falkon (from this queue). I think the polling interval is set to 1 second, so whatever tasks end up in the queue every second will go out in 1 WS call to Falkon. We can make this polling interval longer by changing the code and recompiling. Now, if you are referring to the Swift scheduler, that it doesn't send enough tasks (i.e.
256 of them), which means that you never get to populate all CPUs with work, then that is a different question, which Mihael or Ben can hopefully answer.

Ioan

Zhao Zhang wrote:
> Hi,
>
> For the purpose of efficiency of swift on BGP, is there a way for us
> to send 256 tasks at a time to one site?
> Thanks
>
> zhao
>

From iraicu at cs.uchicago.edu Fri Aug 1 13:58:58 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 01 Aug 2008 13:58:58 -0500
Subject: [Swift-user] is there a way to send 256 tasks at a time to one site
In-Reply-To:
References: <48935AB5.2080807@uchicago.edu>
Message-ID: <48935CF2.6050601@cs.uchicago.edu>

The Falkon provider already does bunching of multiple tasks in a single WS call, as long as they are in the provider queue at the time when it checks the queue, which it does every second... this polling interval can be changed to be higher, if you are finding that only a few tasks get submitted every time. Or we can make it threshold based as well, wait X seconds, or Y tasks... it wouldn't be hard to implement different strategies to wait for more tasks...
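The "wait X seconds, or Y tasks" idea can be sketched as follows. This is only a minimal Python illustration of the strategy, not the Falkon provider's code: the class name `BatchQueue`, the parameters `max_wait` and `max_tasks`, and the `submit` callback (standing in for one WS call) are all hypothetical.

```python
import time


class BatchQueue:
    """Collects tasks and hands them to `submit` as one batch.

    Hypothetical sketch: a batch goes out when `max_tasks` tasks are
    waiting, or when the oldest waiting task is `max_wait` seconds old,
    whichever happens first.
    """

    def __init__(self, submit, max_wait=1.0, max_tasks=256, clock=time.monotonic):
        self.submit = submit        # called with a list of tasks (one "call")
        self.max_wait = max_wait    # the "X seconds" threshold
        self.max_tasks = max_tasks  # the "Y tasks" threshold
        self.clock = clock
        self.pending = []
        self.oldest = None          # arrival time of the first pending task

    def add(self, task):
        if not self.pending:
            self.oldest = self.clock()
        self.pending.append(task)
        if len(self.pending) >= self.max_tasks:
            self.flush()

    def poll(self):
        """Call periodically; flushes once the oldest task has waited long enough."""
        if self.pending and self.clock() - self.oldest >= self.max_wait:
            self.flush()

    def flush(self):
        batch, self.pending = self.pending, []
        self.oldest = None
        if batch:
            self.submit(batch)
```

With `max_tasks=256` and a longer `max_wait`, a burst of tasks would leave in one batch of 256 instead of in whatever trickled in during each one-second poll; the price is extra latency for the last few tasks of a run.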
Ioan

Ben Clifford wrote:
> On Fri, 1 Aug 2008, Zhao Zhang wrote:
>
>
>> For the purpose of efficiency of swift on BGP, is there a way for us to send
>> 256 tasks at a time to one site?
>>
>
> Do you mean in one single job submission call? Swift will send the tasks
> separately to the provider layer (which probably in your case is
> provider-deef), so no. How the provider-layer gets tasks to the execution
> site is up to the provider - so potentially yes there. It might involve
> changing the falkon wire protocol though.
>
>

From benc at hawaga.org.uk Fri Aug 1 14:08:59 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Aug 2008 19:08:59 +0000 (GMT)
Subject: [Swift-user] is there a way to send 256 tasks at a time to one site
In-Reply-To: <48935C7B.8060807@cs.uchicago.edu>
References: <48935AB5.2080807@uchicago.edu> <48935C7B.8060807@cs.uchicago.edu>
Message-ID:

On Fri, 1 Aug 2008, Ioan Raicu wrote:

> recompiling. Now, if you are referring to the Swift scheduler, that it doesn't
> send enough tasks (i.e. 256 of them), which means that you never get to
> populate all CPUs with work, then that is a different question, which Mihael
> or Ben can hopefully answer.

You can make swift send jobs quite fast; fiddle with jobThrottle and initialScore values for your site.
If you want swift to peak at sending three times as many jobs as you have CPUs, set jobThrottle to 3 * numCPUs / 100 (e.g. for 50 CPUs, set it to 1.5).

You can set initialScore to make submissions start nearer the full rate rather than starting slowly. Set it high (a few hundred, the exact value doesn't matter so much here).

Both of these are keys to set in the karajan namespace in profile entries in your sites file:

  <profile namespace="karajan" key="jobThrottle">1.5</profile>
  <profile namespace="karajan" key="initialScore">1000</profile>

--

From wilde at mcs.anl.gov Fri Aug 1 19:04:17 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 01 Aug 2008 19:04:17 -0500
Subject: [Swift-user] ext mapper int params getting coerced to float?
Message-ID: <4893A481.5090505@mcs.anl.gov>

It seems as if params to the ext mapper are getting coerced to floats.

The mapper is invoked like this:

DockOut out < ext; exec="map.dockoutput", d=dir, r=run, n=ndir, o=i >;

When the int values i and ndir reach my mapper script they are coming in with ".0" after the int values:

map.docoutput: arg: 8 dir: /home/wilde/ligandatlas/dock/ runid: run06 ndir: 5000.0 outid: 3.0
map.docoutput: arg: 8 dir: /home/wilde/ligandatlas/dock/ runid: run06 ndir: 5000.0 outid: 1.0
map.docoutput: arg: 8 dir: /home/wilde/ligandatlas/dock/ runid: run06 ndir: 5000.0 outid: 0.0
map.docoutput: arg: 8 dir: /home/wilde/ligandatlas/dock/ runid: run06 ndir: 5000.0 outid: 2.0

Is this a result of the internal numeric types being uplifted to floats to simplify type checking?

I'm assuming this does not happen when ints are expanded into a command line inside an app{} construct, so I would have expected it would not happen when passed to a mapper.
-- Script is:

type File;

type DockTarget {
    File nrg;
    File bmp;
    File spheres;
}

type Mol2;
type DockOut;

// rundock-core runid ligfile outfile target grid.bmp grid.nrg selected.spheres

(DockOut out) rundock ( Mol2 ligand, string targetname, DockTarget t) {
    app {
        rundockcore ligand out targetname t.nrg t.bmp t.spheres;
    }
}

string targetName = "1K4M";
string dir = "/home/wilde/ligandatlas/dock/";
string run = "run06";
int ndir = 5000;

DockTarget targetProtein;
Mol2 ligand[] ;

foreach compound, i in ligand {
    DockOut out < ext; exec="map.dockoutput", d=dir, r=run, n=ndir, o=i >;
    out = rundock( compound, targetName, targetProtein );
}

-- mapper map.dockoutput starts with:

#! /bin/bash

while getopts ":d:r:n:o:" options; do
  case $options in
    d) dir=$OPTARG;;
    r) runid=$OPTARG;;
    n) ndir=$OPTARG;;
    o) outid=$OPTARG;;
    *) echo $usage
       exit 1;;
  esac
done

echo map.docoutput: $*
echo map.docoutput: arg: $# dir: $dir runid: $runid ndir: $ndir outid: $outid >> mapper.log

From benc at hawaga.org.uk Sat Aug 2 04:44:31 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 2 Aug 2008 09:44:31 +0000 (GMT)
Subject: [Swift-user] ext mapper int params getting coerced to float?
In-Reply-To: <4893A481.5090505@mcs.anl.gov>
References: <4893A481.5090505@mcs.anl.gov>
Message-ID:

On Fri, 1 Aug 2008, Michael Wilde wrote:

> When the int values i and ndir reach my mapper script they are coming in with
> ".0" after the int values:

> Is this a result of the internal numeric types being uplifted to floats
> to simplify type checking?

It's not a result of recent type checking work (or shouldn't be) - that work is only prohibitive, in as much as Swift programs which work now should behave as they did before, but the set of Swift programs which will be accepted is now smaller.

It's likely an artifact of the rather hazy internal handling of numbers in the runtime layer.
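The ".0" suffix is what one would see if integer values were held as floating-point numbers and turned into strings with a default formatter. The short Python sketch below illustrates that symptom and one way to format around it. It is an assumption about the general mechanism, not a description of Swift's runtime, and `format_number` is a hypothetical helper.

```python
def format_number(value):
    """Render whole-valued floats without a trailing '.0' (hypothetical helper)."""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


ndir = float(5000)          # an int that ended up stored as a float
print(str(ndir))            # default formatting gives "5000.0"
print(format_number(ndir))  # whole numbers come out as "5000"
print(format_number(2.5))   # genuine fractions are left alone: "2.5"
```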
> I'm assuming this does not happen when ints are expanded into a command line
> inside an app{} construct

No idea, but it would be interesting to try.

--

From benc at hawaga.org.uk Tue Aug 5 06:27:40 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 5 Aug 2008 11:27:40 +0000 (GMT)
Subject: [Swift-user] ext mapper int params getting coerced to float?
In-Reply-To: <4893A481.5090505@mcs.anl.gov>
References: <4893A481.5090505@mcs.anl.gov>
Message-ID:

On Fri, 1 Aug 2008, Michael Wilde wrote:

> When the int values i and ndir reach my mapper script they are coming in with
> ".0" after the int values:

Swift r2175 changes this so that the same formatting code is used as for application arguments. You should now not see the .0 for integers.

--

From zhengxiongh at uchicago.edu Wed Aug 6 16:00:27 2008
From: zhengxiongh at uchicago.edu (Zhengxiong Hou)
Date: Wed, 6 Aug 2008 16:00:27 -0500 (CDT)
Subject: [Swift-user] job monitor
Message-ID: <20080806160027.BJK70244@m4500-01.uchicago.edu>

Hi,

When executing thousands of independent jobs by Swift on OSG, I have several questions about job monitoring:

(1) After the jobs have started, how can I monitor the job status (such as queueing, running, completed, failed) in real time?

(2) Are the jobs dispatched to the grid sites, and waiting for execution in the LOCAL queues of the local resource manager/job scheduler (such as Condor, PBS, LSF, etc.)?

(3) In the standard output of the swift execution, there is some information as follows:

Sorted: [AGLT2:15.215(39.422):16/16 overload: 0, NYSGRID-CCR-U2:21.317(50.894):18/21 overload: 0]
Sorted: [GLOW-CMS:14.071(36.751):14/15 overload: 0]
Sorted: [GLOW-CMS:14.071(36.751):15/15 overload: 0]

Does this mean that 16 jobs were dispatched to grid site "AGLT2", and 15 jobs were dispatched to grid site "GLOW-CMS"?

Thanks!

B.R.
zhengxiong

From lixi at uchicago.edu Wed Aug 6 23:18:56 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Wed, 6 Aug 2008 23:18:56 -0500 (CDT)
Subject: [Swift-user] Swift run: java.io.IOException: Unknown error 512
Message-ID: <20080806231856.BCW02035@m4500-03.uchicago.edu>

Hi,

I ran a workflow like this:

[lixi at communicado test] $ /home/lixi/performancetest/4/cog/modules/vdsk/dist/vdsk-svn/bin/swift -sites.file ../sitesfile/SELECT1/sites2.0808062300.xml -tc.file ../tc.data testworkflow.swift >0808062300.log 2>&1 &

During the execution, it stopped suddenly and the stdout and stderr are included in /home/lixi/performancetest/test/0808062300.log. It seems that it stopped due to "java.io.IOException: Unknown error 512"

The log file is /home/lixi/performancetest/test/testworkflow-20080806-2301-m1qbxjr3.log

[lixi at communicado test]$ tail -n 20 0808062300.log
Sorted: [LIGO_UWM_NEMO:140.112(90.071):37/37 overload: 0]
node10 completed
Sorted: [FLTECH:144.563(90.361):37/37 overload: 0]
node10 completed
Sorted: [UTA_SWT2:147.336(90.533):37/37 overload: 0]
node10 completed
Sorted: [FLTECH:146.739(90.497):37/37 overload: 0]
node10 completed
Sorted: [TTU-ANTAEUS:21.888(51.767):21/21 overload: 0]
Sorted: [TTU-ANTAEUS:22.888(53.230):21/22 overload: 0]
Sorted: [TTU-ANTAEUS:22.888(53.230):22/22 overload: 0]
node10 completed
Progress: Selecting site:1497 Stage in:19 Executing:170 Stage out:165 Finished successfully:106 Initializing site shared directory:2 Failed but can retry:41
java.io.IOException: Unknown error 512
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:194)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
        at org.griphyn.vdl.karajan.InHook.run(InHook.java:39)
        at java.lang.Thread.run(Thread.java:595)

Would you please tell me why such an error happened and what to do with it?
Thanks,
Xi

From benc at hawaga.org.uk Thu Aug 7 02:52:32 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 7 Aug 2008 07:52:32 +0000 (GMT)
Subject: [Swift-user] job monitor
In-Reply-To: <20080806160027.BJK70244@m4500-01.uchicago.edu>
References: <20080806160027.BJK70244@m4500-01.uchicago.edu>
Message-ID:

On Wed, 6 Aug 2008, Zhengxiong Hou wrote:

> (1) After the jobs started, How to monitor the jobs status
> (such as queueing,running,completed,failed) in real-time?

With any swift after r1696 you should see a status line every 5 to 60 seconds that looks like this:

Progress: Selecting site:777 Stage in:80 Executing:83 Stage out:19 Finished successfully:41

Do you want more than this?

> (2) Are the jobs dispatched to the grid sites, and waiting
> for execution in the LOCAL queues of the local resource
> manager/job scheduler (such as Condor,PBS,LSF,etc.)?

Pretty much jobs will be queued some before executing, yes.

> (3) In the standard output of swift execution, there are
> some information as follow:
> Sorted: [AGLT2:15.215(39.422):16/16 overload: 0, NYSGRID-CCR-U2:21.317(50.894):18/21 overload: 0]
> Sorted: [GLOW-CMS:14.071(36.751):14/15 overload: 0]
> Sorted: [GLOW-CMS:14.071(36.751):15/15 overload: 0]
>
> Does this mean that 16 jobs were dispatched to grid
> site "AGLT2", and 15 jobs were dispatched to grid site "GLOW-CMS"?

Basically yes.

--

From benc at hawaga.org.uk Thu Aug 7 03:09:53 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 7 Aug 2008 08:09:53 +0000 (GMT)
Subject: [Swift-user] Re: [Swift-devel] Swift run: java.io.IOException: Unknown error 512
In-Reply-To: <20080806231856.BCW02035@m4500-03.uchicago.edu>
References: <20080806231856.BCW02035@m4500-03.uchicago.edu>
Message-ID:

Can you reproduce it?

Google shows occurrences of that exception (unknown err 512 in FileInputStream.readBytes) happening when the java process has been set to run in the background, when reading from the console. Were you doing anything like that?
(eg running with & after the command or pressing ctrl-z)

--

From lixi at uchicago.edu Thu Aug 7 08:01:59 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Thu, 7 Aug 2008 08:01:59 -0500 (CDT)
Subject: [Swift-user] Fwd: Re: [Swift-devel] Swift run: java.io.IOException: Unknown error 512
Message-ID: <20080807080159.BCW25121@m4500-03.uchicago.edu>

-------------- next part --------------
An embedded message was scrubbed...
From:
Subject: Re: [Swift-devel] Swift run: java.io.IOException: Unknown error 512
Date: Thu, 7 Aug 2008 07:26:49 -0500 (CDT)
Size: 971
URL:

From lixi at uchicago.edu Thu Aug 7 08:02:17 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Thu, 7 Aug 2008 08:02:17 -0500 (CDT)
Subject: [Swift-user] Fwd: Re: [Swift-devel] Swift run: java.io.IOException: Unknown error 512
Message-ID: <20080807080217.BCW25140@m4500-03.uchicago.edu>

-------------- next part --------------
An embedded message was scrubbed...
From:
Subject: Re: [Swift-devel] Swift run: java.io.IOException: Unknown error 512
Date: Thu, 7 Aug 2008 07:38:01 -0500 (CDT)
Size: 1075
URL:

From fedorov at cs.wm.edu Thu Aug 7 09:47:52 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Thu, 7 Aug 2008 10:47:52 -0400
Subject: [Swift-user] Need help debugging strange problem...
Message-ID: <82f536810808070747u39a91838l3b41a01fd5fac602@mail.gmail.com>

Hi,

I have a Swift script that is running fine on UC TG site, and now I am trying to add NCSA to the set of execution sites, but I have some strange problems, and I am not sure how to debug this.

First, I submit a simple script (below) to NCSA Mercury with GT4 Fork jobmanager, and it works. When I change the provider from "fork" to "PBS", the Swift execution does not finish after the PBS job completion. I see the job submitted, queued in PBS, running, completing, I see the output file is produced in the scratch directory, but on the submission site I have "Progress: Executing:1".
The submission site is the same as for the example with "fork" jobmanager, so I don't see how firewall can be an issue, and I can telnet to the submission site from NCSA.

Note, that I was able to run the same simple test with both fork and PBS providers on the SDSC TG site.

How can I figure out what is wrong about NCSA Mercury?

sites.xml: (as in http://www.teragrid.org/userinfo/jobs/gram.php)

/home/ac/fedorov/scratch

tc.data:

NCSA-GT4 NCSA_hostname /sbin/ifconfig INSTALLED INTEL32::LINUX null

hello.swift:

type messagefile{}

(messagefile uc_hostname) hostname2(){
    app{
        NCSA_hostname stdout=@filename(uc_hostname);
    }
}

messagefile uc_hostname<"uc_hostname.txt">;
messagefile ncsa_hostname<"ncsa_hostname.txt">;

ncsa_hostname = hostname2();

--
Andrey Fedorov
Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov

From benc at hawaga.org.uk Thu Aug 7 10:27:13 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 7 Aug 2008 15:27:13 +0000 (GMT)
Subject: [Swift-user] Need help debugging strange problem...
In-Reply-To: <82f536810808070747u39a91838l3b41a01fd5fac602@mail.gmail.com>
References: <82f536810808070747u39a91838l3b41a01fd5fac602@mail.gmail.com>
Message-ID:

there is a somewhat common misconfiguration of gram4 on the server side where it is wired into the local queueing system incorrectly so that completion notifications do not find their way back. this matches the symptoms you describe - that fork works but that pbs doesn't, but that the job appears to have run.

I just tried a submission using the GT4 command line job submission command:

$ globusrun-ws -submit -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService -Ft Fork -job-command /bin/hostname
Submitting job...

but it appears to hang without submitting. not sure what is happening with that site...
Aside from that, my advice for diagnosis would be to try the above command with both Fork and PBS and see if you get the same difference in behaviour between the two. -- From fedorov at cs.wm.edu Thu Aug 7 11:23:53 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Thu, 7 Aug 2008 12:23:53 -0400 Subject: [Swift-user] Need help debugging strange problem... In-Reply-To: References: <82f536810808070747u39a91838l3b41a01fd5fac602@mail.gmail.com> Message-ID: <82f536810808070923o7ff2d3a9le86d0f7db41c61dc@mail.gmail.com> Ben, I tried what you suggested, and I have globusrun-ws working from UC submitting to NCSA, using Fork factory type: [fedorov at TG/UC:tg-login1 ~/swiftBiofem] globusrun-ws -submit -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService -Ft Fork -job-command /bin/hostname Submitting job...Done. Job ID: uuid:3b8f1662-649c-11dd-9347-0007e9d811ce Termination time: 08/08/2008 16:16 GMT Current job state: Active Current job state: CleanUp Current job state: Done Destroying job...Done. But it fails when I am using PBS factory. globusrun-ws doesn't exit, while I see job finished on NCSA. [fedorov at TG/UC:tg-login1 ~/swiftBiofem] globusrun-ws -submit -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService -Ft PBS -job-command /bin/hostname Submitting job...Done. Job ID: uuid:dc23433c-649c-11dd-9671-0007e9d811ce Termination time: 08/08/2008 16:21 GMT Current job state: Unsubmitted I am going to report this to TG help. -- Andrey Fedorov Center for Real-Time Computing College of William and Mary http://www.cs.wm.edu/~fedorov On Thu, Aug 7, 2008 at 11:27 AM, Ben Clifford wrote: > there is a somewhat common misconfiguration of gram4 on the server side > where it is wired into the local queueing system incorrectly so that > completion notifications do not find their way back. this matches the > symptoms you describe - that fork works but that pbs doesn't, but that the > job apepars to have run. 
> > I just tried a submission using the GT4 command line job submission > command: > > $ globusrun-ws -submit -F > https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService > -Ft Fork -job-command /bin/hostname > Submitting job... > > > > but it appears to hang without submitting. not sure what is happening with > that site... > > Aside from that, my advice for diagnosis would be to try the above command > with both Fork and PBS and see if you get the same difference in behaviour > between the two. > > -- > From feller at mcs.anl.gov Thu Aug 7 11:29:09 2008 From: feller at mcs.anl.gov (Martin Feller) Date: Thu, 07 Aug 2008 11:29:09 -0500 Subject: [Swift-user] Re: Need help debugging strange problem... In-Reply-To: References: Message-ID: <489B22D5.6010500@mcs.anl.gov> Andriy: Can you please try the following: submit a dummy job in batch mode to Fork and PBS and query for job status instead of relying for notifications: globusrun-ws -submit \ -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService -Ft Fork -b -e forkJob.epr -c /bin/hostname then try globusrun-ws -status -j forkJob.epr and see if you see changes in state of your job after a while Same for PBS: globusrun-ws -submit \ -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService -Ft PBS -b -e pbsJob.epr -c /bin/hostname globusrun-ws -status -j pbsJob.epr ( later on remove those jobs calling globusrun-ws -kill -j pbsJob.epr globusrun-ws -kill -j forkJob.epr ) If you see job state changes that had not been reported using globusrun-ws in interactive mode, then it's a notification problem. But i don't think this is the case. I suspect the problem is that Gram4 does not get informed about job state changes by the scheduler event generator (SEG). We once had the problem that the job state changes just didn't show up in the SEG logs, due to SEG <--> filesystem issues (i think it was lustre). 
Before speculating about this: Please run the batch jobs and tell what you get. Martin >> *From: *Ben Clifford > >> *Date: *August 7, 2008 10:27:13 AM CDT >> *To: *Andriy Fedorov > >> *Cc: *swift-user at ci.uchicago.edu >> *Subject: **Re: [Swift-user] Need help debugging strange problem...* >> >> there is a somewhat common misconfiguration of gram4 on the server side >> where it is wired into the local queueing system incorrectly so that >> completion notifications do not find their way back. this matches the >> symptoms you describe - that fork works but that pbs doesn't, but that >> the >> job apepars to have run. >> >> I just tried a submission using the GT4 command line job submission >> command: >> >> $ globusrun-ws -submit -F >> https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService >> >> -Ft Fork -job-command /bin/hostname >> Submitting job... >> >> >> >> but it appears to hang without submitting. not sure what is happening >> with >> that site... >> >> Aside from that, my advice for diagnosis would be to try the above >> command >> with both Fork and PBS and see if you get the same difference in >> behaviour >> between the two. >> >> -- >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > From feller at mcs.anl.gov Thu Aug 7 11:32:00 2008 From: feller at mcs.anl.gov (Martin Feller) Date: Thu, 07 Aug 2008 11:32:00 -0500 Subject: [Swift-user] Re: Need help debugging strange problem... In-Reply-To: <489B22D5.6010500@mcs.anl.gov> References: <489B22D5.6010500@mcs.anl.gov> Message-ID: <489B2380.3020804@mcs.anl.gov> oh, i see i did an error here: please replace "-b -e" by "-b -o" in the globusrun-ws options. 
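With that correction applied, the batch-mode check described above could be wrapped in a small script along the following lines. This is only a sketch: the factory URL, the `-b`, `-o`, `-c`, `-status` and `-j` options and the EPR file names come from the messages in this thread, while `submit_and_poll`, `GLOBUSRUN`, `POLL_SECONDS` and `POLL_COUNT` are hypothetical names chosen for illustration, and the polling interval is an arbitrary assumption.

```shell
#!/bin/bash
# Sketch only: submit a dummy job in batch mode, then query its state a few
# times instead of relying on notifications.
FACTORY=https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService
GLOBUSRUN=${GLOBUSRUN:-globusrun-ws}   # command to run (overridable, assumption)
POLL_SECONDS=${POLL_SECONDS:-30}       # wait between status queries (assumption)
POLL_COUNT=${POLL_COUNT:-3}            # number of status queries (assumption)

submit_and_poll() {
    local factory_type=$1 epr_file=$2 attempt=0
    # -b submits in batch mode, -o writes the job EPR to a file
    "$GLOBUSRUN" -submit -F "$FACTORY" -Ft "$factory_type" \
        -b -o "$epr_file" -c /bin/hostname || return 1
    while [ "$attempt" -lt "$POLL_COUNT" ]; do
        sleep "$POLL_SECONDS"
        # -status reports the current state of the job named by the EPR file
        "$GLOBUSRUN" -status -j "$epr_file"
        attempt=$((attempt + 1))
    done
}
```

It would be called once per factory type, for example `submit_and_poll Fork forkJob.epr` and `submit_and_poll PBS pbsJob.epr`, with the jobs removed afterwards using `globusrun-ws -kill -j` as described above.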
Martin

From fedorov at cs.wm.edu Thu Aug 7 11:39:30 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Thu, 7 Aug 2008 12:39:30 -0400
Subject: [Swift-user] Re: Need help debugging strange problem...
Message-ID: <82f536810808070939s6092a279q7280c03874599640@mail.gmail.com>

Martin,

I tried what you suggested. The status of the job remains "Unsubmitted" on the submission site, while I see the job completes on NCSA Mercury. I reported this problem to TG help, and will post an update if I hear any explanation from them.

Andrey

From benc at hawaga.org.uk Fri Aug 8 03:38:27 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 8 Aug 2008 08:38:27 +0000 (GMT)
Subject: [Swift-user] job monitor
In-Reply-To: <20080806160027.BJK70244@m4500-01.uchicago.edu>
References: <20080806160027.BJK70244@m4500-01.uchicago.edu>
Message-ID:

On Wed, 6 Aug 2008, Zhengxiong Hou wrote:

> (2) Are the jobs dispatched to the grid sites, and waiting
> for execution in the LOCAL queues of the local resource
> manager/job scheduler (such as Condor,PBS,LSF,etc.)?

Related to this, in swift r2183, I changed the progress ticker a bit to reflect some of this state a bit more.
What used to be "Executing" is now three different states: Submitting -> Submitted -> Active 'Submitted' is when the job is in a queue at the remote site, and 'Active' is when the job is running on the remote site (at least as far as the Swift runtime is aware). -- From hategan at mcs.anl.gov Mon Aug 11 17:23:27 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 11 Aug 2008 17:23:27 -0500 Subject: [Swift-user] test; please ignore Message-ID: <1218493407.13994.2.camel@localhost> From fedorov at cs.wm.edu Thu Aug 14 16:16:11 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Thu, 14 Aug 2008 17:16:11 -0400 Subject: [Swift-user] Node specification for local PBS/Torque provider Message-ID: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com> Hi, I was unable to find the details on how nodes should be specified when local PBS provider is used. With sequential jobs, I am able to use GLOBUS attribute maxWallTime. When I am trying to request multiple nodes, I tried to specify the number with the host_types and host_count attributes, but each time I am getting only one node, and also PBS_-variables (like PBS_NODEFILE) are not defined on that single allocated node. Can anyone help me with this? Thanks -- Andrey Fedorov Center for Real-Time Computing College of William and Mary http://www.cs.wm.edu/~fedorov From hategan at mcs.anl.gov Thu Aug 14 18:59:26 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 14 Aug 2008 18:59:26 -0500 Subject: [Swift-user] Node specification for local PBS/Torque provider In-Reply-To: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com> References: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com> Message-ID: <1218758366.25840.3.camel@localhost> On Thu, 2008-08-14 at 17:16 -0400, Andriy Fedorov wrote: > Hi, > > I was unable to find the details on how nodes should be specified when > local PBS provider is used. 
Looking at the code, host_types and host_count don't seem to be handled by the pbs provider. You should file a bug report with cog (http://bugzilla.mcs.anl.gov/globus/enter_bug.cgi?product=CoG%20Kit) (so that I don't forget about it) and it will get fixed eventually. The exact timing depends on how badly you need it. Mihael From fedorov at cs.wm.edu Fri Aug 15 07:51:38 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Fri, 15 Aug 2008 08:51:38 -0400 Subject: [Swift-user] Node specification for local PBS/Torque provider In-Reply-To: <1218758366.25840.3.camel@localhost> References: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com> <1218758366.25840.3.camel@localhost> Message-ID: <82f536810808150551l67fcc8e7m54bf38778f6cc6d7@mail.gmail.com> Michael, thank you for reply. I looked through the cog source, qsub scripts it generates, and I think I localized the problem. > You should file a bug report with cog > (http://bugzilla.mcs.anl.gov/globus/enter_bug.cgi?product=CoG%20Kit) (so > that I don't forget about it) and it will get fixed eventually. I logged the bug. > The exact timing depends on how badly you need it. Well, I am not sure what you mean by this... I certainly did not find this bug having nothing to do and browsing through the CoG code. I tried to accomplish something, and now I am blocked because of this bug. If it takes too long, I guess I will have to fix it myself. I don't know how to rush a bug fix. I know you guys are very busy and have your own priorities... -- Andrey Fedorov Center for Real-Time Computing College of William and Mary http://www.cs.wm.edu/~fedorov On Thu, Aug 14, 2008 at 7:59 PM, Mihael Hategan wrote: > On Thu, 2008-08-14 at 17:16 -0400, Andriy Fedorov wrote: >> Hi, >> >> I was unable to find the details on how nodes should be specified when >> local PBS provider is used. > > Looking at the code, host_types and host_count don't seem to be handled > by the pbs provider. 
> You should file a bug report with cog > (http://bugzilla.mcs.anl.gov/globus/enter_bug.cgi?product=CoG%20Kit) (so > that I don't forget about it) and it will get fixed eventually. The > exact timing depends on how badly you need it. > > Mihael > > > From benc at hawaga.org.uk Fri Aug 15 08:15:01 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 15 Aug 2008 13:15:01 +0000 (GMT) Subject: [Swift-user] Node specification for local PBS/Torque provider In-Reply-To: <82f536810808150551l67fcc8e7m54bf38778f6cc6d7@mail.gmail.com> References: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com> <1218758366.25840.3.camel@localhost> <82f536810808150551l67fcc8e7m54bf38778f6cc6d7@mail.gmail.com> Message-ID: On Fri, 15 Aug 2008, Andriy Fedorov wrote: > If it takes too long, I guess I will have to fix it myself. > I don't know how to rush a bug fix. I know you guys are very busy and > have your own priorities... Bug fixes are pretty much always welcome if you do fix it yourself. My priority at the moment is a week long vacation but if its still broken when I come back, I'll look at it then. -- From fedorov at cs.wm.edu Fri Aug 15 08:32:42 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Fri, 15 Aug 2008 09:32:42 -0400 Subject: [Swift-user] Node specification for local PBS/Torque provider In-Reply-To: References: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com> <1218758366.25840.3.camel@localhost> <82f536810808150551l67fcc8e7m54bf38778f6cc6d7@mail.gmail.com> Message-ID: <82f536810808150632r2fdb458aoea9553c724498186@mail.gmail.com> Looking at the cog code, I see that I can use the attribute "count" to specify the number and type of nodes for PBS. For me this is good enough. I updated the bug status accordingly to WORKSFORME. -- Fedorov On Fri, Aug 15, 2008 at 9:15 AM, Ben Clifford wrote: > > On Fri, 15 Aug 2008, Andriy Fedorov wrote: > >> If it takes too long, I guess I will have to fix it myself. >> I don't know how to rush a bug fix. 
I know you guys are very busy and >> have your own priorities... > > Bug fixes are pretty much always welcome if you do fix it yourself. My > priority at the moment is a week long vacation but if its still broken > when I come back, I'll look at it then. > > -- > > From hategan at mcs.anl.gov Fri Aug 15 10:33:31 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 15 Aug 2008 10:33:31 -0500 Subject: [Swift-user] Node specification for local PBS/Torque provider In-Reply-To: <82f536810808150632r2fdb458aoea9553c724498186@mail.gmail.com> References: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com> <1218758366.25840.3.camel@localhost> <82f536810808150551l67fcc8e7m54bf38778f6cc6d7@mail.gmail.com> <82f536810808150632r2fdb458aoea9553c724498186@mail.gmail.com> Message-ID: <1218814411.31924.2.camel@localhost> On Fri, 2008-08-15 at 09:32 -0400, Andriy Fedorov wrote: > Looking at the cog code, I see that I can use the attribute "count" to > specify the number and type of nodes for PBS. For me this is good > enough. I updated the bug status accordingly to WORKSFORME. There should be a uniform way of specifying this stuff. So I think that should be fixed. 
Mihael

From fedorov at cs.wm.edu Fri Aug 15 10:35:18 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 15 Aug 2008 11:35:18 -0400
Subject: [Swift-user] Node specification for local PBS/Torque provider
In-Reply-To: <1218814411.31924.2.camel@localhost>
References: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com> <1218758366.25840.3.camel@localhost> <82f536810808150551l67fcc8e7m54bf38778f6cc6d7@mail.gmail.com> <82f536810808150632r2fdb458aoea9553c724498186@mail.gmail.com> <1218814411.31924.2.camel@localhost>
Message-ID: <82f536810808150835j32cfe274x4b7c94e38837b889@mail.gmail.com>

On Fri, Aug 15, 2008 at 11:33 AM, Mihael Hategan wrote:
> On Fri, 2008-08-15 at 09:32 -0400, Andriy Fedorov wrote:
>> Looking at the cog code, I see that I can use the attribute "count" to
>> specify the number and type of nodes for PBS. For me this is good
>> enough. I updated the bug status accordingly to WORKSFORME.
>
> There should be a uniform way of specifying this stuff.

Oh, I totally agree with you on this one!...

>
> Mihael
>

From benc at hawaga.org.uk Mon Aug 25 03:51:31 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Aug 2008 08:51:31 +0000 (GMT)
Subject: [Swift-user] Swift 0.6 released
Message-ID:

Swift 0.6 is online for download at http://www.ci.uchicago.edu/swift/downloads/

In addition to a bunch of bugfixes, the most interesting changes are:

* much more rigorous compile time type checking - this catches many more errors at the start rather than hours into a run, and gives more useful error reports.

* better multisite handling:

  + job replication - when a job has been queued for much longer than average, Swift can launch a replica of the job on another site. This helps when making multisite runs where one site has a much longer queue time than another.

  + rate limiting for bad sites - poorly scored sites are now rate limited much more than in previous versions of Swift, with very poorly scored sites being delayed between executions.

* cog coasters - this is a new execution provider that allows a single 'coaster' job to be submitted per worker node which pulls in Swift jobs. This can greatly reduce the number of jobs submitted to the underlying job submission mechanism (such as GRAM2) allowing more jobs to be submitted; it also can reduce the amount of time jobs spend in the LRM queue by sending them directly to an already-executing coaster.

--

From fedorov at cs.wm.edu Mon Aug 25 18:46:46 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Mon, 25 Aug 2008 19:46:46 -0400
Subject: [Swift-user] Returning arrays of files
Message-ID: <82f536810808251646h18ada8e1ibb1e08fcb40d7eb0@mail.gmail.com>

Hi,

I have a procedure that wraps an application, which creates a large number of files. The names of those files are not passed as input arguments, but I know their names in advance. I was trying to handle this by doing the following (this is a fragment of the complete code):

(file bImages[], file wImages[]) prepareImages(file fImage, file rImage, file pList){
    app {
        BMPrepareImages @fImage @rImage @pList;
    }
}

prepareImages1(file fImage, file rImage, file pList){
    app {
        BMPrepareImages @fImage @rImage @pList;
    }
}

string bImageNames[];
string wImageNames[];

iterate i {
    bImageNames[i] = @strcat("block_",i,".nii.gz");
    wImageNames[i] = @strcat("window_",i,".nii.gz");
}until(i==numPoints-1);

file bImages[];
file wImages[];

(bImages,wImages) = prepareImages(fImageRsmooth,rImageRsmooth,fImagePointList);

But when I run this script, Swift tells me

Progress:
Progress:
Progress:
Progress:
Progress:
...

and nothing happens...

When I run "prepareImages1" instead of "prepareImages", it finishes, but of course I don't get the output files. So there cannot be a problem in the application. I suspect there's something wrong with the way I specify the output of the procedure.

Can anyone help me, what is wrong here?
-- Andrey Fedorov Center for Real-Time Computing College of William and Mary http://www.cs.wm.edu/~fedorov From benc at hawaga.org.uk Tue Aug 26 03:34:21 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 26 Aug 2008 08:34:21 +0000 (GMT) Subject: [Swift-user] Returning arrays of files In-Reply-To: <82f536810808251646h18ada8e1ibb1e08fcb40d7eb0@mail.gmail.com> References: <82f536810808251646h18ada8e1ibb1e08fcb40d7eb0@mail.gmail.com> Message-ID: Probably what is happening is that in the below section, Swift can't deal with mapper parameter inputs being constructed dynamically. One day hopefully it will be able to - it makes sense in the language. This situation is a bit annoying - you could use simple mapper if these were input files... > string bImageNames[]; > > iterate i { > bImageNames[i] = @strcat("block_",i,".nii.gz"); > wImageNames[i] = @strcat("window_",i,".nii.gz"); > }until(i==numPoints-1); > > file bImages[]; -- From fedorov at cs.wm.edu Tue Aug 26 14:29:16 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Tue, 26 Aug 2008 15:29:16 -0400 Subject: [Swift-user] NullPointerException Message-ID: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com> Hi, Is there any advice on how I can debug the following error while trying to run a Swift script: Could not compile SwiftScript source: java.lang.NullPointerException (this is the only error message I get) -- Andrey Fedorov Center for Real-Time Computing College of William and Mary http://www.cs.wm.edu/~fedorov From hategan at mcs.anl.gov Tue Aug 26 14:45:17 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 26 Aug 2008 14:45:17 -0500 Subject: [Swift-user] NullPointerException In-Reply-To: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com> References: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com> Message-ID: <1219779917.13302.5.camel@localhost> On Tue, 2008-08-26 at 15:29 -0400, Andriy Fedorov wrote: > Hi, > > Is there any advice on how I can debug the 
following error while > trying to run a Swift script: > > Could not compile SwiftScript source: java.lang.NullPointerException > > (this is the only error message I get) The -d flag to swift should provide more details (perhaps a stack trace). From fedorov at cs.wm.edu Tue Aug 26 15:03:57 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Tue, 26 Aug 2008 16:03:57 -0400 Subject: [Swift-user] NullPointerException In-Reply-To: <1219779917.13302.5.camel@localhost> References: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com> <1219779917.13302.5.camel@localhost> Message-ID: <82f536810808261303x607b1070l3ad7ddf5f7a5b90f@mail.gmail.com> On Tue, Aug 26, 2008 at 3:45 PM, Mihael Hategan wrote: > On Tue, 2008-08-26 at 15:29 -0400, Andriy Fedorov wrote: >> Hi, >> >> Is there any advice on how I can debug the following error while >> trying to run a Swift script: >> >> Could not compile SwiftScript source: java.lang.NullPointerException >> >> (this is the only error message I get) > > The -d flag to swift should provide more details (perhaps a stack > trace). > Yes, thanks. 
Still, no idea how to connect this error with my Swift source: Could not compile SwiftScript source: java.lang.NullPointerException Full parser exception java.lang.NullPointerException at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:515) at org.antlr.stringtemplate.language.ASTExpr.writeAttribute(ASTExpr.java:458) at org.antlr.stringtemplate.language.ActionEvaluator.action(ActionEvaluator.java:86) at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:93) at org.antlr.stringtemplate.StringTemplate.write(StringTemplate.java:693) at org.antlr.stringtemplate.StringTemplate.toString(StringTemplate.java:1412) at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:64) at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:46) at org.griphyn.vdl.karajan.Loader.compile(Loader.java:207) at org.griphyn.vdl.karajan.Loader.main(Loader.java:123) Exception when compiling ../pe_script_500-1.swift org.griphyn.vdl.toolkit.VDLt2VDLx$ParsingException: Could not compile SwiftScript source: null at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:68) at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:46) at org.griphyn.vdl.karajan.Loader.compile(Loader.java:207) at org.griphyn.vdl.karajan.Loader.main(Loader.java:123) Caused by: java.lang.NullPointerException at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:515) at org.antlr.stringtemplate.language.ASTExpr.writeAttribute(ASTExpr.java:458) at org.antlr.stringtemplate.language.ActionEvaluator.action(ActionEvaluator.java:86) at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:93) at org.antlr.stringtemplate.StringTemplate.write(StringTemplate.java:693) at org.antlr.stringtemplate.StringTemplate.toString(StringTemplate.java:1412) at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:64) ... 
3 more

From benc at hawaga.org.uk Tue Aug 26 15:10:10 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 26 Aug 2008 20:10:10 +0000 (GMT)
Subject: [Swift-user] NullPointerException
In-Reply-To: <82f536810808261303x607b1070l3ad7ddf5f7a5b90f@mail.gmail.com>
References: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com> <1219779917.13302.5.camel@localhost> <82f536810808261303x607b1070l3ad7ddf5f7a5b90f@mail.gmail.com>
Message-ID:

something not right in the compiler - it should be either working or giving a more meaningful error message. can you send the source file that causes this?

From fedorov at cs.wm.edu Tue Aug 26 15:52:43 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 26 Aug 2008 16:52:43 -0400
Subject: [Swift-user] NullPointerException
In-Reply-To:
References: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com> <1219779917.13302.5.camel@localhost> <82f536810808261303x607b1070l3ad7ddf5f7a5b90f@mail.gmail.com>
Message-ID: <82f536810808261352o483d3239id30d98d550f846e0@mail.gmail.com>

Just to update the list on the resolution of this problem (thanks, Ben Clifford!).

I was not supposed to have ";" at the end of the "foreach" construct:

foreach i in [0:numPoints-1] {
  ...
} // <=== NO ";" HERE!!!

-- Andrey Fedorov

From benc at hawaga.org.uk Tue Aug 26 15:54:46 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 26 Aug 2008 20:54:46 +0000 (GMT)
Subject: [Swift-user] NullPointerException
In-Reply-To: <82f536810808261352o483d3239id30d98d550f846e0@mail.gmail.com>
References: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com> <1219779917.13302.5.camel@localhost> <82f536810808261303x607b1070l3ad7ddf5f7a5b90f@mail.gmail.com> <82f536810808261352o483d3239id30d98d550f846e0@mail.gmail.com>
Message-ID:

On Tue, 26 Aug 2008, Andriy Fedorov wrote:

> Just to update the list on the resolution of this problem (thanks, Ben
> Clifford!).
>
> I was not supposed to have ";" at the end of the "foreach" construct:
>
> foreach i in [0:numPoints-1] {
>   ...
> } // <=== NO ";" HERE!!!

Though this perhaps should be allowed, in as much as empty statements are allowed.

--

From benc at hawaga.org.uk Wed Aug 27 04:49:29 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 27 Aug 2008 09:49:29 +0000 (GMT)
Subject: [Swift-user] NullPointerException
In-Reply-To: <82f536810808261352o483d3239id30d98d550f846e0@mail.gmail.com>
References: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com> <1219779917.13302.5.camel@localhost> <82f536810808261303x607b1070l3ad7ddf5f7a5b90f@mail.gmail.com> <82f536810808261352o483d3239id30d98d550f846e0@mail.gmail.com>
Message-ID:

On Tue, 26 Aug 2008, Andriy Fedorov wrote:

> I was not supposed to have ";" at the end of the "foreach" construct:

It's actually nothing to do with the foreach - a swift program consisting of only this line:

;;

exhibits the same error.

I've changed the language grammar to disallow empty statements, in r2205. This will give a different error message now, pointing at the semicolon.
-- From fedorov at cs.wm.edu Wed Aug 27 15:08:59 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Wed, 27 Aug 2008 16:08:59 -0400 Subject: [Swift-user] tcp.port.range in Swift 0.6 Message-ID: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com> Hello, It appears to me that Swift 0.6 does not handle properly custom GLOBUS_TCP_PORT_RANGE. In the previous release, I used env variable $GLOBUS_TCP_PORT_RANGE to set this range. In the current release, it seems like tcp.port.range can be specified in etc/swift.properties. I still have the env variable, and tried to specify the port range in swift.properties, and I tried passing it as -tcp.port.range, but Swift keeps opening ports outside the specified range, as reported by netstat. The same simple test script works with 0.5, but not with 0.6. This seems like a bug to me. -- Andrey Fedorov Center for Real-Time Computing College of William and Mary http://www.cs.wm.edu/~fedorov From benc at hawaga.org.uk Thu Aug 28 10:09:54 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 28 Aug 2008 15:09:54 +0000 (GMT) Subject: [Swift-user] tcp.port.range in Swift 0.6 In-Reply-To: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com> References: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com> Message-ID: On Wed, 27 Aug 2008, Andriy Fedorov wrote: > It appears to me that Swift 0.6 does not handle properly custom > GLOBUS_TCP_PORT_RANGE. I just tried 0.6 and a head build from about 24h ago on my laptop, and both appear to respect GLOBUS_TCP_PORT_RANGE for the purposes of gram4 notification sinks Can you give more details about what you see? For example, the output of /usr/bin/env and the lines that you see in netstat -pant, the sites.xml file that you are using. > I still have the env variable, and tried to specify the port range in > swift.properties, and I tried passing it as -tcp.port.range, but Swift > keeps opening ports outside the specified range, as reported by > netstat. 
The same simple test script works with 0.5, but not with 0.6. Note that GLOBUS_TCP_PORT_RANGE controls which ports are used for server sockets; it does not control which ports are used for outbound connections. There is a different variable, GLOBUS_TCP_SOURCE_PORT_RANGE that should control the latter. -- From fedorov at cs.wm.edu Thu Aug 28 11:10:39 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Thu, 28 Aug 2008 12:10:39 -0400 Subject: [Swift-user] tcp.port.range in Swift 0.6 In-Reply-To: References: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com> Message-ID: <82f536810808280910t3b4b90a1k69b55400ea67e325@mail.gmail.com> On Thu, Aug 28, 2008 at 11:09 AM, Ben Clifford wrote: > > On Wed, 27 Aug 2008, Andriy Fedorov wrote: > >> It appears to me that Swift 0.6 does not handle properly custom >> GLOBUS_TCP_PORT_RANGE. > > I just tried 0.6 and a head build from about 24h ago on my laptop, and > both appear to respect GLOBUS_TCP_PORT_RANGE for the purposes of gram4 > notification sinks > Ok, I don't understand what's going on. I am sure it worked with 0.5 yesterday, but it doesn't anymore. This happens only for UC TG site though. I checked again netstat, and I do have port 50000 listening. I am also able to telnet to this port from UC, firewall is ok. The job is submitted and executed, but apparently notification doesn't reach my server. 
The relevant part of my sites.xml has not changed since the last time I had it working: /home/fedorov/scratch From hategan at mcs.anl.gov Thu Aug 28 11:24:43 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 28 Aug 2008 11:24:43 -0500 Subject: [Swift-user] tcp.port.range in Swift 0.6 In-Reply-To: <82f536810808280910t3b4b90a1k69b55400ea67e325@mail.gmail.com> References: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com> <82f536810808280910t3b4b90a1k69b55400ea67e325@mail.gmail.com> Message-ID: <1219940683.18551.9.camel@localhost> On Thu, 2008-08-28 at 12:10 -0400, Andriy Fedorov wrote: > On Thu, Aug 28, 2008 at 11:09 AM, Ben Clifford wrote: > > > > On Wed, 27 Aug 2008, Andriy Fedorov wrote: > > > >> It appears to me that Swift 0.6 does not handle properly custom > >> GLOBUS_TCP_PORT_RANGE. > > > > I just tried 0.6 and a head build from about 24h ago on my laptop, and > > both appear to respect GLOBUS_TCP_PORT_RANGE for the purposes of gram4 > > notification sinks > > > > Ok, I don't understand what's going on. I am sure it worked with 0.5 > yesterday, but it doesn't anymore. > > This happens only for UC TG site though. I checked again netstat, and > I do have port 50000 listening. I am also able to telnet to this port > from UC, firewall is ok. The job is submitted and executed, but > apparently notification doesn't reach my server. > > The relevant part of my sites.xml has not changed since the last time > I had it working: There is one change which might affect things, depending on your exact configuration. Previously, in cog.properties or swift.properties, the ip= setting was the one to use to force a specific client IP. This has changed to hostname=. However, a higher priority is given to $GLOBUS_HOSTNAME, which is copied by the startup scripts from $HOSTNAME (if not explicitly set). 
This was done because Ben noticed that it may be desirable to be able to pass an unresolved DNS name as a callback address, which should be resolved by the servers when trying to... call back. So if you were previously relying on ip= in cog/swift.properties, try setting hostname= instead (it can be a numeric IP). Remove ip=, which has been deprecated in swift.properties. If that doesn't work (i.e. you have an improper $HOSTNAME) set $GLOBUS_HOSTNAME. Mihael From benc at hawaga.org.uk Thu Aug 28 12:11:54 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 28 Aug 2008 17:11:54 +0000 (GMT) Subject: [Swift-user] tcp.port.range in Swift 0.6 In-Reply-To: References: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com> Message-ID: tg uc gram4 seems a bit funny at the moment - it hangs like this when I do a manual job submission from a UC machine. That might be the problem, not the Swift version. $ globusrun-ws -submit -F tg-grid.uc.teragrid.org -Ft Fork -c /bin/hostname Submitting job...Done. Job ID: uuid:407bb778-7524-11dd-8991-001a64784960 Termination time: 08/29/2008 17:10 GMT -- From fedorov at cs.wm.edu Thu Aug 28 12:25:09 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Thu, 28 Aug 2008 13:25:09 -0400 Subject: [Swift-user] tcp.port.range in Swift 0.6 In-Reply-To: References: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com> Message-ID: <82f536810808281025i2f9ec69aj8f4af24b6638c5cb@mail.gmail.com> On Thu, Aug 28, 2008 at 1:11 PM, Ben Clifford wrote: > tg uc gram4 seems a bit funny at the moment - it hangs like this when I do > a manual job submission from a UC machine. That might be the problem, not > the Swift version. > This is what I was going to try right before I received your email. I wanted to comment in general on TeraGrid. I find it to be a very painful experience using it through GRAM. Initially, I wanted to use 4 sites: Linux clusters on SDSC, UC and NCSA, and Abe at NCSA. 
I found problems with GRAM configuration on Abe and NCSA cluster, I reported that to help about 3 weeks ago, and every time I ask about updates, they tell me they would notify me. Now UC is also out of the loop, and I have only SDSC left, which usually has very tight queue schedule. I assume, these problems would be resolved more quickly if more people were doing something similar to what I am doing. I will keep bugging TG help until I run out of time for this project.... > $ globusrun-ws -submit -F tg-grid.uc.teragrid.org -Ft Fork -c > /bin/hostname > Submitting job...Done. > Job ID: uuid:407bb778-7524-11dd-8991-001a64784960 > Termination time: 08/29/2008 17:10 GMT > > -- > > From benc at hawaga.org.uk Thu Aug 28 12:56:50 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 28 Aug 2008 17:56:50 +0000 (GMT) Subject: [Swift-user] tcp.port.range in Swift 0.6 In-Reply-To: <82f536810808281025i2f9ec69aj8f4af24b6638c5cb@mail.gmail.com> References: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com> <82f536810808281025i2f9ec69aj8f4af24b6638c5cb@mail.gmail.com> Message-ID: On Thu, 28 Aug 2008, Andriy Fedorov wrote: > I wanted to comment in general on TeraGrid. I find it to be a very > painful experience using it through GRAM. Initially, I wanted to use 4 In cases where GRAM4 isn't working, you can probably submit (at a lower rate) with GRAM2 to the same site - that seems to be more reliable often. --
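
For anyone repeating the netstat check from the tcp.port.range thread above, here is a rough sketch in Python of comparing listening ports against the configured range. It assumes GLOBUS_TCP_PORT_RANGE holds two numbers written as "min,max" (or separated by a space). The helper names and the upper bound 51000 are made up for illustration; only port 50000 comes from the thread. As noted in the thread, this range applies to server (listening) sockets only, so outbound connections should not be checked against it.

```python
import os


def parse_port_range(value):
    """Parse a 'min,max' (or 'min max') string into a (low, high) tuple."""
    parts = value.replace(",", " ").split()
    if len(parts) != 2:
        raise ValueError("expected two port numbers, got: %r" % value)
    low, high = int(parts[0]), int(parts[1])
    if low > high:
        raise ValueError("range is reversed: %r" % value)
    return low, high


def ports_outside_range(listening_ports, port_range):
    """Return the sorted listening ports that fall outside the range."""
    low, high = port_range
    return sorted(p for p in listening_ports if not low <= p <= high)


if __name__ == "__main__":
    # Hypothetical default; the real value is whatever your environment sets.
    value = os.environ.get("GLOBUS_TCP_PORT_RANGE", "50000,51000")
    port_range = parse_port_range(value)
    # Ports copied by hand from the LISTEN lines of netstat (example values).
    observed = [50000, 49152, 50500]
    print("outside range:", ports_outside_range(observed, port_range))
```

Only LISTEN entries belong in the observed list. Ports used for outbound connections are governed by GLOBUS_TCP_SOURCE_PORT_RANGE instead, so seeing those outside the range is not by itself a sign of a bug.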