From iraicu at cs.iit.edu Wed Feb 1 17:43:48 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Wed, 01 Feb 2012 17:43:48 -0600 Subject: [Swift-devel] Call for Workshops: The 9th Int. Conf. on Autonomic Computing (ICAC 2012) Message-ID: <4F29CE34.8020706@cs.iit.edu> CALL FOR WORKSHOP PROPOSALS The 9th International Conference on Autonomic Computing (ICAC 2012) September 17-21, 2012. San Jose, CA, USA http://icac2012.cs.fiu.edu/ ----------------------------------------------------------------- IMPORTANT DATES Workshop Proposal Submission: February 10, 2012 ----------------------------------------------------------------- OVERVIEW ICAC is the leading conference on autonomic computing techniques, foundations, and applications. Autonomic computing refers to methods and means for automated management of performance, fault, security, and configuration with little involvement of users or administrators. Systems introducing new autonomic features are becoming increasingly prevalent, motivating research that spans a variety of areas, from computer systems, networking, software engineering, and data management to machine learning, control theory, and bio-inspired computing. ICAC brings together researchers and practitioners across these disciplines to address multiple facets of adaptation and self-management in computing systems and applications from different perspectives. Autonomic computing solutions are sought for clouds, grids, data centers, enterprise software, internet services, data services, smart phones, embedded systems, and sensor networks. In these environments, resources and applications must be managed to maximize performance and minimize cost, while maintaining predictable and reliable behavior in the face of varying workloads, failures, and malicious threats. ICAC'12 welcomes proposals for co-located workshops on topics of interest to the autonomic computing community. 
Workshop proposals should be submitted to the Workshop Chair, Fred Douglis (f.douglis at computer.org) by February 10, 2012. Workshops are expected to publish proceedings, and should cover areas that complement the main program. ------------------------------------------------------------------ ORGANIZERS GENERAL CHAIR: Dejan Milojicic, HP Labs WORKSHOPS CHAIR: Fred Douglis, EMC -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= From turam at mcs.anl.gov Fri Feb 3 13:20:05 2012 From: turam at mcs.anl.gov (Thomas Uram) Date: Fri, 3 Feb 2012 13:20:05 -0600 Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk, coasters, ssh-cl:pbs) Message-ID: I'm encountering a problem using coasters with ssh-cl:pbs in trunk. 
The first error is as follows: 2012-02-03 13:05:29,823-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=hostname-k2urbkmk - Application exception: null Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job Caused by: org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://206.12.24.2:35836(2)[1544213635: {}] Caused by: java.net.NoRouteToHostException: No route to host 2012-02-03 13:05:29,875-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=hostname-j2urbkmk - Application exception: null Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job Caused by: org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://206.12.24.2:35836(3)[1544213635: {}] Caused by: java.net.NoRouteToHostException: No route to host 2012-02-03 13:05:32,585-0600 WARN vdl:transferwrapperlog Failed to transfer wrapper log for job hostname-k2urbkmk 2012-02-03 13:05:32,586-0600 DEBUG vdl:transferwrapperlog Exception for wrapper log failure from hostname-20120203-1305-3q1m7jg3/info/k on Bugaboo: null Caused by: org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: Exception in getFile Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Failed to retrieve file information about /home/turam/tmp/hostname-20120203-1305-3q1m7jg3/info/k/hostname-k2urbkmk-info Caused by: org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server refused MLST command (error code 1) [Nested exception message: Custom message: Unexpected reply: 500-Command failed : System error in stat: No such file or directory 500-A system call failed: No such file or directory 500 End.] 
[Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 500-Command failed : System error in stat: No such file or directory 500-A system call failed: No such file or directory 500 End.] The full log file (with embedded sites and tc files) is here: http://www.mcs.anl.gov/~turam/20120203-1308/hostname-20120203-1305-3q1m7jg3.log This same scenario worked with Swift 0.93, using ssh:pbs instead (ssh-cl is only available in trunk). Any help understanding and working around this problem would be great. Thanks, Tom Uram
From jonmon at mcs.anl.gov Fri Feb 3 13:27:32 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Fri, 3 Feb 2012 13:27:32 -0600 Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk, coasters, ssh-cl:pbs) In-Reply-To: References: Message-ID: So I am not sure if this is a similar problem I ran into, but I had to change the X509_USER_PROXY variable. Normally this is set to /tmp/x509up_u. I had to change it(changed it to $HOME/.globus/ For some reason when issuing a command over ssh(example: ssh jonmon at login.pads.ci.uchicago.edu ls /tmp/) my proxy file was not there. But when I would log into the machine before issuing the ls command the proxy file was there. I assumed(not verified) that the /tmp/ directory is not fully configured/mounted properly when issuing a command over ssh. Changing the X509_USER_PROXY variable fixed the issue.
From turam at mcs.anl.gov Fri Feb 3 13:35:10 2012 From: turam at mcs.anl.gov (Thomas Uram) Date: Fri, 3 Feb 2012 13:35:10 -0600 Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk, coasters, ssh-cl:pbs) In-Reply-To: References: Message-ID: <7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov> That doesn't appear to help in my case. Should the hostname in the URL here concern me? >>> Caused by: org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://206.12.24.2:35836(2)[1544213635: {}] >> On Feb 3, 2012, at 1:27 PM, Jonathan Monette wrote: > So I am not sure if this is a similar problem I ran into, but I had to change the X509_USER_PROXY variable. Normally this is set to /tmp/x509up_u. I had to change it(changed it to $HOME/.globus/ For some reason when issuing a command over ssh(example: ssh jonmon at login.pads.ci.uchicago.edu ls /tmp/) my proxy file was not there. But when I would log into the machine before issuing the ls command the proxy file was there. I assumed(not verified) that the /tmp/ directory is not fully configured/mounted properly when issuing a command over ssh. Changing the X509_USER_PROXY variable fixed the issue.
From jonmon at mcs.anl.gov Fri Feb 3 13:36:53 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Fri, 3 Feb 2012 13:36:53 -0600 Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk, coasters, ssh-cl:pbs) In-Reply-To: <7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov> References: <7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov> Message-ID: <32EB9F52-B542-4793-B2AF-C65D0803C9B2@mcs.anl.gov> What machine are you executing on? bridled to where? On Feb 3, 2012, at 1:35 PM, Thomas Uram wrote: > That doesn't appear to help in my case. > > Should the hostname in the URL here concern me?
From hategan at mcs.anl.gov Fri Feb 3 13:37:07 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 03 Feb 2012 11:37:07 -0800 Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk, coasters, ssh-cl:pbs) In-Reply-To: <7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov> References: <7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov> Message-ID: <1328297827.22991.0.camel@blabla> On Fri, 2012-02-03 at 13:35 -0600, Thomas Uram wrote: > That doesn't appear to help in my case. > > Should the hostname in the URL here concern me? It should. Or rather said "no route to host" should. Did you set GLOBUS_HOSTNAME on the client side to the public IP of the client machine? Mihael
From turam at mcs.anl.gov Fri Feb 3 13:44:16 2012 From: turam at mcs.anl.gov (Thomas Uram) Date: Fri, 3 Feb 2012 13:44:16 -0600 Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk, coasters, ssh-cl:pbs) In-Reply-To: <1328297827.22991.0.camel@blabla> References: <7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov> <1328297827.22991.0.camel@blabla> Message-ID: <25EDAA74-EA67-4D4D-B1CB-3ECF30088186@mcs.anl.gov> No I didn't set GLOBUS_HOSTNAME. The address it complains about (206.12.24.2) is publicly reachable. So is the hostname of the machine on which I'm running Swift (fl.ci.uchicago.edu). I was wondering about the jumble that follows the hostname:port in that URL: >> Failed to start channel GSSCChannel-https://206.12.24.2:35836(2)[1544213635: {}] On Feb 3, 2012, at 1:37 PM, Mihael Hategan wrote: > On Fri, 2012-02-03 at 13:35 -0600, Thomas Uram wrote: >> That doesn't appear to help in my case. >> >> Should the hostname in the URL here concern me? > > It should. Or rather said "no route to host" should. Did you set > GLOBUS_HOSTNAME on the client side to the public IP of the client > machine? > > Mihael
From hategan at mcs.anl.gov Fri Feb 3 13:54:03 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 03 Feb 2012 11:54:03 -0800 Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk, coasters, ssh-cl:pbs) In-Reply-To: <25EDAA74-EA67-4D4D-B1CB-3ECF30088186@mcs.anl.gov> References: <7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov> <1328297827.22991.0.camel@blabla> <25EDAA74-EA67-4D4D-B1CB-3ECF30088186@mcs.anl.gov> Message-ID: <1328298843.3200.2.camel@blabla> On Fri, 2012-02-03 at 13:44 -0600, Thomas Uram wrote: > No I didn't set GLOBUS_HOSTNAME. The address it complains about > (206.12.24.2) is publicly reachable. So is the hostname of the machine > on which I'm running Swift (fl.ci.uchicago.edu). They should be the same! (i.e. the coaster service tries to connect back to the machine you're running Swift on). Can you try setting GLOBUS_HOSTNAME and see what happens? > > > I was wondering about the jumble that follows the hostname:port in > that URL: > > > > > Failed to start channel > > > GSSCChannel-https://206.12.24.2:35836(2)[1544213635: {}] (2) is the channel ID [15...] is the channel context They are not part of the IP address, but part of GSSChannel.toString().
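The breakdown of the channel description given above (address and port, then the channel ID in parentheses, then the channel context in brackets) can be sketched in a few lines of Python. This is only an illustration: the pattern is inferred from the log lines quoted in this thread, not from the CoG source, and the name parse_channel_description is hypothetical.

```python
import re

# Shape inferred from the log lines in this thread (an assumption, not the CoG API):
#   KIND-SCHEME://HOST:PORT(CHANNEL_ID)[CONTEXT]
CHANNEL_RE = re.compile(
    r"^(?P<kind>\w+)-(?P<scheme>\w+)://(?P<host>[^:/\s]+):(?P<port>\d+)"
    r"\((?P<channel_id>\d+)\)\[(?P<context>.*)\]$"
)


def parse_channel_description(text):
    """Split a channel description string into its named parts.

    Hypothetical helper: raises ValueError if the text does not have the
    shape assumed above.
    """
    match = CHANNEL_RE.match(text.strip())
    if match is None:
        raise ValueError("unrecognized channel description: %r" % text)
    fields = match.groupdict()
    # Port and channel ID are numeric; the context is kept as raw text.
    fields["port"] = int(fields["port"])
    fields["channel_id"] = int(fields["channel_id"])
    return fields


if __name__ == "__main__":
    sample = "GSSCChannel-https://206.12.24.2:35836(2)[1544213635: {}]"
    for key, value in parse_channel_description(sample).items():
        print(key, "=", value)
```

Only the host and port fields matter for the "No route to host" error; the ID and context are bookkeeping added when the channel is printed.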
From turam at mcs.anl.gov Fri Feb 3 14:02:53 2012 From: turam at mcs.anl.gov (Thomas Uram) Date: Fri, 3 Feb 2012 14:02:53 -0600 Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk, coasters, ssh-cl:pbs) In-Reply-To: <1328298843.3200.2.camel@blabla> References: <7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov> <1328297827.22991.0.camel@blabla> <25EDAA74-EA67-4D4D-B1CB-3ECF30088186@mcs.anl.gov> <1328298843.3200.2.camel@blabla> Message-ID:

I have done this without success:

GLOBUS_HOSTNAME=fl.ci.uchicago.edu
GLOBUS_TCP_PORT_RANGE=50000,50100
swiftt -sites.file sites.coasters.xml -tc.file tc.data hostname.swift
Swift trunk swift-r5501 (swift modified locally) cog-r3350 (cog modified locally)

RunID: 20120203-1357-8tekc3f7
Progress: time: Fri, 03 Feb 2012 13:57:24 -0600
Progress: time: Fri, 03 Feb 2012 13:57:31 -0600 Selecting site:4 Initializing site shared directory:1 Stage in:1
ssh not set, setting to 'gsissh'
ssh=gsissh
Find: https://206.12.24.2:38675
Find: keepalive(120), reconnect - https://206.12.24.2:38675
Progress: time: Fri, 03 Feb 2012 13:57:35 -0600 Selecting site:4 Submitting:1 Submitted:1
Failed to transfer wrapper log for job hostname-1jnudkmk
Progress: time: Fri, 03 Feb 2012 13:57:38 -0600 Selecting site:3 Stage in:1 Failed but can retry:2
Failed to transfer wrapper log for job hostname-2jnudkmk
Failed to transfer wrapper log for job hostname-4jnudkmk
Progress: time: Fri, 03 Feb 2012 13:57:54 -0600 Selecting site:3 Failed but can retry:3
Progress: time: Fri, 03 Feb 2012 13:57:57 -0600 Selecting site:2 Stage in:1 Failed but can retry:3
Failed to transfer wrapper log for job hostname-7jnudkmk
No events in 10s.

Registered futures:
----

Waiting threads:
----

No events in 10s.

Registered futures:
----

Waiting threads:
----

** Ctrl-C here **

Progress: time: Fri, 03 Feb 2012 13:58:24 -0600 Selecting site:2 Failed but can retry:4
Failed to shut down service https://206.12.24.2:38675
org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://206.12.24.2:38675(6)[69518356: {}]
    at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:103)
    at org.globus.cog.karajan.workflow.service.channels.GSSChannel.start(GSSChannel.java:62)
    at org.globus.cog.karajan.workflow.service.ChannelFactory.newChannel(ChannelFactory.java:55)
    at org.globus.cog.karajan.workflow.service.Client.connect(Client.java:116)
    at org.globus.cog.karajan.workflow.service.Client.newClient(Client.java:72)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:236)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:256)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:217)
    at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager$ServiceReaper.run(ServiceManager.java:430)
Caused by: java.net.NoRouteToHostException: No route to host
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
    at java.net.Socket.connect(Socket.java:529)
    at java.net.Socket.connect(Socket.java:478)
    at java.net.Socket.<init>(Socket.java:375)
    at java.net.Socket.<init>(Socket.java:276)
    at org.globus.net.SocketFactory.createSocket(SocketFactory.java:74)
    at org.globus.net.SocketFactory.createSocket(SocketFactory.java:53)
    at org.globus.gsi.gssapi.net.GssSocket.<init>(GssSocket.java:56)
    at org.globus.gsi.gssapi.net.impl.GSIGssSocket.<init>(GSIGssSocket.java:29)
    at org.globus.gsi.gssapi.net.impl.GSIGssSocketFactory.createSocket(GSIGssSocketFactory.java:38)
    at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:89)
    ... 8 more

Full log here:
http://www.mcs.anl.gov/~turam/20120203-1401/hostname-20120203-1357-8tekc3f7.log

On Feb 3, 2012, at 1:54 PM, Mihael Hategan wrote:

> On Fri, 2012-02-03 at 13:44 -0600, Thomas Uram wrote:
>> No I didn't set GLOBUS_HOSTNAME. The address it complains about
>> (206.12.24.2) is publicly reachable. So is the hostname of the machine
>> on which I'm running Swift (fl.ci.uchicago.edu).
>
> They should be the same! (i.e. the coaster service tries to connect back
> to the machine you're running Swift on).
>
> Can you try setting GLOBUS_HOSTNAME and see what happens?
>
>>
>> I was wondering about the jumble that follows the hostname:port in
>> that URL:
>>
>>>> Failed to start channel
>>>> GSSCChannel-https://206.12.24.2:35836(2)[1544213635: {}]
>
> (2) is the channel ID
> [15...] is the channel context
> They are not part of the IP address, but part of GSSChannel.toString().
>

From hategan at mcs.anl.gov Fri Feb 3 14:29:52 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 03 Feb 2012 12:29:52 -0800 Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk, coasters, ssh-cl:pbs) In-Reply-To: References: <7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov> <1328297827.22991.0.camel@blabla> <25EDAA74-EA67-4D4D-B1CB-3ECF30088186@mcs.anl.gov> <1328298843.3200.2.camel@blabla> Message-ID: <1328300992.4145.0.camel@blabla>

Ok, so maybe the ssh-cl provider doesn't properly forward environment variables. I'll double check that.
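As an illustration of the suspected problem: a command started through a plain ssh invocation does not generally inherit the caller's environment, so a variable such as GLOBUS_HOSTNAME only reaches the remote side if it is passed explicitly. The Python sketch below shows one generic way to do that by prefixing the remote command with env assignments. It is not the ssh-cl provider's code; the helper name remote_command_with_env and the command start-service are placeholders made up for this example.

```python
import shlex


def remote_command_with_env(command, env, names):
    """Prefix `command` with `env NAME=value ...` for the selected variables.

    Hypothetical helper: only variables actually present in `env` are
    forwarded, and each value is shell-quoted.
    """
    assignments = [
        "{}={}".format(name, shlex.quote(env[name]))
        for name in names
        if name in env
    ]
    if not assignments:
        return command
    return " ".join(["env"] + assignments + [command])


if __name__ == "__main__":
    local_env = {
        "GLOBUS_HOSTNAME": "fl.ci.uchicago.edu",
        "GLOBUS_TCP_PORT_RANGE": "50000,50100",
    }
    # "start-service" is a placeholder, not a real Swift or CoG command.
    print(remote_command_with_env(
        "start-service",
        local_env,
        ["GLOBUS_HOSTNAME", "GLOBUS_TCP_PORT_RANGE", "X509_USER_PROXY"],
    ))
```

The resulting string would be handed to ssh as the remote command. Variables that are not set locally (X509_USER_PROXY in this example) are skipped rather than forwarded as empty values.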
From wilde at mcs.anl.gov Sat Feb 4 11:04:48 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 4 Feb 2012 11:04:48 -0600 (CST) Subject: [Swift-devel] Fwd: Google Summer of Code 2012 Announced In-Reply-To: Message-ID: <1201124034.213374.1328375088630.JavaMail.root@zimbra.anl.gov> ----- Forwarded Message ----- From: "Borja Sotomayor" To: "globus-dev" Cc: "Michael Wilde" , bresnaha at mcs.anl.gov Sent: Saturday, February 4, 2012 10:46:08 AM Subject: Fwd: Google Summer of Code 2012 Announced Hi all, fyi, Google Summer of Code 2012 has just been announced. Applications to become a Mentoring Organization are due on March 9th.
---------- Forwarded message ---------- From: Carol Smith Date: Sat, Feb 4, 2012 at 10:43 AM Subject: Google Summer of Code 2012 Announced To: Google Summer of Code Announce Hi all, We're pleased to announce that Google Summer of Code will be happening for its eighth year this year. Please check out the blog post [1] about the program and read the FAQs [2] and Timeline [3] on Melange for more information. [1] - http://google-opensource.blogspot.com/2012/02/google-summer-of-code-2012-is-on.html [2] - http://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2012/faqs [3] - http://www.google-melange.com/gsoc/events/google/gsoc2012 Cheers, Carol -- You received this message because you are subscribed to the Google Groups "Google Summer of Code Announce" group. To post to this group, send email to google-summer-of-code-announce at googlegroups.com. To unsubscribe from this group, send email to google-summer-of-code-announce+unsubscribe at googlegroups.com. For more options, visit this group at http://groups.google.com/group/google-summer-of-code-announce?hl=en. -- Borja Sotomayor Researcher, Computation Institute Lecturer, Department of Computer Science University of Chicago http://people.cs.uchicago.edu/~borja/ Community Manager, OpenNebula project http://www.opennebula.org/ -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sat Feb 4 20:16:37 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 04 Feb 2012 18:16:37 -0800 Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk, coasters, ssh-cl:pbs) In-Reply-To: <1328300992.4145.0.camel@blabla> References: <7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov> <1328297827.22991.0.camel@blabla> <25EDAA74-EA67-4D4D-B1CB-3ECF30088186@mcs.anl.gov> <1328298843.3200.2.camel@blabla> <1328300992.4145.0.camel@blabla> Message-ID: <1328408197.14297.0.camel@blabla> Yep, it didn't.
Fixed in latest trunk. Let me know if the problem persists.

On Fri, 2012-02-03 at 12:29 -0800, Mihael Hategan wrote:
> Ok, so maybe the ssh-cl provider doesn't properly forward environment
> variables. I'll double check that.
>
> On Fri, 2012-02-03 at 14:02 -0600, Thomas Uram wrote:
> > I have done this without success:
> >
> > GLOBUS_HOSTNAME=fl.ci.uchicago.edu
> > GLOBUS_TCP_PORT_RANGE=50000,50100
> > swiftt -sites.file sites.coasters.xml -tc.file tc.data hostname.swift
> > Swift trunk swift-r5501 (swift modified locally) cog-r3350 (cog modified locally)

From davidk at ci.uchicago.edu Mon Feb 6 07:56:07 2012
From: davidk at ci.uchicago.edu (David Kelly)
Date: Mon, 6 Feb 2012 07:56:07 -0600 (CST)
Subject: [Swift-devel] merge 0.93 -> trunk
In-Reply-To:
Message-ID: <667869279.105106.1328536567589.JavaMail.root@zimbra-mb2.anl.gov>

For the most part, the tests seem to be going pretty well.

There's a group of tests called language-behaviour/cleanup in which the post-test cleanup scripts are failing. These tests are not in 0.93 for comparison, so I'm not sure if the problem is with some expected cleanup behavior or with the tests themselves. Does anyone know more about these?

The other failure is related to the sequential iteration script. I believe this is related to some language behavior changes in this release. The script below fails to compile:

---
type counterfile;

app (counterfile t) echo(string m) {
    echo m stdout=@filename(t);
}

app (counterfile t) countstep(counterfile i) {
    wcl @filename(i) @filename(t);
}

counterfile a[] ;

a[0] = echo("793578934574893");

iterate v {
    a[v+1] = countstep(a[v]);
    trace("extract int value ", @extractint(a[v+1]));
} until (@extractint(a[v+1]) <= 1);
---

Could not start execution:
Failed to convert .xml to .kml for sequential_iteration.swift:
null

Other than those two issues, things look pretty good. All other tests have been passing consistently for the last few days.
David ----- Original Message ----- > From: "Jonathan Monette" > To: "Mihael Hategan" > Cc: "David Kelly" , "Swift Devel" > Sent: Monday, January 30, 2012 8:45:06 AM > Subject: Re: [Swift-devel] merge 0.93 -> trunk > I am seeing the same error when trying to compile trunk. > > On Jan 29, 2012, at 6:15 PM, Mihael Hategan wrote: > > > Maybe the checkout happened in the middle of a commit? > > > > Is anybody seeing this with a clean checkout? > > > > On Sun, 2012-01-29 at 16:07 -0600, David Kelly wrote: > >> It looks like the compile failed and the test did not run last > >> night. Here is the error I am getting: > >> > >> [javac] > >> /swift/swift-trunk/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/LocalTCPService.java:29: > >> org.globus.cog.abstraction.coaster.service.LocalTCPService is > >> not abstract and does not override abstract method > >> registrationReceived(java.lang.String,java.lang.String,org.globus.cog.karajan.workflow.service.channels.KarajanChannel,java.util.Map) > >> in org.globus.cog.abstraction.coaster.service.Registering > >> [javac] public class LocalTCPService extends GSSService > >> implements Registering { > >> [javac] ^ > >> [javac] > >> /swift/swift-trunk/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/LocalTCPService.java:64: > >> registrationReceived(java.lang.String,java.lang.String,java.lang.String,org.globus.cog.karajan.workflow.service.channels.ChannelContext,java.util.Map) > >> in > >> org.globus.cog.abstraction.coaster.service.RegistrationManager > >> cannot be applied to > >> (java.lang.String,java.lang.String,java.lang.String,org.globus.cog.karajan.workflow.service.channels.ChannelContext) > >> [javac] registrationManager.registrationReceived(blockid, wid, > >> url, cc); > >> [javac] ^ > >> [javac] Note: > >> /swift/swift-trunk/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Block.java > >> uses or overrides a deprecated API. 
> >> [javac] Note: Recompile with -Xlint:deprecation for details. > >> [javac] Note: > >> /swift/swift-trunk/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BQPStatusHandler.java > >> uses unchecked or unsafe operations. > >> [javac] Note: Recompile with -Xlint:unchecked for details. > >> [javac] 2 errors > >> > >> BUILD FAILED > >> /swift/swift-trunk/cog/modules/swift/build.xml:73: The following > >> error occurred while executing this line: > >> /swift/swift-trunk/cog/mbuild.xml:445: The following error occurred > >> while executing this line: > >> /swift/swift-trunk/cog/mbuild.xml:79: The following error occurred > >> while executing this line: > >> /swift/swift-trunk/cog/mbuild.xml:52: The following error occurred > >> while executing this line: > >> /swift/swift-trunk/cog/modules/swift/dependencies.xml:13: The > >> following error occurred while executing this line: > >> /swift/swift-trunk/cog/mbuild.xml:163: The following error occurred > >> while executing this line: > >> /swift/swift-trunk/cog/mbuild.xml:168: The following error occurred > >> while executing this line: > >> /swift/swift-trunk/cog/modules/provider-coaster/build.xml:59: The > >> following error occurred while executing this line: > >> /swift/swift-trunk/cog/mbuild.xml:466: The following error occurred > >> while executing this line: > >> /swift/swift-trunk/cog/mbuild.xml:229: Compile failed; see the > >> compiler error output for details. > >> > >> > >> > >> ----- Original Message ----- > >>> From: "Michael Wilde" > >>> To: "Mihael Hategan" > >>> Cc: "Swift Devel" > >>> Sent: Sunday, January 29, 2012 10:27:18 AM > >>> Subject: Re: [Swift-devel] merge 0.93 -> trunk > >>> Excellent - thanks! David, can you tell us how the nightly tests > >>> in > >>> trunk were affected by the integration? 
> >>> > >>> - Mike > >>> > >>> ----- Original Message ----- > >>>> From: "Mihael Hategan" > >>>> To: "Swift Devel" > >>>> Sent: Saturday, January 28, 2012 11:01:54 PM > >>>> Subject: [Swift-devel] merge 0.93 -> trunk > >>>> Did the merge. I still need to do some sanity checks, so it may > >>>> be > >>>> shaky > >>>> at the moment. > >>>> > >>>> Mihael > >>>> > >>>> _______________________________________________ > >>>> Swift-devel mailing list > >>>> Swift-devel at ci.uchicago.edu > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >>> > >>> -- > >>> Michael Wilde > >>> Computation Institute, University of Chicago > >>> Mathematics and Computer Science Division > >>> Argonne National Laboratory > >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Mon Feb 6 12:23:28 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Feb 2012 10:23:28 -0800 Subject: [Swift-devel] merge 0.93 -> trunk In-Reply-To: <667869279.105106.1328536567589.JavaMail.root@zimbra-mb2.anl.gov> References: <667869279.105106.1328536567589.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1328552608.26929.1.camel@blabla> Cool. I didn't mess up too much then :) On Mon, 2012-02-06 at 07:56 -0600, David Kelly wrote: > For the most part, the tests seems to be going pretty well. > > There's a group of tests called language-behaviour/cleanup in which the post-test cleanup scripts are failing. These tests are not in 0.93 for comparison.. not sure if the problem is with some expected cleanup behavior, or with the tests themselves. Does anyone know more about these? 

From turam at mcs.anl.gov Mon Feb 6 17:38:13 2012
From: turam at mcs.anl.gov (Thomas Uram)
Date: Mon, 6 Feb 2012 17:38:13 -0600
Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk, coasters, ssh-cl:pbs)
In-Reply-To: <1328408197.14297.0.camel@blabla>
References: <7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov> <1328297827.22991.0.camel@blabla> <25EDAA74-EA67-4D4D-B1CB-3ECF30088186@mcs.anl.gov> <1328298843.3200.2.camel@blabla> <1328300992.4145.0.camel@blabla> <1328408197.14297.0.camel@blabla>
Message-ID: <08E19F6E-C455-4C62-A6A8-7AD85D2F98EE@mcs.anl.gov>

Okay, with this version, my job succeeded:

http://www.mcs.anl.gov/~turam/20120206-1731/hostname-20120206-1603-am03uzbb.log

This requires that GLOBUS_TCP_PORT_RANGE be set properly so the bootstrap service is started where it can be reached.
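
One way to sanity-check a port range like this before launching a run is to confirm that the requested ports can at least be bound on the submit host. The following is a minimal Python sketch and is not part of Swift or CoG: `parse_port_range` and `bindable_ports` are hypothetical helper names, and the `min,max` format is assumed from the `GLOBUS_TCP_PORT_RANGE=50000,50100` setting quoted earlier in this thread. It only tests local binding; it cannot tell whether a firewall between the cluster and the submit host blocks those ports.

```python
import socket


def parse_port_range(value):
    """Parse a 'min,max' string such as '50000,50100' into (low, high)."""
    low, high = (int(part) for part in value.split(","))
    if not (0 < low <= high <= 65535):
        raise ValueError("invalid port range: %r" % (value,))
    return low, high


def bindable_ports(low, high, host=""):
    """Return the ports in [low, high] that can be bound locally right now."""
    free = []
    for port in range(low, high + 1):
        try:
            # Bind and immediately release; the socket closes on leaving 'with'.
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                sock.bind((host, port))
        except OSError:
            continue  # in use, privileged, or otherwise unavailable
        free.append(port)
    return free
```

Calling `bindable_ports(*parse_port_range("50000,50100"))` on the machine running Swift would list the ports in that range that are free there; reachability from the remote side still has to be checked separately.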
I do get the original error message a number of times: Caused by: org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://206.12.24.2:34724(2)[1625488363: {}] Caused by: java.net.NoRouteToHostException: No route to host It seems to start and stop the coaster service on a variety of ports, one of which eventually succeeds. I don't have documentation to tell me the open port range on the target cluster (I'll get it), but in the meantime, I've discovered some ports that work. Can I specify the port range to be used for the coaster service? I've seen some discussion on the mailing lists about doing so in the context of "coaster-service". At the moment, I'm just running Swift with the configuration you see in the log above. Can I specify the port in my case, or should I use the "coaster-service" script instead? Thanks! Tom On Feb 4, 2012, at 8:16 PM, Mihael Hategan wrote: > Yep, it didn't. Fixed in latest trunk. Let me know if the problem > persists. > > On Fri, 2012-02-03 at 12:29 -0800, Mihael Hategan wrote: >> Ok, so maybe the ssh-cl provider doesn't properly forward environment >> variables. I'll double check that. >> >> On Fri, 2012-02-03 at 14:02 -0600, Thomas Uram wrote: >>> I have done this without success: >>> >>> GLOBUS_HOSTNAME=fl.ci.uchicago.edu >>> GLOBUS_TCP_PORT_RANGE=50000,50100 >>> swiftt -sites.file sites.coasters.xml -tc.file tc.data hostname.swift >>> Swift trunk swift-r5501 (swift modified locally) cog-r3350 (cog modified locally) > > From jonmon at mcs.anl.gov Fri Feb 10 15:03:53 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Fri, 10 Feb 2012 15:03:53 -0600 Subject: [Swift-devel] tc and sites file debugging Message-ID: What log4j properties do I have to turn on to see what the path is to the tc and sites file I am using in a Swift run? I keep getting an error saying that the application is not in my tc file for any of the site pool entries. 
I just want to see if Swift is grabbing the right files.

From wozniak at mcs.anl.gov Fri Feb 10 15:14:09 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Fri, 10 Feb 2012 15:14:09 -0600 (Central Standard Time)
Subject: [Swift-devel] tc and sites file debugging
In-Reply-To:
References:
Message-ID:

Just set:

log4j.logger.swift.textfiles=DEBUG

--
Justin M Wozniak

From jonmon at mcs.anl.gov Fri Feb 10 15:16:33 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 10 Feb 2012 15:16:33 -0600
Subject: [Swift-devel] tc and sites file debugging
In-Reply-To:
References:
Message-ID:

It is set. So, if no sites file or tc file shows up in the logs, that is a pretty good indication that they defaulted to the ones provided by Swift, correct?

From wozniak at mcs.anl.gov Fri Feb 10 15:19:53 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Fri, 10 Feb 2012 15:19:53 -0600 (Central Standard Time)
Subject: [Swift-devel] tc and sites file debugging
In-Reply-To:
References:
Message-ID:

There should be a message for that case as well. Which branch are you using?

--
Justin M Wozniak

From jonmon at mcs.anl.gov Fri Feb 10 15:20:17 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 10 Feb 2012 15:20:17 -0600
Subject: [Swift-devel] tc and sites file debugging
In-Reply-To:
References:
Message-ID:

0.93

From wozniak at mcs.anl.gov Fri Feb 10 15:41:45 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Fri, 10 Feb 2012 15:41:45 -0600 (Central Standard Time)
Subject: [Swift-devel] tc and sites file debugging
In-Reply-To:
References:
Message-ID:

Using branches/release-0.93, I find that the sites and tc files are included in the log. If you use the default location you just get the path name.

--
Justin M Wozniak

From jonmon at mcs.anl.gov Fri Feb 10 16:08:10 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 10 Feb 2012 16:08:10 -0600
Subject: [Swift-devel] tc and sites file debugging
In-Reply-To:
References:
Message-ID:

So this turned out to be a sites file XML formatting issue. Filed as bug 732.

From wilde at mcs.anl.gov Fri Feb 10 16:08:54 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 10 Feb 2012 16:08:54 -0600 (CST)
Subject: [Swift-devel] Useful guid to Cray PBS submit files
Message-ID: <1867549921.236830.1328911734948.JavaMail.root@zimbra.anl.gov>

http://www.nersc.gov/users/computational-systems/hopper/running-jobs/example-batch-scripts/

From wilde at mcs.anl.gov Mon Feb 13 10:02:57 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 13 Feb 2012 10:02:57 -0600 (CST)
Subject: [Swift-devel] Does anyone have a newer/working svn on Beagle?
In-Reply-To: <2143442693.240566.1329148680202.JavaMail.root@zimbra.anl.gov>
Message-ID: <520551894.240585.1329148977478.JavaMail.root@zimbra.anl.gov>

Hi All,

Is there a more recent (1.6++) version of svn available on Beagle than the default 1.5.7? If not, can anyone install one?

If not, I'll file this as a Beagle ticket.

Thanks,

- Mike

I get this when trying to use svn on a dir checked out with 1.6:

login2$ svn up

svn: This client is too old to work with working copy '.'. You need
to get a newer Subversion client, or to downgrade this working copy.
See http://subversion.tigris.org/faq.html#working-copy-format-change
for details.

login2$ svn --version

svn, version 1.5.7 (r36142)
compiled Jun 7 2011, 12:23:36

login2$ which svn
/usr/bin/svn
login2$

From benc at hawaga.org.uk Mon Feb 13 14:43:05 2012
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 13 Feb 2012 20:43:05 +0000
Subject: [Swift-devel] Does anyone have a newer/working svn on Beagle?
In-Reply-To: <520551894.240585.1329148977478.JavaMail.root@zimbra.anl.gov>
References: <520551894.240585.1329148977478.JavaMail.root@zimbra.anl.gov>
Message-ID: <5C8731F4-5D8F-455D-8834-99F867500E69@hawaga.org.uk>

On Feb 13, 2012, at 4:02 PM, Michael Wilde wrote:

> Hi All,
>
> Is there a more recent (1.6++) version of svn available on Beagle than the default 1.5.7? If not, can anyone install one?
>
> If not, I'll file this as a Beagle ticket.
>

I think you can work around this by making the original checkout with that version of SVN.

It's bugged me in the past a few times when I've moved svn checkouts from one machine to another with NFS or rsync.

Ben

From wilde at mcs.anl.gov Wed Feb 15 22:30:56 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 15 Feb 2012 22:30:56 -0600 (CST)
Subject: [Swift-devel] Beagle swift module out pf date
Message-ID: <2043049884.10276.1329366656393.JavaMail.root@zimbra.anl.gov>

Why is the Beagle swift module loading RC5?
login2$ module load swift Swift version swift-0.93RC5 loaded login2$ which swift /soft/swift/swift-0.93RC5/bin/swift login2$ - Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From davidk at ci.uchicago.edu Thu Feb 16 01:38:38 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Thu, 16 Feb 2012 01:38:38 -0600 (CST) Subject: [Swift-devel] Beagle swift module out pf date In-Reply-To: <2043049884.10276.1329366656393.JavaMail.root@zimbra.anl.gov> Message-ID: <1121747705.121490.1329377918828.JavaMail.root@zimbra-mb2.anl.gov> Beagle should be using the 0.93 release now. I'll try to update the other CI/ANL machines tomorrow. ----- Original Message ----- > From: "Michael Wilde" > To: "Ketan Maheshwari" , "David Kelly" > Cc: "Swift Devel" > Sent: Wednesday, February 15, 2012 10:30:56 PM > Subject: Beagle swift module out pf date > Why is the Beagle swift module loading RC5? > > login2$ module load swift > Swift version swift-0.93RC5 loaded > login2$ which swift > /soft/swift/swift-0.93RC5/bin/swift > login2$ > > > - Mike > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory From wilde at mcs.anl.gov Fri Feb 17 08:23:48 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 17 Feb 2012 08:23:48 -0600 (CST) Subject: [Swift-devel] Agenda for Swift devel meeting today In-Reply-To: <605162052.2791.1329487757438.JavaMail.root@zimbra.anl.gov> Message-ID: <1247532943.2827.1329488628397.JavaMail.root@zimbra.anl.gov> Here's what I have so far for discussion today. Please add info, points, or more topics. 

- coaster provider staging timeouts
  -- reproduce on same topology as SCEC bugs occurred
  -- discuss and test: do we need TCP window control
  -- longer term: test how gridftp works in same topology

- coaster timeouts
  -- execution doesn't continue and recover on coaster worker time walltime expiration
  -- subtler bug: failing retryable jobs have strange interaction with hang checker (still need to reproduce this in a test case; lower prio)

- hang checker: can we help user diagnose these faster/easier?
  -- what's in the current log for this

- IO strategy improvements
  -- CDM as a default
  -- provider staging selectable
  -- staging via worker-side transfer client (esp. globus-url-copy)

- BG/P
  -- what are known issues?
  -- test problems with _concurrent mapping

- gensites
  -- also allow cmd line setting of params
  -- SciColSim suggests we should generalize its run script into "swiftrun"
  -- next steps on tc.data ->
       apps typically find apps in path
       more wildcards to reduce need to set this file
       interaction with sites file

- tryswift
  -- report from David on FutureGrid execution environment for this
  -- obstacles?

- Please suggest additional topics!

Thanks,

- Mike

From hategan at mcs.anl.gov Sat Feb 18 18:07:03 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 18 Feb 2012 16:07:03 -0800
Subject: [Swift-devel] emails
Message-ID: <1329610023.25129.1.camel@blabla>

Hmm, so my otherwise very reliable (until now) email notifier stopped working yesterday or so. It took me a while to start wondering why I'm not seeing any new emails.

So sorry for not replying to things yesterday and today so far.

Mihael

From jonmon at mcs.anl.gov Sun Feb 19 18:08:10 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Sun, 19 Feb 2012 18:08:10 -0600
Subject: [Swift-devel] Walltime exceeded error
Message-ID:

Hello,
So I have been spending the better part of today trying to reproduce this maxwalltime issue we have been witnessing.
The most recent run I ran is at /home/jonmon/PADS/Swift/tests/catsnsleep

This run does not produce the issue. In fact, it shows that the workers shut down and the restart takes over. It also shows that 120 jobs failed, but I believe that is because the retries were exceeded on those jobs.

The run in question, where this was witnessed, was on Beagle and is located at /home/jonmon/public_html/Swift/bugs/SciColSim/run002. There is a log file in that directory that you should be able to view to see the issue and perhaps clarify why the execution just hung and made no progress. We thought that the job would be killed and then retried once the wall time exceeded what we provided. It looks like the job was killed but was not restarted. This script is very complicated but does produce the issue when run long enough.

Maybe Mihael can provide some insight as to what was going on in the code when it hung on Beagle, as the hang checker never kicked in, so Swift thought it was doing something to make progress when in fact it was not. Perhaps this issue is Beagle-specific (not sure what that means). I am going to try the same scale of run on PADS and see if it completes (although it may take longer, as PADS does not have the computing power that Beagle does).

From hategan at mcs.anl.gov Sun Feb 19 18:14:10 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 19 Feb 2012 16:14:10 -0800
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To:
References:
Message-ID: <1329696850.31828.0.camel@blabla>

Thanks. I'll take a look at the logs and see if anything pops up.

On Sun, 2012-02-19 at 18:08 -0600, Jonathan Monette wrote:
> Hello,
> So I have been spending the better part of today trying to reproduce this maxwalltime issue we have been witnessing. The most recent run I ran is at /home/jonmon/PADS/Swift/tests/catsnsleep
>
> This run does not produce the issue. In fact, it shows that the workers shut down and the restart takes over. It also shows that 120 jobs failed, but I believe that is because the retries were exceeded on those jobs.
>
> The run in question, where this was witnessed, was on Beagle and is located at /home/jonmon/public_html/Swift/bugs/SciColSim/run002. There is a log file in that directory that you should be able to view to see the issue and perhaps clarify why the execution just hung and made no progress. We thought that the job would be killed and then retried once the wall time exceeded what we provided. It looks like the job was killed but was not restarted. This script is very complicated but does produce the issue when run long enough.
>
> Maybe Mihael can provide some insight as to what was going on in the code when it hung on Beagle, as the hang checker never kicked in, so Swift thought it was doing something to make progress when in fact it was not. Perhaps this issue is Beagle-specific (not sure what that means). I am going to try the same scale of run on PADS and see if it completes (although it may take longer, as PADS does not have the computing power that Beagle does).
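
The behavior expected in the message above (kill a job once it exceeds its walltime, then retry it until the retry limit is reached) can be illustrated with a minimal Python sketch. This is not Swift or coaster code: `run_with_walltime`, `walltime_s`, and `max_retries` are hypothetical names chosen for illustration, and the sketch assumes a local process rather than a PBS job.

```python
import subprocess
import sys


def run_with_walltime(cmd, walltime_s, max_retries):
    """Run cmd, killing it if it exceeds walltime_s seconds.

    A killed or failed job is retried up to max_retries times.
    Returns (succeeded, attempts_made).
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        try:
            result = subprocess.run(cmd, timeout=walltime_s)
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed the child; retry it.
            continue
        if result.returncode == 0:
            return True, attempts
    # Retries exhausted: the job is reported as failed.
    return False, attempts


if __name__ == "__main__":
    # A stand-in "app" that always runs longer than its walltime.
    slow_job = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_walltime(slow_job, walltime_s=0.5, max_retries=2))
```

With a retry limit of 2, a job that always overruns is attempted three times and then reported as failed, which is consistent with jobs being counted as failed once their retries are exhausted. A hang, by contrast, would correspond to the loop never starting the next attempt after the kill.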
From wilde at mcs.anl.gov Mon Feb 20 09:35:33 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 20 Feb 2012 09:35:33 -0600 (CST)
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To:
Message-ID: <2037160806.8873.1329752133523.JavaMail.root@zimbra.anl.gov>

Jon, can you try another run on PADS with these changes:

- 1 slot instead of 192 to keep the log much smaller
- n=20 instead of 1000 (ditto)
- t=70 to make sure that the app() runtime exceeds the specified maxwalltime by a sufficient margin
- local:pbs instead of ssh:pbs to stay closer to the config where the problem occurred
- beagle if possible (one node in the scalability or development queue) and same Java as used in the failing case

Mike

----- Original Message -----
> From: "Jonathan Monette"
> To: "Swift Devel"
> Sent: Sunday, February 19, 2012 6:08:10 PM
> Subject: [Swift-devel] Walltime exceeded error
> Hello,
> So I have been spending the better part of today trying to reproduce
> this maxwalltime issue we have been witnessing. The most recent run I
> ran is at /home/jonmon/PADS/Swift/tests/catsnsleep
>
> This run does not produce the issue. In fact it does show that the
> workers shutdown and restart takes over. It does show that there were
> 120 jobs failed but I believe that is because the retries were
> exceeded on those jobs.
>
> The run in question where this was being witnessed was on Beagle and
> is located at /home/jonmon/public_html/Swift/bugs/SciColSim/run002.
> There is a log file in that directory that you should be able to view
> and see the issue and perhaps clarify why the execution just hung and
> made no progress. We thought that the job would be killed and then
> retried once the wall time exceeded what we provided. It looks like
> the job was killed but was not restarted. This script is very
> complicated but does produce the issue when run long enough.
> > Maybe Mihael can provide some insight as to what was going in the code > when the code hung on Beagle as the hang checker never kicked in so > Swift thought it was doing something to make progress when in fact it > was not. Perhaps this issue is Beagle specific(not sure what that > means). I am going to try the same scale of a run on PADS and see if > it completes(although it may take longer as PADS does not have the > computing power that Beagle does. > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Mon Feb 20 10:11:24 2012 From: jonmon at mcs.anl.gov (Jonathan) Date: Mon, 20 Feb 2012 10:11:24 -0600 Subject: [Swift-devel] Walltime exceeded error In-Reply-To: <2037160806.8873.1329752133523.JavaMail.root@zimbra.anl.gov> References: <2037160806.8873.1329752133523.JavaMail.root@zimbra.anl.gov> Message-ID: <5A90CDC9-7516-4D4D-A407-B347E2CB17CB@mcs.anl.gov> Yes. I will. On Feb 20, 2012, at 9:35, Michael Wilde wrote: > Jon, can you try another run on PADS with these changes: > > - 1 slot instead of 192 to keep the log much smaller > - n=20 instead of 1000 (ditto) > - t=70 to make sure that the app() runtime exceeds the specified maxwalltime by enough > - local:pbs instead of ssh:pbs to stay closer to the config where the problem occurred > - beagle if possible (one node in the scalability or development queue) and same Java as used in the failing case > > Mike > > ----- Original Message ----- >> From: "Jonathan Monette" >> To: "Swift Devel" >> Sent: Sunday, February 19, 2012 6:08:10 PM >> Subject: [Swift-devel] Walltime exceeded error >> Hello, >> So I have been spending the better part of today trying to reproduce >> this maxwalltime issue we have been witnessing. 
The most recent run I >> ran is at /home/jonmon/PADS/Swift/tests/catsnsleep >> >> This run does not produce the issue. In face it does show that the >> workers shutdown and restart takes over. It does show that there were >> 120 jobs failed but I believe that is because the retries were >> exceeded on those jobs. >> >> The run in question where this was being witnessed was on Beagle and >> is located at /home/jonmon/public_html/Swift/bugs/SciColSim/run002. >> There is a log file in that directory that you should be able to view >> and see the issue and perhaps clarify why the execution just hung and >> made no progress. We though that the job would be killed and then >> retried once the wall time exceeded what we provided. It looks like >> the job was killed but was not restarted. This script is very >> complicated but does produce the issue when run long enough. >> >> Maybe Mihael can provide some insight as to what was going in the code >> when the code hung on Beagle as the hang checker never kicked in so >> Swift thought it was doing something to make progress when in fact it >> was not. Perhaps this issue is Beagle specific(not sure what that >> means). I am going to try the same scale of a run on PADS and see if >> it completes(although it may take longer as PADS does not have the >> computing power that Beagle does. >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From hategan at mcs.anl.gov Mon Feb 20 16:11:16 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 20 Feb 2012 14:11:16 -0800 Subject: [Swift-devel] Walltime exceeded error In-Reply-To: References: Message-ID: <1329775876.6072.1.camel@blabla> I can't log in to beagle. 
Can you move them to some place where I can access them? On Sun, 2012-02-19 at 18:08 -0600, Jonathan Monette wrote: > Hello, > So I have been spending the better part of today trying to reproduce this maxwalltime issue we have been witnessing. The most recent run I ran is at /home/jonmon/PADS/Swift/tests/catsnsleep > > This run does not produce the issue. In face it does show that the workers shutdown and restart takes over. It does show that there were 120 jobs failed but I believe that is because the retries were exceeded on those jobs. > > The run in question where this was being witnessed was on Beagle and is located at /home/jonmon/public_html/Swift/bugs/SciColSim/run002. There is a log file in that directory that you should be able to view and see the issue and perhaps clarify why the execution just hung and made no progress. We though that the job would be killed and then retried once the wall time exceeded what we provided. It looks like the job was killed but was not restarted. This script is very complicated but does produce the issue when run long enough. > > Maybe Mihael can provide some insight as to what was going in the code when the code hung on Beagle as the hang checker never kicked in so Swift thought it was doing something to make progress when in fact it was not. Perhaps this issue is Beagle specific(not sure what that means). I am going to try the same scale of a run on PADS and see if it completes(although it may take longer as PADS does not have the computing power that Beagle does. 
From jonmon at mcs.anl.gov Mon Feb 20 16:14:19 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Mon, 20 Feb 2012 16:14:19 -0600 Subject: [Swift-devel] Walltime exceeded error In-Reply-To: <1329775876.6072.1.camel@blabla> References: <1329775876.6072.1.camel@blabla> Message-ID: /gpfs/pads/swift/jonmon/Swift/tests/catsnsleep <----- on /gpfs/pads /home/jonmon/public_html/Swift/bugs/SciColSim/run002 <----- on any CI machine On Feb 20, 2012, at 4:11 PM, Mihael Hategan wrote: > I can't log in to beagle. Can you move them to some place where I can > access them? > > On Sun, 2012-02-19 at 18:08 -0600, Jonathan Monette wrote: >> Hello, >> So I have been spending the better part of today trying to reproduce this maxwalltime issue we have been witnessing. The most recent run I ran is at /home/jonmon/PADS/Swift/tests/catsnsleep >> >> This run does not produce the issue. In face it does show that the workers shutdown and restart takes over. It does show that there were 120 jobs failed but I believe that is because the retries were exceeded on those jobs. >> >> The run in question where this was being witnessed was on Beagle and is located at /home/jonmon/public_html/Swift/bugs/SciColSim/run002. There is a log file in that directory that you should be able to view and see the issue and perhaps clarify why the execution just hung and made no progress. We though that the job would be killed and then retried once the wall time exceeded what we provided. It looks like the job was killed but was not restarted. This script is very complicated but does produce the issue when run long enough. >> >> Maybe Mihael can provide some insight as to what was going in the code when the code hung on Beagle as the hang checker never kicked in so Swift thought it was doing something to make progress when in fact it was not. Perhaps this issue is Beagle specific(not sure what that means). 
I am going to try the same scale of a run on PADS and see if it completes(although it may take longer as PADS does not have the computing power that Beagle does. > > From hategan at mcs.anl.gov Mon Feb 20 16:16:34 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 20 Feb 2012 14:16:34 -0800 Subject: [Swift-devel] Walltime exceeded error In-Reply-To: References: <1329775876.6072.1.camel@blabla> Message-ID: <1329776194.6072.2.camel@blabla> On Mon, 2012-02-20 at 16:14 -0600, Jonathan Monette wrote: > /gpfs/pads/swift/jonmon/Swift/tests/catsnsleep <----- on /gpfs/pads > /home/jonmon/public_html/Swift/bugs/SciColSim/run002 <----- on any CI machine Ok. Sorry. I thought the last one was on beagle. From jonmon at mcs.anl.gov Mon Feb 20 16:19:45 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Mon, 20 Feb 2012 16:19:45 -0600 Subject: [Swift-devel] Walltime exceeded error In-Reply-To: <1329776194.6072.2.camel@blabla> References: <1329775876.6072.1.camel@blabla> <1329776194.6072.2.camel@blabla> Message-ID: No. The last run was run using Beagle. That is the more interesting one. That shows jobs failed but the "Failed but can retry" count was not printed very often. You can see that in the swift.out file. Eventually the workflow just hung and the hang checker kicked in. You can also see that Swift got stuck in the initializing state with a count of 61. On Feb 20, 2012, at 4:16 PM, Mihael Hategan wrote: > On Mon, 2012-02-20 at 16:14 -0600, Jonathan Monette wrote: >> /gpfs/pads/swift/jonmon/Swift/tests/catsnsleep <----- on /gpfs/pads >> /home/jonmon/public_html/Swift/bugs/SciColSim/run002 <----- on any CI machine > > Ok. Sorry. I thought the last one was on beagle. 
> From hategan at mcs.anl.gov Mon Feb 20 16:24:12 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 20 Feb 2012 14:24:12 -0800 Subject: [Swift-devel] Walltime exceeded error In-Reply-To: References: <1329775876.6072.1.camel@blabla> <1329776194.6072.2.camel@blabla> Message-ID: <1329776652.6072.3.camel@blabla> I'm not sure if I asked this, but did you happen to get a jstack of the hanging swift? On Mon, 2012-02-20 at 16:19 -0600, Jonathan Monette wrote: > No. The last run was run using Beagle. That is the more interesting one. That shows jobs failed but the "Failed but can retry" count was not printed very often. You can see that in the swift.out file. Eventually the workflow just hung and the hang checker kicked in. You can also see that Swift got stuck in the initializing state with a count of 61. > > On Feb 20, 2012, at 4:16 PM, Mihael Hategan wrote: > > > On Mon, 2012-02-20 at 16:14 -0600, Jonathan Monette wrote: > >> /gpfs/pads/swift/jonmon/Swift/tests/catsnsleep <----- on /gpfs/pads > >> /home/jonmon/public_html/Swift/bugs/SciColSim/run002 <----- on any CI machine > > > > Ok. Sorry. I thought the last one was on beagle. > > > From jonmon at mcs.anl.gov Mon Feb 20 16:26:49 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Mon, 20 Feb 2012 16:26:49 -0600 Subject: [Swift-devel] Walltime exceeded error In-Reply-To: <1329776652.6072.3.camel@blabla> References: <1329775876.6072.1.camel@blabla> <1329776194.6072.2.camel@blabla> <1329776652.6072.3.camel@blabla> Message-ID: No. This was a run Ketan did a while back. I have been using this as a reference when trying to re-create the issue with a simple catsnsleep job. This run was also done on Beagle using the pre-installed java package, which does not have jstack. On Feb 20, 2012, at 4:24 PM, Mihael Hategan wrote: > I'm not sure if I asked this, but did you happen to get a jstack of the > hanging swift? > > On Mon, 2012-02-20 at 16:19 -0600, Jonathan Monette wrote: >> No. 
The last run was run using Beagle. That is the more interesting one. That shows jobs failed but the "Failed but can retry" count was not printed very often. You can see that in the swift.out file. Eventually the workflow just hung and the hang checker kicked in. You can also see that Swift got stuck in the initializing state with a count of 61. >> >> On Feb 20, 2012, at 4:16 PM, Mihael Hategan wrote: >> >>> On Mon, 2012-02-20 at 16:14 -0600, Jonathan Monette wrote: >>>> /gpfs/pads/swift/jonmon/Swift/tests/catsnsleep <----- on /gpfs/pads >>>> /home/jonmon/public_html/Swift/bugs/SciColSim/run002 <----- on any CI machine >>> >>> Ok. Sorry. I thought the last one was on beagle. >>> >> > > From jonmon at mcs.anl.gov Mon Feb 20 16:27:30 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Mon, 20 Feb 2012 16:27:30 -0600 Subject: [Swift-devel] Walltime exceeded error In-Reply-To: References: <1329775876.6072.1.camel@blabla> <1329776194.6072.2.camel@blabla> <1329776652.6072.3.camel@blabla> Message-ID: <2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov> Correction, Beagle does have jstack. Do not know why I thought it did not have it. On Feb 20, 2012, at 4:26 PM, Jonathan Monette wrote: > No. This was a run Ketan did a while back. I have been using this as a reference when trying to re-create the issue with a simple catsnsleep job. > > This run was also done on Beagle using the pre-installed java package, which does not have jstack. > > On Feb 20, 2012, at 4:24 PM, Mihael Hategan wrote: > >> I'm not sure if I asked this, but did you happen to get a jstack of the >> hanging swift? >> >> On Mon, 2012-02-20 at 16:19 -0600, Jonathan Monette wrote: >>> No. The last run was run using Beagle. That is the more interesting one. That shows jobs failed but the "Failed but can retry" count was not printed very often. You can see that in the swift.out file. Eventually the workflow just hung and the hang checker kicked in. 
You can also see that Swift got stuck in the initializing state with a count of 61.
>>>
>>> On Feb 20, 2012, at 4:16 PM, Mihael Hategan wrote:
>>>
>>>> On Mon, 2012-02-20 at 16:14 -0600, Jonathan Monette wrote:
>>>>> /gpfs/pads/swift/jonmon/Swift/tests/catsnsleep <----- on /gpfs/pads
>>>>> /home/jonmon/public_html/Swift/bugs/SciColSim/run002 <----- on any CI machine
>>>>
>>>> Ok. Sorry. I thought the last one was on beagle.
>>>>
>>>
>>
>>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From ketancmaheshwari at gmail.com Mon Feb 20 16:42:10 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 20 Feb 2012 16:42:10 -0600
Subject: [Swift-devel] cybershake hangs in the latest Swift 0.93 does not hang in a previous rel
Message-ID:

Mihael,

Reporting a case of deadlock/hang occurring in the recent swift 0.93 update:

I've been working on the cybershake script with David today and it seems that the script hangs at around the same point on David's Swift installation, which is:
Swift 0.93 swift-r5658 cog-r3361

I successfully tested the same configuration with my swift installation, which is a slightly older release:
Swift 0.93 swift-r5609 (swift modified locally) cog-r3361 (cog modified locally)

The log for the hung run is:
http://ci.uchicago.edu/~ketan/postproc-20120220-1617-hvcmjs71.log

The jstack for the hung run is:
http://ci.uchicago.edu/~ketan/cybershake.jstack

The log for the successful run is:
http://ci.uchicago.edu/~ketan/postproc-20120220-1454-lfog5xu1.log

Regards,
--
Ketan

From jonmon at utexas.edu Tue Feb 21 12:56:23 2012
From: jonmon at utexas.edu (Jonathan Monette)
Date: Tue, 21 Feb 2012 12:56:23 -0600
Subject: [Swift-devel] Command Reply Timeout
Message-ID:

What does this mean?
Command Command(54, HEARTBEAT): handling reply timeout; sendReqTime=120221-185033.459, sendTime=700101-000000.000, now=120221-185233.649, channel=SC-0221-330346-000016-000001

I see these lines sprinkled throughout my swift run, and in the logs they are at log4j level WARN. What is it trying to tell me? Should I be worried? I cannot tell if my workflow is making progress or not. It looked like it was, even with these messages popping up, but now I am not sure. What is the above line saying?

From hategan at mcs.anl.gov Tue Feb 21 13:00:57 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 21 Feb 2012 11:00:57 -0800
Subject: [Swift-devel] Command Reply Timeout
In-Reply-To:
References:
Message-ID: <1329850857.17237.0.camel@blabla>

It's saying that a connection between the coaster service and a worker isn't going quite right.

On Tue, 2012-02-21 at 12:56 -0600, Jonathan Monette wrote:
> What does this mean?
> Command Command(54, HEARTBEAT): handling reply timeout; sendReqTime=120221-185033.459, sendTime=700101-000000.000, now=120221-185233.649, channel=SC-0221-330346-000016-000001
>
> I see these lines sprinkled throughout my swift run, and in the logs they are at log4j level WARN. What is it trying to tell me? Should I be worried? I cannot tell if my workflow is making progress or not. It looked like it was, even with these messages popping up, but now I am not sure. What is the above line saying?

From jonmon at mcs.anl.gov Tue Feb 21 13:08:41 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Tue, 21 Feb 2012 13:08:41 -0600
Subject: [Swift-devel] Command Reply Timeout
In-Reply-To: <1329850857.17237.0.camel@blabla>
References: <1329850857.17237.0.camel@blabla>
Message-ID: <798DD6A5-30AA-4273-A792-263D9D792E4C@mcs.anl.gov>

I see... thanks. I will figure out what happened.

On Feb 21, 2012, at 1:00 PM, Mihael Hategan wrote:

> It's saying that a connection between the coaster service and a worker
> isn't going quite right.
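
The timestamps in the warning discussed above can be read mechanically. The sketch below is an illustration only: it assumes the fields use a yyMMdd-HHmmss.SSS layout (an inference from the values, not something confirmed in this thread), and the helper names are hypothetical. Under that assumption it computes how long the HEARTBEAT request had been waiting for a reply and checks whether sendTime is the zero (epoch) timestamp, which would suggest that no send time was ever recorded.

```python
import re
from datetime import datetime

# Assumed timestamp layout: yyMMdd-HHmmss.SSS (inferred, not documented here).
STAMP = "%y%m%d-%H%M%S.%f"


def parse_fields(line):
    """Return the key=value fields of a timeout warning as a dict."""
    return dict(re.findall(r"(\w+)=([\w.\-]+)", line))


def reply_wait_seconds(line):
    """Seconds between sendReqTime and now in the warning."""
    fields = parse_fields(line)
    sent_req = datetime.strptime(fields["sendReqTime"], STAMP)
    now = datetime.strptime(fields["now"], STAMP)
    return (now - sent_req).total_seconds()


def send_time_is_epoch(line):
    """True if sendTime is 1970-01-01 00:00:00, i.e. a zero timestamp."""
    fields = parse_fields(line)
    return datetime.strptime(fields["sendTime"], STAMP) == datetime(1970, 1, 1)


line = ("Command Command(54, HEARTBEAT): handling reply timeout; "
        "sendReqTime=120221-185033.459, sendTime=700101-000000.000, "
        "now=120221-185233.649, channel=SC-0221-330346-000016-000001")

print(reply_wait_seconds(line))
print(send_time_is_epoch(line))
```

For the line quoted above, this gives a wait of about 120 seconds between sendReqTime and now. Whether that interval is a configured reply timeout is an assumption; the thread itself only says that the connection between the coaster service and a worker is not behaving well.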
>
> On Tue, 2012-02-21 at 12:56 -0600, Jonathan Monette wrote:
>> What does this mean?
>> Command Command(54, HEARTBEAT): handling reply timeout; sendReqTime=120221-185033.459, sendTime=700101-000000.000, now=120221-185233.649, channel=SC-0221-330346-000016-000001
>>
>> I see these lines sprinkled throughout my swift run, and in the logs they are at log4j level WARN. What is it trying to tell me? Should I be worried? I cannot tell if my workflow is making progress or not. It looked like it was, even with these messages popping up, but now I am not sure. What is the above line saying?
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From davidk at ci.uchicago.edu Wed Feb 22 09:12:14 2012
From: davidk at ci.uchicago.edu (David Kelly)
Date: Wed, 22 Feb 2012 09:12:14 -0600 (CST)
Subject: [Swift-devel] cybershake hangs in the latest Swift 0.93 does not hang in a previous rel
In-Reply-To:
Message-ID: <1865835731.130174.1329923534838.JavaMail.root@zimbra-mb2.anl.gov>

I changed to r5609 to match Ketan's working version, but am still getting the same errors.

[davidk at communicado run]$ swift -version
no sites file specified, setting to default: /home/davidk/swift-0.93/cog/modules/swift/dist/swift-svn/etc/sites.xml
Swift 0.93 swift-r5609 cog-r3361

I'll dig through the logs a bit and see if I can narrow it down. Here is the message I get via stdout.

No events in 10s.
Registered futures: string[] var_str Closed, 242 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 242 elements, no listeners string[] var_str Closed, 18 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 32 elements, no listeners string[] var_str Closed, 32 elements, no listeners string[] var_str Closed, 2 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 18 elements, no listeners string[] var_str Closed, 50 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 200 elements, no listeners string[] var_str Closed, 2 elements, no listeners string[] var_str Closed, 242 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 18 elements, no listeners string[] var_str Closed, 50 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 18 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 18 elements, no listeners string[] var_str Closed, 18 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 50 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 2 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 18 elements, no listeners SgtDim sgt_var - F/sgt_var..y:SgtDim - Open string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 128 elements, no listeners string[] var_str Closed, 18 elements, no listeners string[] var_str Closed, 72 elements, no listeners string[] var_str Closed, 2 elements, no listeners string[] 
var_str Closed, 50 elements, no listeners string[] var_str Closed, 32 elements, no listeners string[] var_str Closed, 50 elements, no listeners string[] var_str Closed, 128 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 32 elements, no listeners string[] var_str Closed, 2 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 72 elements, no listeners string[] var_str Closed, 200 elements, no listeners string[] var_str Closed, 32 elements, no listeners string[] var_str Closed, 18 elements, no listeners string[] var_str Closed, 32 elements, no listeners string[] var_str Closed, 200 elements, no listeners string[] var_str Closed, 8 elements, no listeners string[] var_str Closed, 50 elements, no listeners string[] var_str Closed, 50 elements, no listeners string[] var_str Closed, 50 elements, no listeners string[] var_str Closed, 32 elements, no listeners string[] var_str Closed, 8 elements, no listeners ---- Waiting threads: 0-13-118-6 0-13-52-6 0-13-30-6 0-13-139-6 0-13-194-6 0-13-24-6 0-13-127-6 0-13-149-6 0-13-42-6 0-13-138-6 0-13-174-6 0-13-89-6 0-13-36-6 0-13-156-6 0-13-53-6 0-13-152-6 0-13-90-6 0-13-158-6 0-13-132-6 0-13-136-6 0-13-87-6 0-13-92-6 0-13-182-6 0-13-9-6 0-13-29-6 0-13-60-6 0-13-70-6 0-13-12-6 0-13-81-6 0-13-178-6 0-13-49-6 0-13-97-6 0-13-65-6 0-13-145-6 0-13-135-6 0-13-190-6 0-13-11-6 0-13-163-6 0-13-155-6 0-13-16-6 0-13-154-6 0-13-167-6 0-13-173-6 0-13-166-6 0-13-0-6 0-13-191-6 0-13-37-6 0-13-17-6 0-13-85-6 0-13-79-6 0-13-134-6 0-13-176-6 0-13-125-6 0-13-38-6 0-13-187-6 0-13-35-6 0-13-171-6 0-13-88-6 0-13-131-6 0-13-106-6 0-13-55-6 0-13-168-6 0-13-147-6 0-13-148-6 0-13-99-6 0-13-34-6 0-13-2-6 0-13-100-6 0-13-48-6 0-13-5-6 0-13-69-6 0-13-80-6 0-13-153-6 0-13-122-6 0-13-105-6 0-13-113-6 0-13-26-6 0-13-124-6 0-13-32-6 0-13-123-6 0-13-98-6 0-13-170-6 0-13-28-6 0-13-22-6 0-13-162-6 0-13-15-6 0-13-64-6 0-13-13-6 0-13-111-6 0-13-66-6 0-13-43-6 0-13-19-6 
0-13-78-6 0-13-157-6 0-13-57-6 0-13-142-6 0-13-151-6 0-13-3-6 0-13-140-6 0-13-76-6 0-13-188-6 0-13-91-6 0-13-75-6 0-13-47-6 0-13-50-6 0-13-41-6 0-13-40-6 0-13-21-6 0-13-193-6 0-13-102-6 0-13-59-6 0-13-189-6 0-13-31-6 0-13-197-6 0-13-110-6 0-13-4-6 0-13-20-6 0-13-185-6 0-13-137-6 0-13-121-6 0-13-180-6 0-13-169-6 0-13-58-6 0-13-116-6 0-13-45-6 0-13-93-6 0-13-146-6 0-13-164-6 0-13-101-6 0-13-179-6 0-13-115-6 0-13-23-6 0-13-94-6 0-13-44-6 0-13-177-6 0-13-10-6 0-13-84-6 0-13-186-6 0-13-150-6 0-13-198-6 0-13-195-6 0-13-14-6 0-13-143-6 0-13-63-6 0-13-77-6 0-13-51-6 0-13-25-6 0-13-172-6 0-13-18-6 0-13-68-6 0-13-159-6 0-13-128-6 0-13-104-6 0-13-141-6 0-13-6-6 0-13-126-6 0-13-108-6 0-13-1-6 0-13-199-6 0-13-175-6 0-13-120-6 0-13-119-6 0-13-192-6 0-13-183-6 0-13-103-6 0-13-133-6 0-13-184-6 0-13-161-6 0-13-196-6 0-13-112-6 0-13-129-6 0-13-33-6 0-13-72-6 0-13-74-6 0-13-39-6 0-13-160-6 0-13-54-6 0-13-117-6 0-13-114-6 0-13-95-6 0-13-165-6 0-13-181-6 0-13-46-6 0-13-27-6 0-13-109-6 0-13-130-6 0-13-144-6 0-13-82-6 0-13-67-6 0-13-8-6 0-13-62-6 0-13-73-6 0-13-56-6 0-13-86-6 0-13-7-6 0-13-107-6 0-13-83-6 0-13-71-6 0-13-96-6 0-13-61-6 ---- ----- Original Message ----- > From: "Ketan Maheshwari" > To: "Mihael Hategan" > Cc: "David Kelly" , "Swift Devel" > Sent: Monday, February 20, 2012 4:42:10 PM > Subject: [Swift-devel] cybershake hangs in the latest Swift 0.93 does not hang in a previous rel > Mihael, > > > Reporting a case of deadlock/hang occurring in the recent swift 0.93 > update: > > > I've been working on the cybershake script with David today and it > seems that the script hangs at around the same on David's Swift > installation which is: > Swift 0.93 swift-r5658 cog-r3361 > > > I successfully tested the same configuration with my swift > installation which is a bit older release: > Swift 0.93 swift-r5609 (swift modified locally) cog-r3361 (cog > modified locally) > > > > The log for the hanged version is: > http://ci.uchicago.edu/~ketan/postproc-20120220-1617-hvcmjs71.log > > > 
The jstack for the hang version is: > http://ci.uchicago.edu/~ketan/cybershake.jstack > > > The log for the successful run is: > http://ci.uchicago.edu/~ketan/postproc-20120220-1454-lfog5xu1.log > > > Regards, -- > Ketan > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From jonmon at mcs.anl.gov Wed Feb 22 15:45:53 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 22 Feb 2012 15:45:53 -0600 Subject: [Swift-devel] Walltime exceeded error In-Reply-To: <2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov> References: <1329775876.6072.1.camel@blabla> <1329776194.6072.2.camel@blabla> <1329776652.6072.3.camel@blabla> <2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov> Message-ID: <62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov> Mihael, I have a hung Java process showing this error right now, 2 jobs are stuck in the initializing state. I have a jstack -l of this hung java process. Is there anything else you need before I kill it? Do you need any other probing information from this process other than this jstack output? On Feb 20, 2012, at 4:27 PM, Jonathan Monette wrote: > Correction, Beagle does have jstack. Do not know why I thought it did not have it. > > On Feb 20, 2012, at 4:26 PM, Jonathan Monette wrote: > >> No. This was a run Ketan did a while back. I have been using this as a reference when trying to re-create the issue with a simple catsnsleep job. >> >> This run was also done on Beagle using the pre-installed java package, which does not have jstack. >> >> On Feb 20, 2012, at 4:24 PM, Mihael Hategan wrote: >> >>> I'm not sure if I asked this, but did you happen to get a jstack of the >>> hanging swift? >>> >>> On Mon, 2012-02-20 at 16:19 -0600, Jonathan Monette wrote: >>>> No. The last run was run using Beagle. That is the more interesting one. 
That shows jobs failed but the "Failed but can retry" count was not printed very often. You can see that in the swift.out file. Eventually the workflow just hung and the hang checker kicked in. You can also see that Swift got stuck in the initializing state with a count of 61. >>>> >>>> On Feb 20, 2012, at 4:16 PM, Mihael Hategan wrote: >>>> >>>>> On Mon, 2012-02-20 at 16:14 -0600, Jonathan Monette wrote: >>>>>> /gpfs/pads/swift/jonmon/Swift/tests/catsnsleep <----- on /gpfs/pads >>>>>> /home/jonmon/public_html/Swift/bugs/SciColSim/run002 <----- on any CI machine >>>>> >>>>> Ok. Sorry. I thought the last one was on beagle. >>>>> >>>> >>> >>> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Wed Feb 22 15:56:24 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Feb 2012 15:56:24 -0600 (CST) Subject: [Swift-devel] Walltime exceeded error In-Reply-To: <62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov> Message-ID: <73065862.21539.1329947784250.JavaMail.root@zimbra.anl.gov> Hi Jon, I think Mondays Mihael is pretty swamped with school commitments. The only other thing I can think of grabbing is worker logs, but I doubt that any provision was made to request worker logging for this run. I'd go ahead and terminate the run. - Mike ----- Original Message ----- > From: "Jonathan Monette" > To: "Mihael Hategan" > Cc: "Swift Devel" > Sent: Wednesday, February 22, 2012 3:45:53 PM > Subject: Re: [Swift-devel] Walltime exceeded error > Mihael, > I have a hung Java process showing this error right now, 2 jobs are > stuck in the initializing state. I have a jstack -l of this hung > java process. 
Is there anything else you need before I kill it? Do you > need any other probing information from this process other than this > jstack output? > > On Feb 20, 2012, at 4:27 PM, Jonathan Monette wrote: > > > Correction, Beagle does have jstack. Do not know why I thought it > > did not have it. > > > > On Feb 20, 2012, at 4:26 PM, Jonathan Monette wrote: > > > >> No. This was a run Ketan did a while back. I have been using this > >> as a reference when trying to re-create the issue with a simple > >> catsnsleep job. > >> > >> This run was also done on Beagle using the pre-installed java > >> package, which does not have jstack. > >> > >> On Feb 20, 2012, at 4:24 PM, Mihael Hategan wrote: > >> > >>> I'm not sure if I asked this, but did you happen to get a jstack > >>> of the > >>> hanging swift? > >>> > >>> On Mon, 2012-02-20 at 16:19 -0600, Jonathan Monette wrote: > >>>> No. The last run was run using Beagle. That is the more > >>>> interesting one. That shows jobs failed but the "Failed but can > >>>> retry" count was not printed very often. You can see that in the > >>>> swift.out file. Eventually the workflow just hung and the hang > >>>> checker kicked in. You can also see that Swift got stuck in the > >>>> initializing state with a count of 61. > >>>> > >>>> On Feb 20, 2012, at 4:16 PM, Mihael Hategan wrote: > >>>> > >>>>> On Mon, 2012-02-20 at 16:14 -0600, Jonathan Monette wrote: > >>>>>> /gpfs/pads/swift/jonmon/Swift/tests/catsnsleep <----- on > >>>>>> /gpfs/pads > >>>>>> /home/jonmon/public_html/Swift/bugs/SciColSim/run002 <----- on > >>>>>> any CI machine > >>>>> > >>>>> Ok. Sorry. I thought the last one was on beagle. 
> >>>>> > >>>> > >>> > >>> > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Wed Feb 22 16:00:34 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 22 Feb 2012 16:00:34 -0600 Subject: [Swift-devel] Walltime exceeded error In-Reply-To: <73065862.21539.1329947784250.JavaMail.root@zimbra.anl.gov> References: <73065862.21539.1329947784250.JavaMail.root@zimbra.anl.gov> Message-ID: <5E5C647A-E37F-41D2-8AD4-B5C4135BC609@mcs.anl.gov> Ok. I shall kill it. > Hi Jon, I think Mondays Mihael is pretty swamped with school commitments. > > The only other thing I can think of grabbing is worker logs, but I doubt that any provision was made to request worker logging for this run. > > I'd go ahead and terminate the run. > > - Mike > > ----- Original Message ----- >> From: "Jonathan Monette" >> To: "Mihael Hategan" >> Cc: "Swift Devel" >> Sent: Wednesday, February 22, 2012 3:45:53 PM >> Subject: Re: [Swift-devel] Walltime exceeded error >> Mihael, >> I have a hung Java process showing this error right now, 2 jobs are >> stuck in the initializing state. I have a jstack -l of this hung >> java process. Is there anything else you need before I kill it? Do you >> need any other probing information from this process other than this >> jstack output? 
>> >> On Feb 20, 2012, at 4:27 PM, Jonathan Monette wrote: >> >>> Correction, Beagle does have jstack. Do not know why I thought it >>> did not have it. >>> >>> On Feb 20, 2012, at 4:26 PM, Jonathan Monette wrote: >>> >>>> No. This was a run Ketan did a while back. I have been using this >>>> as a reference when trying to re-create the issue with a simple >>>> catsnsleep job. >>>> >>>> This run was also done on Beagle using the pre-installed java >>>> package, which does not have jstack. >>>> >>>> On Feb 20, 2012, at 4:24 PM, Mihael Hategan wrote: >>>> >>>>> I'm not sure if I asked this, but did you happen to get a jstack >>>>> of the >>>>> hanging swift? >>>>> >>>>> On Mon, 2012-02-20 at 16:19 -0600, Jonathan Monette wrote: >>>>>> No. The last run was run using Beagle. That is the more >>>>>> interesting one. That shows jobs failed but the "Failed but can >>>>>> retry" count was not printed very often. You can see that in the >>>>>> swift.out file. Eventually the workflow just hung and the hang >>>>>> checker kicked in. You can also see that Swift got stuck in the >>>>>> initializing state with a count of 61. >>>>>> >>>>>> On Feb 20, 2012, at 4:16 PM, Mihael Hategan wrote: >>>>>> >>>>>>> On Mon, 2012-02-20 at 16:14 -0600, Jonathan Monette wrote: >>>>>>>> /gpfs/pads/swift/jonmon/Swift/tests/catsnsleep <----- on >>>>>>>> /gpfs/pads >>>>>>>> /home/jonmon/public_html/Swift/bugs/SciColSim/run002 <----- on >>>>>>>> any CI machine >>>>>>> >>>>>>> Ok. Sorry. I thought the last one was on beagle. 
>>>>>>> >>>>>> >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From hategan at mcs.anl.gov Wed Feb 22 16:28:22 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 22 Feb 2012 14:28:22 -0800 Subject: [Swift-devel] Walltime exceeded error In-Reply-To: <62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov> References: <1329775876.6072.1.camel@blabla> <1329776194.6072.2.camel@blabla> <1329776652.6072.3.camel@blabla> <2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov> <62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov> Message-ID: <1329949702.23375.0.camel@blabla> On Wed, 2012-02-22 at 15:45 -0600, Jonathan Monette wrote: > Mihael, > I have a hung Java process showing this error right now, 2 jobs are > stuck in the initializing state. I have a jstack -l of this > hung java process. Is there anything else you need before I kill it? > Do you need any other probing information from this process other than > this jstack output? I don't think so. 
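For anyone reproducing this hang, the `jstack -l` capture discussed in this thread could be scripted roughly as follows. This is only a sketch in Python, not part of Swift: the helper names `jstack_command` and `capture_thread_dump` are made up for illustration, and it assumes a JDK `jstack` binary is on the PATH and that the dump should be written to a `jstack.log` file in the run directory.

```python
import subprocess
from pathlib import Path


def jstack_command(pid):
    """Build the `jstack -l <pid>` command line (-l adds lock details)."""
    return ["jstack", "-l", str(pid)]


def capture_thread_dump(pid, run_dir, run=subprocess.run):
    """Write a thread dump of the JVM with the given pid to <run_dir>/jstack.log.

    `run` defaults to subprocess.run; it is a parameter only so that the
    sketch can be exercised without a real JVM (an assumption of this example).
    """
    out_path = Path(run_dir) / "jstack.log"
    result = run(jstack_command(pid), capture_output=True, text=True, check=True)
    out_path.write_text(result.stdout)
    return out_path
```

Calling `capture_thread_dump(pid, ".")` with the pid of the hung Java process, before killing it, would leave the dump next to the other run files.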
From jonmon at mcs.anl.gov Wed Feb 22 16:33:03 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 22 Feb 2012 16:33:03 -0600 Subject: [Swift-devel] Walltime exceeded error In-Reply-To: <1329949702.23375.0.camel@blabla> References: <1329775876.6072.1.camel@blabla> <1329776194.6072.2.camel@blabla> <1329776652.6072.3.camel@blabla> <2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov> <62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov> <1329949702.23375.0.camel@blabla> Message-ID: <5B2D062C-8E9B-4DA3-A317-7FE5EAC2DABB@mcs.anl.gov> Ok. I have killed the process and I am in the process of copying the run directory from the lustre file system on Beagle to /gpfs/pads On Feb 22, 2012, at 4:28 PM, Mihael Hategan wrote: > On Wed, 2012-02-22 at 15:45 -0600, Jonathan Monette wrote: >> Mihael, >> I have a hung Java process showing this error right now, 2 jobs are >> stuck in the initializing state. I have a jstack -l of this >> hung java process. Is there anything else you need before I kill it? >> Do you need any other probing information from this process other than >> this jstack output? > > I don't think so. > > From jonmon at mcs.anl.gov Wed Feb 22 17:05:40 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 22 Feb 2012 17:05:40 -0600 Subject: [Swift-devel] Walltime exceeded error In-Reply-To: <5B2D062C-8E9B-4DA3-A317-7FE5EAC2DABB@mcs.anl.gov> References: <1329775876.6072.1.camel@blabla> <1329776194.6072.2.camel@blabla> <1329776652.6072.3.camel@blabla> <2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov> <62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov> <1329949702.23375.0.camel@blabla> <5B2D062C-8E9B-4DA3-A317-7FE5EAC2DABB@mcs.anl.gov> Message-ID: <22E47E80-C5C5-46DA-80CA-1A6063E90727@mcs.anl.gov> This has been done. I have also moved the run that Ketan had produced to PADS. 
/gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run002 <-----Ketan's run /gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run047 <-----My run(has a jstack.log file, also more recent) On Feb 22, 2012, at 4:33 PM, Jonathan Monette wrote: > Ok. I have killed the process and I am in the process of copying the run directory from the lustre file system on Beagle to /gpfs/pads > > On Feb 22, 2012, at 4:28 PM, Mihael Hategan wrote: > >> On Wed, 2012-02-22 at 15:45 -0600, Jonathan Monette wrote: >>> Mihael, >>> I have a hung Java process showing this error right now, 2 jobs are >>> stuck in the initializing state. I have a jstack -l of this >>> hung java process. Is there anything else you need before I kill it? >>> Do you need any other probing information from this process other than >>> this jstack output? >> >> I don't think so. >> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Fri Feb 24 08:38:33 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 24 Feb 2012 08:38:33 -0600 (CST) Subject: [Swift-devel] Questions on coaster behavior In-Reply-To: <308182225.23832.1330012227581.JavaMail.root@zimbra.anl.gov> Message-ID: <1155840334.27995.1330094313609.JavaMail.root@zimbra.anl.gov> Hi Mihael, All, I wanted to confirm some aspects of Coaster behavior that are still unclear to me after re-reading the UCC paper: Scheduling: the coaster provider scheduler starts a number of worker blocks that are sized (in time and nodes) based on the size of its queue when it computes a schedule. This queue consists of jobs that were emitted by Swift to the provider based on the site throttle. (note that by "job" here I mean the app() execution, not the LRM job). But the coaster provider does not actually launch a job on a free coaster slot until the slot is available, right? 
I.e., there is no tight connection between the coaster slot that a job's time estimate contributed to, and the worker that the job is actually run on, right?

Jobs are placed on workers at the last possible moment, and thus when a worker can take a job, it can get *any* job that is queued for that site.

Is all this correct? The key point behind this question being "is the coaster scheduler dynamic enough to handle cases where the job runtime estimates were conservative, and make best use of all available worker cores"?

Staging: There are no cases in which the coaster provider staging mechanism pre-stages input data, right?

Thanks,

- Mike

From hategan at mcs.anl.gov Fri Feb 24 12:30:50 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 24 Feb 2012 10:30:50 -0800
Subject: [Swift-devel] Questions on coaster behavior
In-Reply-To: <1155840334.27995.1330094313609.JavaMail.root@zimbra.anl.gov>
References: <1155840334.27995.1330094313609.JavaMail.root@zimbra.anl.gov>
Message-ID: <1330108250.4574.7.camel@blabla>

On Fri, 2012-02-24 at 08:38 -0600, Michael Wilde wrote:
> Hi Mihael, All,
>
> I wanted to confirm some aspects of Coaster behavior that are still unclear to me after re-reading the UCC paper:
>
> Scheduling: the coaster provider scheduler starts a number of worker blocks that are sized (in time and nodes) based on the size of its queue when it computes a schedule. This queue consists of jobs that were emitted by Swift to the provider based on the site throttle.
>
> (note that by "job" here I mean the app() execution, not the LRM job).
>
> But the coaster provider does not actually launch a job on a free
> coaster slot until the slot is available, right?

That is correct. Jobs are queued by the coaster service, blocks are submitted and killed based on the shape of the queued jobs, and once the blocks are running, jobs are sent to them.
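The late binding described here can be illustrated with a small model. This is a sketch in Python, not the coaster service code: the `Job`, `Worker`, `pick_job`, and `dispatch` names are invented for illustration, and it assumes that "fits" means the job's estimated walltime is no longer than the time left in the worker's block. Jobs wait in a queue and are bound to a worker only when that worker is idle; the idle worker then takes the longest queued job that fits, a rule stated later in this reply.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    walltime: int  # estimated run time in seconds


@dataclass
class Worker:
    name: str
    time_left: int  # seconds remaining in the worker's block (assumed meaning)
    current: Optional[Job] = None  # None means the worker is idle


def pick_job(queue: List[Job], worker: Worker) -> Optional[Job]:
    """Return the longest queued job that fits in the worker's remaining time."""
    fitting = [job for job in queue if job.walltime <= worker.time_left]
    if not fitting:
        return None
    return max(fitting, key=lambda job: job.walltime)


def dispatch(queue: List[Job], workers: List[Worker]) -> None:
    """Hand queued jobs to idle workers; busy workers are left alone."""
    for worker in workers:
        if worker.current is not None:
            continue  # jobs are only placed on workers with nothing to do
        job = pick_job(queue, worker)
        if job is not None:
            queue.remove(job)
            worker.current = job
```

Because a job is chosen only at the moment a worker is free, no job is tied in advance to the block its time estimate helped to size.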
> I.e., there is no tight connection between the coaster slot that a
> job's time estimate contributed to, and the worker that the job is
> actually run on, right?

That's right. Only the totals are tightly connected.

> Jobs are placed on workers at the last possible moment, and thus when
> a worker can take a job, it can get *any* job that is queued for that
> site.

Jobs are placed on workers when workers don't have anything else to do. Each worker will get the longest job that it can fit.

> Is all this correct? The key point behind this question being "is
> the coaster scheduler dynamic enough to handle cases where the job
> runtime estimates were conservative, and make best use of all
> available worker cores"?

Yes. That's the basic idea.

>
> Staging: There are no cases in which the coaster provider staging mechanism pre-stages input data, right?

If by pre-staging you mean staging before the job makes it to the worker, then no. The worker initiates staging.

From jonmon at mcs.anl.gov Fri Feb 24 16:09:54 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 24 Feb 2012 16:09:54 -0600
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <22E47E80-C5C5-46DA-80CA-1A6063E90727@mcs.anl.gov>
References: <1329775876.6072.1.camel@blabla> <1329776194.6072.2.camel@blabla> <1329776652.6072.3.camel@blabla> <2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov> <62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov> <1329949702.23375.0.camel@blabla> <5B2D062C-8E9B-4DA3-A317-7FE5EAC2DABB@mcs.anl.gov> <22E47E80-C5C5-46DA-80CA-1A6063E90727@mcs.anl.gov>
Message-ID: <91121EC0-D9A6-4C65-8AE3-29C1F37ED18E@mcs.anl.gov>

I have updated the bugzilla bug with the below directories: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=720

I have also added another directory showing the same behavior
/gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run054

There is a jstack.log file in that directory.
All three of the run directories show that jobs get stuck in the initialized state and the hang checker kicks in. On Feb 22, 2012, at 5:05 PM, Jonathan Monette wrote: > This has been done. I have also moved the run that Ketan had produced to PADS. > > /gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run002 <-----Ketan's run > /gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run047 <-----My run(has a jstack.log file, also more recent) > > On Feb 22, 2012, at 4:33 PM, Jonathan Monette wrote: > >> Ok. I have killed the process and I am in the process of copying the run directory from the lustre file system on Beagle to /gpfs/pads >> >> On Feb 22, 2012, at 4:28 PM, Mihael Hategan wrote: >> >>> On Wed, 2012-02-22 at 15:45 -0600, Jonathan Monette wrote: >>>> Mihael, >>>> I have a hung Java process showing this error right now, 2 jobs are >>>> stuck in the initializing state. I have a jstack -l of this >>>> hung java process. Is there anything else you need before I kill it? >>>> Do you need any other probing information from this process other than >>>> this jstack output? >>> >>> I don't think so. >>> >>> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.iit.edu Sun Feb 26 07:57:48 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Sun, 26 Feb 2012 07:57:48 -0600 Subject: [Swift-devel] CFP: The 9th Int. Conf. on Autonomic Computing (ICAC) 2012 Message-ID: <4F4A3A5C.7010304@cs.iit.edu> CALL FOR PAPERS The 9th International Conference on Autonomic Computing (ICAC 2012) September 16-20, 2012. 
San Jose, CA, USA http://icac2012.cs.fiu.edu/ ----------------------------------------------------------------- IMPORTANT DATES Paper and Poster Submission: March 9, 2012, 11:59pm PST Notification: May 18, 2012 Camera-ready Due: June 8, 2012 ----------------------------------------------------------------- OVERVIEW ICAC is the leading conference on autonomic computing techniques, foundations, and applications. Autonomic computing refers to methods and means for automated management of performance, fault, security, and configuration with little involvement of users or administrators. Systems introducing new autonomic features are becoming increasingly prevalent, motivating research that spans a variety of areas, from computer systems, networking, software engineering, and data management to machine learning, control theory, and bio-inspired computing. ICAC brings together researchers and practitioners across these disciplines to address multiple facets of adaptation and self-management in computing systems and applications from different perspectives. Autonomic computing solutions are sought for clouds, grids, data centers, enterprise software, internet services, data services, smart phones, embedded systems, and sensor networks. In these environments, resources and applications must be managed to maximize performance and minimize cost, while maintaining predictable and reliable behavior in the face of varying workloads, failures, and malicious threats. Papers are solicited from all areas of autonomic computing, including (but not limited to): * End-to-end techniques for management of resources, workloads, performance, faults, power/cooling, security, and others. * Self-managing components, such as server, storage, network protocols, or specific application elements, and embedded and mobile end systems such as smart phones. 
* Decision and analysis techniques and their use, such as machine learning, control theory, predictive methods, probability and stochastic processes, queuing theory methodologies, emergent behavior, rule-based systems, and bio-inspired techniques. * Monitoring systems for autonomic computing. * Hypervisor, operating systems, hardware, or application support for autonomic computing. * Novel human interfaces for monitoring and controlling autonomic systems. * Management topics, such as specification and modeling of service-level agreements, behavior enforcement and tie-in with IT governance. * Toolkits, frameworks, principles and architectures, from software engineering practices and experimental methodologies to agent-based techniques and virtualization. * Fundamental science and theory of self-managing systems: understanding, controlling or exploiting system behaviors to enforce autonomic properties. * Applications of autonomic computing and experiences with prototyped or deployed systems solving real-world problems in science, engineering, business and society. Papers will be judged on originality, significance, interest, correctness, clarity and relevance to the broader community. Papers should report on experiences, measurements, user studies, or other evaluations, as appropriate. Evaluations of a prototype or large-scale deployment of systems and applications is expected. PAPER AND POSTER SUBMISSIONS Full papers (a maximum of 10 pages in the two-column ACM proceedings format) and posters (2 pages) are invited on a wide variety of topics relating to autonomic computing. Submitted papers must be original work, and may not be under consideration for another conference or journal. Complete formatting and submission instructions can be found on the conference web site. Accepted papers and posters will appear in proceedings distributed at the conference and available electronically. 
Relevant top ICAC'12 papers will be invited for "fast-track" submissions to the ACM Transactions on Autonomous and Adaptive Systems (TAAS).

WORKSHOPS, DEMONSTRATIONS AND EXHIBITION

ICAC'12 welcomes proposals for co-located workshops on topics of interest to the autonomic computing community. Workshop proposals should be submitted to the Workshop Chair, Fred Douglis (f.douglis at computer.org) by February 10, 2012. Workshops are expected to publish proceedings, and should cover areas that complement the main program.

ICAC'12 will also feature a demonstration and exhibition session consisting of prototypes and technology artifacts such as demonstrating autonomic software or autonomic computing principles. Entries will be judged by a separate committee led by the demo/exhibit chair.

INDUSTRY SESSION

One of ICAC's important roles is to bring together researchers and practitioners from academia and industry. In its industry session, ICAC helps fulfill this role by presenting an industry viewpoint on technologies, products, and market needs. The industry session also addresses current challenges, and opportunities for academic and corporate research collaborations. We encourage industry leaders, including entrepreneurs, product developers, architects, managers, marketers and end users, to submit their papers and posters reflecting such industry perspectives as part of the regular submission process.

------------------------------------------------------------------
ORGANIZERS

GENERAL CHAIR
Dejan Milojicic, HP Labs

PROGRAM CHAIRS
Dongyan Xu, Purdue University
Vanish Talwar, HP Labs

INDUSTRY CHAIR
Xiaoyun Zhu, VMware

WORKSHOPS CHAIR
Fred Douglis, EMC

POSTERS/DEMO/EXHIBITS CHAIR
Eno Thereska, Microsoft Research

FINANCE CHAIR
Michael Kozuch, Intel

LOCAL ARRANGEMENT CHAIR
Jessica Blaine

PUBLICITY CHAIRS
Daniel Batista, University of São Paulo
Vartan Padaryan, ISP/Russian Academy of Sci.
Ioan Raicu, Illinois Inst. of Technology
Jianfeng Zhan, ICT/Chinese Academy of Sci.
Ming Zhao, Florida Intl. University

PROGRAM COMMITTEE
Tarek Abdelzaher, UIUC
Umesh Bellur, IIT, Bombay
Ken Birman, Cornell University
Rajkumar Buyya, Univ. of Melbourne
Rocky Chang, Hong Kong Polytechnic University
Yuan Chen, HP Labs
Alva Couch, Tufts University
Peter Dinda, Northwestern University
Fred Douglis, EMC
Renato Figueiredo, University of Florida
Mohamed Hefeeda, Qatar Computing Research Institute
Joe Hellerstein, Google
Geoff Jiang, NEC Labs
Jeff Kephart, IBM Research
Emre Kiciman, Microsoft Research
Fabio Kon, University of São Paulo
Michael Kozuch, Intel
Dejan Milojicic, HP Labs
Klara Nahrstedt, UIUC
Priya Narasimhan, CMU
Manish Parashar, Rutgers University
Ioan Raicu, Illinois Inst. of Technology
Omer Rana, Cardiff University
Masoud Sadjadi, Florida Intl. University
Rick Schlichting, AT&T Labs
Hartmut Schmeck, KIT
Karsten Schwan, Georgia Tech
Onn Shehory, IBM Research
Eno Thereska, Microsoft Research
Xiaoyun Zhu, VMware

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================
=================================================================

From iraicu at cs.iit.edu Sun Feb 26 08:28:38 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Sun, 26 Feb 2012 08:28:38 -0600
Subject: [Swift-devel] CFP: IEEE Int.
Scalable Computing Challenge (SCALE) at CCGrid 2012 Message-ID: <4F4A4196.1090701@cs.iit.edu> CALL FOR PAPERS The Fifth IEEE International Scalable Computing Challenge (SCALE) Co-located with the 11th CCGrid Conference in Ottawa, Canada Sponsored by the IEEE Computer Society Technical Committee on Scalable Computing (TCSC) May 13-16, 2012 http://www.cloudbus.org/ccgrid2012/cfp-scale.html --------------------------------------------------------------------- Objective and Focus: The objective of the Fifth IEEE International Scalable Computing Challenge (SCALE 2012), sponsored by the IEEE Computer Society Technical Committee on Scalable Computing (TCSC), is to highlight and showcase real-world problem solving using computing that scales. Effective solutions to many scientific problems require applications that can scale. There are different dimensions to application scaling: for example, applications can scale-up to large number of cores or compute units, scale-out to utilize multiple distinct compute units, or scale-down to release resources that are no longer needed. In order to scale, applications need the support of tools, middleware, infrastructure, programming systems, etc. SCALE is concerned with advances in application development and supporting infrastructure that enable scaling. Call for Proposals: The Fifth IEEE International Scalable Computing Challenge (SCALE 2012) contest will focus on end-to-end problem solving using concepts, technologies and architectures (including Clusters, Grids and Clouds) that facilitate scaling. Participants in the challenge will be expected to identify significant current real-world problems where scalable computing techniques can be effectively used, and design, implement, evaluate and demonstrate solutions. SCALE2012 will be held in conjunction with the 11th CCGrid Conference in Ottawa, Canada on 13-16 May, 2012. 
We invite teams to submit white papers outlining the problem addressed and the technologies employed to enable applications to scale. White papers should be up to 4 pages long, 12-pt. font and single column, and in addition to listing team members and contact information, should clearly outline: 1. The problem being solved and the technology employed 2. The application scenario and its requirements 3. Performance data and a qualitative description of how the application scales -- scale-up, scale-out or any other type of scaling 4. The solution -- architecture, underlying concepts and technologies used -- highlighting the innovative aspects of the solution 5. Impact of the solution, including extensibility and uniqueness of results, and the extent to which the presented solution pushes the envelope in scalable computing 6. Analysis of solution and technology employed compared to related approaches In addition to the above, finalists will be judged on the quality of their presentation, which shall include a 5-minute demonstration, as well as their responses to questions by a technical committee. Papers will be shortlisted using the above 6 points as merit criteria, and up to 6 papers will be invited to compete in a final round at CCGrid 2012. Selected teams will receive an award of up to $1000 to help with travel to the conference. At least one member from each selected team will be expected to present and demonstrate their project at CCGrid 2012. Participation from students and young researchers, especially in leadership roles, is strongly encouraged. Awards: First prize: Plaque + $1000 Second prize: Plaque + $500 Tentative timeline: The deadline for submitting proposals is 15 March, 2012. Decisions: 01 April 2012. Final presentation/demo: 13-16 May, 2012. Coordinator: Shantenu Jha, Rutgers University, USA, shantenu dot jha at rutgers dot edu -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= From wilde at mcs.anl.gov Sun Feb 26 12:43:30 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 26 Feb 2012 12:43:30 -0600 (CST) Subject: [Swift-devel] Example files for MATLAB parameter sweep In-Reply-To: <1915879243.33439.1330281166171.JavaMail.root@zimbra.anl.gov> Message-ID: <1065251823.33445.1330281810169.JavaMail.root@zimbra.anl.gov> Hi Lorenzo and Albert, You can find a new tutorial example of a parameter sweep by following the README at: https://svn.ci.uchicago.edu/svn/vdl2/trunk/examples/tutorial/ParameterSweep/README (which is also pasted below). This is a simple example which you can run on any local host (e.g. sandbox.beagle) after you do "module load swift". Over time we will add this to the Swift tutorial document, test it, etc. Lorenzo: this is meant to give you and Albert a base example (non-MATLAB) from which you can create the MATLAB example(s). Ideally we will grow this into a tutorial sequence that shows a few useful variations of organizing a parameter sweep or ensemble of simulations, including passing parameters only via files, or via a combination of Swift variables and files. We welcome your help in developing this, starting with the MATLAB version of it. The first thing for that would be to develop the MATLAB replacements for gensweep.sh and simulate.sh. 
These two "apps" are meant to be stand-ins for the equivalent MATLAB programs. They use a simple two-column "name value" file format to simulate a .mat file. I'll add you to the Swift committers list so you can place anything you add in SVN.

David: this doesn't yet use gensites. Can we add gensites without adding any complexity to the sweep.sh script? Or do we want a version with and without? Hopefully only with. We should extend to use PADS, Fusion, Beagle, MCS servers, FutureGrid, TrySwift, and more. Can you start adding this to the tutorial asciidoc?

Jon: I simplified the handling of run dir creation. Maybe we can refit this into swiftopt.sh?

We can do this as a collaborative exercise because the result will be of great benefit to all new Swift users, MATLAB and non-MATLAB alike. In fact we should do a version of it for Python, Octave, MATLAB, and R.

Regards,

- Mike

$ cat README

This directory contains an example of running a "parameter sweep" or "ensemble" of N simulations or "members".

To run:

# make sure Swift 0.93 or trunk is in your $PATH

svn co https://svn.ci.uchicago.edu/svn/vdl2/trunk/examples/tutorial/ParameterSweep
cd ParameterSweep
./sweep.sh # Runs default sweep of 5 members with 3 common data/parameter files
./sweep.sh -nMembers=20 -nCommon=2 # 20 members with 2 common data/parameter files

# Each run is executed in a new unique runNNN directory: run001, run002, ...
# tc, sites file (local.xml), and Swift properties files (cf) are generated by sweep.sh
$

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From jonmon at mcs.anl.gov Sun Feb 26 13:13:41 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Sun, 26 Feb 2012 13:13:41 -0600
Subject: [Swift-devel] Example files for MATLAB parameter sweep
In-Reply-To: <1065251823.33445.1330281810169.JavaMail.root@zimbra.anl.gov>
References: <1065251823.33445.1330281810169.JavaMail.root@zimbra.anl.gov>
Message-ID: <0D14FE27-7B46-440B-B371-A43A49D7F119@mcs.anl.gov>

On Feb 26, 2012, at 12:43 PM, Michael Wilde wrote:

> Jon: I simplified the handling of run dir creation. Maybe we can refit this into swiftopt.sh?

This has been done.

From jonmon at mcs.anl.gov Sun Feb 26 17:12:18 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Sun, 26 Feb 2012 17:12:18 -0600
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <91121EC0-D9A6-4C65-8AE3-29C1F37ED18E@mcs.anl.gov>
References: <1329775876.6072.1.camel@blabla> <1329776194.6072.2.camel@blabla> <1329776652.6072.3.camel@blabla> <2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov> <62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov> <1329949702.23375.0.camel@blabla> <5B2D062C-8E9B-4DA3-A317-7FE5EAC2DABB@mcs.anl.gov> <22E47E80-C5C5-46DA-80CA-1A6063E90727@mcs.anl.gov> <91121EC0-D9A6-4C65-8AE3-29C1F37ED18E@mcs.anl.gov>
Message-ID: <77EC683C-7A68-4FB0-AD6D-C229196126E3@mcs.anl.gov>

I have again updated the bug: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=720

There are now steps for reproducing it with a small test from the application we are running. The steps outlined set up the application to be run on whatever machine you are testing on.

This turns out not to be a coaster bug but a Swift bug.
The test in /gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run014 is a local test, it did not use coasters at all and still the hang checker kicked in. On Feb 24, 2012, at 4:09 PM, Jonathan Monette wrote: > I have updated the bugzilla bug with the below directories: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=720 > > I have also added another directory showing the same behavior > /gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run054 > > There is a jstack.log file in that directory. All three of the run directories show that jobs get stuck in the initialized state and the hang checker kicks in. > > On Feb 22, 2012, at 5:05 PM, Jonathan Monette wrote: > >> This has been done. I have also moved the run that Ketan had produced to PADS. >> >> /gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run002 <-----Ketan's run >> /gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run047 <-----My run(has a jstack.log file, also more recent) >> >> On Feb 22, 2012, at 4:33 PM, Jonathan Monette wrote: >> >>> Ok. I have killed the process and I am in the process of copying the run directory from the lustre file system on Beagle to /gpfs/pads >>> >>> On Feb 22, 2012, at 4:28 PM, Mihael Hategan wrote: >>> >>>> On Wed, 2012-02-22 at 15:45 -0600, Jonathan Monette wrote: >>>>> Mihael, >>>>> I have a hung Java process showing this error right now, 2 jobs are >>>>> stuck in the initializing state. I have a jstack -l of this >>>>> hung java process. Is there anything else you need before I kill it? >>>>> Do you need any other probing information from this process other than >>>>> this jstack output? >>>> >>>> I don't think so. 
>>>> >>>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Mon Feb 27 12:58:05 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 27 Feb 2012 12:58:05 -0600 (CST) Subject: [Swift-devel] Did we add: Message-ID: <1963723478.36503.1330369085034.JavaMail.root@zimbra.anl.gov> Hi All, Does anyone know if the enhancement described in bug 359 ("Add ability to set ENV vars, maxwalltime, and RAM requirements on app invocation") was ever done? https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=359 I thought some form of it *was*, but I cant find any discussion of that feature in the devel archive, my email, or bugzilla. Was this just wishful thinking, or does some form of the ability to set profile values on a per-app-call basis actually exist? Thanks, - Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From ketancmaheshwari at gmail.com Mon Feb 27 13:03:55 2012 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Mon, 27 Feb 2012 13:03:55 -0600 Subject: [Swift-devel] Did we add: In-Reply-To: <1963723478.36503.1330369085034.JavaMail.root@zimbra.anl.gov> References: <1963723478.36503.1330369085034.JavaMail.root@zimbra.anl.gov> Message-ID: I think this on sites.xml: is intended to do env for application. 
On Mon, Feb 27, 2012 at 12:58 PM, Michael Wilde wrote: > Hi All, > > Does anyone know if the enhancement described in bug 359 ("Add ability to > set ENV vars, maxwalltime, and RAM requirements on app invocation") was > ever done? > > https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=359 > > I thought some form of it *was*, but I cant find any discussion of that > feature in the devel archive, my email, or bugzilla. Was this just wishful > thinking, or does some form of the ability to set profile values on a > per-app-call basis actually exist? > > Thanks, > > - Mike > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Mon Feb 27 13:16:21 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 27 Feb 2012 13:16:21 -0600 (CST) Subject: [Swift-devel] Did we add: dynamic profile entries? In-Reply-To: Message-ID: <1797820808.36665.1330370181668.JavaMail.root@zimbra.anl.gov> Ketan, That element sets a profile entry for all jobs on a site. Setting the profile on a tc entry sets it for all calls of the given app. What bug 359 is asking for is the ability to set profile entries dynamically on a per-app-invocation basis, e.g. to give each invocation a specific time or memory limit or env var value, dynamically calculated in the Swift script. - Mike ----- Original Message ----- > From: "Ketan Maheshwari" > To: "Michael Wilde" > Cc: "Swift Devel" > Sent: Monday, February 27, 2012 1:03:55 PM > Subject: Re: [Swift-devel] Did we add: > I think this on sites.xml: > > > > > is intended to do env for application. 
> > > > On Mon, Feb 27, 2012 at 12:58 PM, Michael Wilde < wilde at mcs.anl.gov > > wrote: > > > Hi All, > > Does anyone know if the enhancement described in bug 359 ("Add ability > to set ENV vars, maxwalltime, and RAM requirements on app invocation") > was ever done? > > https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=359 > > I thought some form of it *was*, but I cant find any discussion of > that feature in the devel archive, my email, or bugzilla. Was this > just wishful thinking, or does some form of the ability to set profile > values on a per-app-call basis actually exist? > > Thanks, > > - Mike > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > -- > Ketan -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wozniak at mcs.anl.gov Mon Feb 27 13:42:00 2012 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Mon, 27 Feb 2012 13:42:00 -0600 (CST) Subject: [Swift-devel] Did we add: In-Reply-To: References: <1963723478.36503.1330369085034.JavaMail.root@zimbra.anl.gov> Message-ID: It's in there: http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_dynamic_profiles Justin On Mon, 27 Feb 2012, Ketan Maheshwari wrote: > I think this on sites.xml: > > > > is intended to do env for application. > > > On Mon, Feb 27, 2012 at 12:58 PM, Michael Wilde wrote: > >> Hi All, >> >> Does anyone know if the enhancement described in bug 359 ("Add ability to >> set ENV vars, maxwalltime, and RAM requirements on app invocation") was >> ever done? 
>> >> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=359 >> >> I thought some form of it *was*, but I cant find any discussion of that >> feature in the devel archive, my email, or bugzilla. Was this just wishful >> thinking, or does some form of the ability to set profile values on a >> per-app-call basis actually exist? >> >> Thanks, >> >> - Mike >> >> >> >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> > > > > -- Justin M Wozniak From wilde at mcs.anl.gov Mon Feb 27 14:16:58 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 27 Feb 2012 14:16:58 -0600 (CST) Subject: [Swift-devel] Did we add: In-Reply-To: Message-ID: <808444397.37136.1330373818039.JavaMail.root@zimbra.anl.gov> Awesome - thanks, I recall now! - Mike ----- Original Message ----- > From: "Justin M Wozniak" > To: "Ketan Maheshwari" > Cc: "Swift Devel" > Sent: Monday, February 27, 2012 1:42:00 PM > Subject: Re: [Swift-devel] Did we add: > It's in there: > > http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_dynamic_profiles > > Justin > > On Mon, 27 Feb 2012, Ketan Maheshwari wrote: > > > I think this on sites.xml: > > > > > > > > is intended to do env for application. > > > > > > On Mon, Feb 27, 2012 at 12:58 PM, Michael Wilde > > wrote: > > > >> Hi All, > >> > >> Does anyone know if the enhancement described in bug 359 ("Add > >> ability to > >> set ENV vars, maxwalltime, and RAM requirements on app invocation") > >> was > >> ever done? > >> > >> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=359 > >> > >> I thought some form of it *was*, but I cant find any discussion of > >> that > >> feature in the devel archive, my email, or bugzilla. 
Was this just > >> wishful > >> thinking, or does some form of the ability to set profile values on > >> a > >> per-app-call basis actually exist? > >> > >> Thanks, > >> > >> - Mike > >> > >> > >> > >> > >> -- > >> Michael Wilde > >> Computation Institute, University of Chicago > >> Mathematics and Computer Science Division > >> Argonne National Laboratory > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > >> > > > > > > > > > > -- > Justin M Wozniak > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From svemalayan at yahoo.com Mon Feb 27 19:33:27 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Mon, 27 Feb 2012 17:33:27 -0800 (PST) Subject: [Swift-devel] coaster-service.conf In-Reply-To: <1330042098.422.YahooMailNeo@web39507.mail.mud.yahoo.com> References: <1330042098.422.YahooMailNeo@web39507.mail.mud.yahoo.com> Message-ID: <1330392807.45327.YahooMailNeo@web39506.mail.mud.yahoo.com> Hi All, When I launch coaster-service, it allocates job with default kernel profile "zeptoos". I want to change the profile to"zepto-vn-eval/mosatest". How can I do this ? Is there any configuration parameter available to change qsub command? If so how I can specify / pass this parameter?? (via start-coaster-service's command-line parameter or via a setting in coaster-service.conf) Also could you please tell me the exact format ? Please point to me a document if there is any. Thank you Emalayan -------------- next part -------------- An HTML attachment was scrubbed... 
From jonmon at mcs.anl.gov Mon Feb 27 20:13:18 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 27 Feb 2012 20:13:18 -0600
Subject: [Swift-devel] [Swift-user] coaster-service.conf
In-Reply-To: <1330392807.45327.YahooMailNeo@web39506.mail.mud.yahoo.com>
References: <1330042098.422.YahooMailNeo@web39507.mail.mud.yahoo.com> <1330392807.45327.YahooMailNeo@web39506.mail.mud.yahoo.com>
Message-ID: <53278513-44C1-48AF-9026-92F9A34487F1@mcs.anl.gov>

I assume you mean the "-k" option of the cqsub command. Currently this is hard-coded in the start-coaster-service script, which always uses zeptoos. I have made a quick change that uses a KERNEL variable, which needs to be set in the coaster-service.conf file. Are you using your own checkout of trunk or the one in Justin's home directory?

On Feb 27, 2012, at 7:33 PM, Emalayan Vairavanathan wrote:

> Hi All,
>
> When I launch coaster-service, it allocates job with default kernel profile "zeptoos". I want to change the profile to "zepto-vn-eval/mosatest".
> How can I do this ? Is there any configuration parameter available to change qsub command?
>
> If so how I can specify / pass this parameter? (via start-coaster-service 's command-line parameter or via a setting in coaster-service.conf)
>
> Also could you please tell me the exact format ? Please point to me a document if there is any.
>
>
> Thank you
> Emalayan
>
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
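As an illustration (not taken from the thread), the KERNEL change Jon mentions could look roughly like the sketch below. It assumes coaster-service.conf is plain shell that the start script sources. The KERNEL variable name and the cqsub "-k" option come from the message; the kernel_arg helper, the fallback to zeptoos when KERNEL is unset, and the example value are assumptions, not the actual start-coaster-service code.

```shell
#!/bin/sh
# Hypothetical helper: picks the kernel profile to pass to `cqsub -k`.
# Assumes coaster-service.conf may contain a line such as:
#   KERNEL=zepto-vn-eval/mosatest
kernel_arg() {
    # Fall back to the previously hard-coded profile when KERNEL is unset or empty.
    printf '%s\n' "${KERNEL:-zeptoos}"
}

# Show the option that would be handed to cqsub; nothing is submitted here.
echo "cqsub -k $(kernel_arg) ..."
```

Defaulting to zeptoos would keep existing conf files working unchanged, which may or may not match what was actually committed.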
From svemalayan at yahoo.com Mon Feb 27 21:06:33 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Mon, 27 Feb 2012 19:06:33 -0800 (PST)
Subject: [Swift-devel] [Swift-user] coaster-service.conf
In-Reply-To: <53278513-44C1-48AF-9026-92F9A34487F1@mcs.anl.gov>
References: <1330042098.422.YahooMailNeo@web39507.mail.mud.yahoo.com> <1330392807.45327.YahooMailNeo@web39506.mail.mud.yahoo.com> <53278513-44C1-48AF-9026-92F9A34487F1@mcs.anl.gov>
Message-ID: <1330398393.5937.YahooMailNeo@web39503.mail.mud.yahoo.com>

Hi Jon,

Thank you very much. Please find my answers below.

> I assume you mean the "-k" option in the cqsub command.

What I meant was the "--kernel" option in qsub.

> Are you using your own checkout of trunk or the one in Justin's home directory?

The code from Justin's home directory.

Thank you very much.
Emalayan

________________________________
From: Jonathan Monette
To: Emalayan Vairavanathan
Cc: swift user ; "swift-devel at ci.uchicago.edu" ; MosaStore
Sent: Monday, 27 February 2012 6:13 PM
Subject: Re: [Swift-user] coaster-service.conf

I assume you mean the "-k" option in the cqsub command. So currently this is hard coded into the start-coaster-service script, it always uses zeptoos. I have made a quick change that uses a KERNEL variable that needs to be set in the coaster-service.conf file. Are you using your own checkout of trunk or the one in Justin's home directory?

On Feb 27, 2012, at 7:33 PM, Emalayan Vairavanathan wrote:

Hi All,
>
>
>When I launch coaster-service, it allocates job with default kernel profile "zeptoos". I want to change the profile to "zepto-vn-eval/mosatest".
>How can I do this ? Is there any configuration parameter available to change qsub command?
>
>
>If so how I can specify / pass this parameter? (via start-coaster-service's command-line parameter or via a setting in coaster-service.conf)
>
>Also could you please tell me the exact format ? Please point to me a document if there is any.
> > > > >Thank you >Emalayan > > >_______________________________________________ >Swift-user mailing list >Swift-user at ci.uchicago.edu >https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonmon at mcs.anl.gov Mon Feb 27 21:39:14 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Mon, 27 Feb 2012 21:39:14 -0600 Subject: [Swift-devel] [Swift-user] coaster-service.conf In-Reply-To: <1330398393.5937.YahooMailNeo@web39503.mail.mud.yahoo.com> References: <1330042098.422.YahooMailNeo@web39507.mail.mud.yahoo.com> <1330392807.45327.YahooMailNeo@web39506.mail.mud.yahoo.com> <53278513-44C1-48AF-9026-92F9A34487F1@mcs.anl.gov> <1330398393.5937.YahooMailNeo@web39503.mail.mud.yahoo.com> Message-ID: <37C4069A-300B-4FB8-BCFD-C8E97143FDBA@mcs.anl.gov> On Feb 27, 2012, at 9:06 PM, Emalayan Vairavanathan wrote: > Hi Jon > > Thank you very much. Please find my ans below. > > I assume you mean the "-k" option in the cqsub command. > > What I meant was "--kernel" option in qsub. So the start-coaster-service for cobalt uses cqsub and not qsub. qsub has the --kernel option while cqsub has the -k option. Looking at the man pages they seem to be the same, but not sure. Justin can probably provide more information on the difference, or point to a document that explains when one should be used over the other. > > > Are you using your own checkout of trunk or the one in Justin's home directory? > > Code from Justin's home directory Then Justin will have to do a svn up tomorrow for you. We probably should figure out a way for you to have your own stable copy to make changes too, so that us updating that copy is not a blocker for you progressing further. > > > Thank you very much. 
> Emalayan > From: Jonathan Monette > To: Emalayan Vairavanathan > Cc: swift user ; "swift-devel at ci.uchicago.edu" ; MosaStore > Sent: Monday, 27 February 2012 6:13 PM > Subject: Re: [Swift-user] coaster-service.conf > > I assume you mean the "-k" option in the cqsub command. So currently this is hard coded into the start-coaster-service script, it always uses zeptoos. I have made a quick change that uses a KERNEL variable that needs to be set in the coaster-service.conf file. Are you using your own checkout of trunk or the one in Justin's home directory? > > On Feb 27, 2012, at 7:33 PM, Emalayan Vairavanathan wrote: > >> Hi All, >> >> When I launch coaster-service, it allocates job with default kernel profile "zeptoos". I want to change the profile to "zepto-vn-eval/mosatest". >> How can I do this ? Is there any configuration parameter available to change qsub command? >> >> If so how I can specify / pass this parameter? (via start-coaster-service 's command-line parameter or via a setting in coaster-service.conf) >> >> Also could you please tell me the exact format ? Please point to me a document if there is any. >> >> >> Thank you >> Emalayan >> >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From svemalayan at yahoo.com Mon Feb 27 22:59:20 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Mon, 27 Feb 2012 20:59:20 -0800 (PST) Subject: [Swift-devel] [Swift-user] coaster-service.conf In-Reply-To: <37C4069A-300B-4FB8-BCFD-C8E97143FDBA@mcs.anl.gov> References: <1330042098.422.YahooMailNeo@web39507.mail.mud.yahoo.com> <1330392807.45327.YahooMailNeo@web39506.mail.mud.yahoo.com> <53278513-44C1-48AF-9026-92F9A34487F1@mcs.anl.gov> <1330398393.5937.YahooMailNeo@web39503.mail.mud.yahoo.com> <37C4069A-300B-4FB8-BCFD-C8E97143FDBA@mcs.anl.gov> Message-ID: <1330405160.59260.YahooMailNeo@web39508.mail.mud.yahoo.com> So the start-coaster-service for cobalt uses cqsub and not qsub. ?qsub has the --kernel option while cqsub has the -k option. ?Looking at the man pages they seem to be the same, but not sure. ?Justin can probably provide more information on the difference, or point to a document that explains when one should be used over the other. >> Thank youvery much for fixing it Jon. I am not sure about the trade-offs though. I can switch to cqsub if it is necessary. Then Justin will have to do a svn up tomorrow for you. ?We probably should figure out a way for you to have your own stable copy to make changes too, so that us updating that copy is not a blocker for you progressing further. >> Yes. We need to talk about this in Wednesday meeting. Justin could you please take an update ? Thank you Emalayan ________________________________ From: Jonathan Monette To: Emalayan Vairavanathan Cc: swift user ; "swift-devel at ci.uchicago.edu" ; MosaStore Sent: Monday, 27 February 2012 7:39 PM Subject: Re: [Swift-user] coaster-service.conf On Feb 27, 2012, at 9:06 PM, Emalayan Vairavanathan wrote: Hi Jon > > >Thank you very much. Please find my ans below. > > >I assume you mean the "-k" option in the cqsub command. > > > >What I meant was "--kernel" option in qsub. > So the start-coaster-service for cobalt uses cqsub and not qsub. 
?qsub has the --kernel option while cqsub has the -k option. ?Looking at the man pages they seem to be the same, but not sure. ?Justin can probably provide more information on the difference, or point to a document that explains when one should be used over the other. > > > >Are you using your own checkout of trunk or the one in Justin's home directory? > > >Code from Justin's home directory Then Justin will have to do a svn up tomorrow for you. ?We probably should figure out a way for you to have your own stable copy to make changes too, so that us updating that copy is not a blocker for you progressing further. > > > >Thank you very much. > >Emalayan > > >________________________________ > From: Jonathan Monette >To: Emalayan Vairavanathan >Cc: swift user ; "swift-devel at ci.uchicago.edu" ; MosaStore >Sent: Monday, 27 February 2012 6:13 PM >Subject: Re: [Swift-user] coaster-service.conf > > >I assume you mean the "-k" option in the cqsub command. ?So currently this is hard coded into the start-coaster-service script, it always uses zeptoos. ?I have made a quick change that uses a KERNEL variable that needs to be set in the coaster-service.conf file. ?Are you using your own checkout of trunk or the one in Justin's home directory? > > >On Feb 27, 2012, at 7:33 PM, Emalayan Vairavanathan wrote: > >Hi All, >> >> >>When I launch coaster-service, it allocates job with default kernel profile "zeptoos". I want to change the profile to"zepto-vn-eval/mosatest". >>How can I do this ? Is there any configuration parameter available to change qsub command? >> >> >>If so how I can specify / pass this parameter?? (via start-coaster-service's command-line parameter or via a setting in coaster-service.conf) >> >>Also could you please tell me the exact format ? Please point to me a document if there is any. 
>> >> >> >> >>Thank you >>Emalayan >> >> >>_______________________________________________ >>Swift-user mailing list >>Swift-user at ci.uchicago.edu >>https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From svemalayan at yahoo.com Tue Feb 28 13:53:57 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Tue, 28 Feb 2012 11:53:57 -0800 (PST) Subject: [Swift-devel] Running applications with Swift on Surveyor Message-ID: <1330458837.43066.YahooMailNeo@web39506.mail.mud.yahoo.com> Hi All, I have a quick question. It seems the step I was following to run the applications on BG/P with swift is different from the steps suggested by https://sites.google.com/site/exmproject/development/mosaswift. I was running applications+Swift from head node by just submitting a command below. swift -config cf? -tc.file tc -sites.file sites.xml ftdock.swift -n=1 -list=pdb.list -grid=10 I didnt start the coaster-service but my site file was using coaster as execution-provider (in site files). Then Swift allocated some nodes and executed the job and placed the result in my home directory. (My assumption here was coaster-service and workers will be started automatically by swift). But the above link suggests me to use persistent-coasters, changes to coaster-config files and also to start coaster-service in the head node. Basically I have three questions: 1) What is the different between Coasters and Persistent-Coasters? 2) How I was able to run the swift+application without starting the coaster-service, since coaster-service is expected to be started manually (according to the above link) ? Does swift use some other mechanisms to send a job if coaster-service is not started explicitly? 3) How I need to run my experiments in future with MosaStore and Swift ? Should I use Coasters / Persistent-Coasters ? 
Thank you
Emalayan

From jonmon at mcs.anl.gov Tue Feb 28 14:21:18 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Tue, 28 Feb 2012 14:21:18 -0600
Subject: [Swift-devel] Running applications with Swift on Surveyor
In-Reply-To: <1330458837.43066.YahooMailNeo@web39506.mail.mud.yahoo.com>
References: <1330458837.43066.YahooMailNeo@web39506.mail.mud.yahoo.com>
Message-ID: <68E4F9BB-E06E-4057-84B2-AC38FF14396E@mcs.anl.gov>

Hey Emalayan,

My answers are below.

On Feb 28, 2012, at 1:53 PM, Emalayan Vairavanathan wrote:

> Hi All,
>
> I have a quick question.
>
> It seems the step I was following to run the applications on BG/P with swift is different from the steps suggested by https://sites.google.com/site/exmproject/development/mosaswift. I was running applications+Swift from head node by just submitting a command below.
>
> swift -config cf -tc.file tc -sites.file sites.xml ftdock.swift -n=1 -list=pdb.list -grid=10
>
> I didnt start the coaster-service but my site file was using coaster as execution-provider (in site files). Then Swift allocated some nodes and executed the job and placed the result in my home directory. (My assumption here was coaster-service and workers will be started automatically by swift).
>
> But the above link suggests me to use persistent-coasters, changes to coaster-config files and also to start coaster-service in the head node.
>
>
> Basically I have three questions:
>
> 1) What is the different between Coasters and Persistent-Coasters?

The mechanism is named Coaster. The "persistent" part of the name refers to the workers: they persist across Swift executions. You can reuse the same workers you started with the 'start-coaster-service' script for many Swift executions. When not running in persistent mode (i.e. the automatic mode), the coaster service and the workers are killed before Swift completes.

> 2) How I was able to run the swift+application without starting the coaster-service, since coaster-service is expected to be started manually (according to the above link) ? Does swift use some other mechanisms to send a job if coaster-service is not started explicitly?

If you just say the execution provider is coaster, then Swift will start the coaster service and the workers automatically. Swift will then shut down the coaster service and the workers once the Swift script is done executing. Persistent coasters will wait until you stop the service explicitly with the 'stop-coaster-service' script. This script will shut down the service and any workers connected to it.

> 3) How I need to run my experiments in future with MosaStore and Swift ? Should I use Coasters / Persistent-Coasters ?

I think you want to use the persistent-coaster mode, where you start the workers manually with the 'start-coaster-service' script.

>
> Thank you
> Emalayan
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Tue Feb 28 14:23:50 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 28 Feb 2012 14:23:50 -0600 (CST)
Subject: [Swift-devel] Making Swift run on Eureka
In-Reply-To: <1959174301.41783.1330460542638.JavaMail.root@zimbra.anl.gov>
Message-ID: <596189083.41792.1330460630800.JavaMail.root@zimbra.anl.gov>

I asked Jon to make Swift run (in automatic coaster mode, then manual coaster mode) on Eureka.

The first step is to find the status of the relevant ALCF Cobalt ticket against Cobalt on Eureka. That's listed in the Swift ticket for this problem:

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=245

If anyone has knowledge or advice, please reply.

Thanks, - Mike From svemalayan at yahoo.com Tue Feb 28 14:33:06 2012 From: svemalayan at yahoo.com (Emalayan Vairavanathan) Date: Tue, 28 Feb 2012 12:33:06 -0800 (PST) Subject: [Swift-devel] Running applications with Swift on Surveyor In-Reply-To: <68E4F9BB-E06E-4057-84B2-AC38FF14396E@mcs.anl.gov> References: <1330458837.43066.YahooMailNeo@web39506.mail.mud.yahoo.com> <68E4F9BB-E06E-4057-84B2-AC38FF14396E@mcs.anl.gov> Message-ID: <1330461186.83110.YahooMailNeo@web39504.mail.mud.yahoo.com> Great. Thank you very much Jon. I have a better understanding about both approaches now. ________________________________ From: Jonathan Monette To: Emalayan Vairavanathan Cc: "swift-devel at ci.uchicago.edu" ; MosaStore Sent: Tuesday, 28 February 2012 12:21 PM Subject: Re: [Swift-devel] Running applications with Swift on Surveyor Hey Emalayan, ? ?My answers are below. On Feb 28, 2012, at 1:53 PM, Emalayan Vairavanathan wrote: Hi All, > > >I have a quick question. > > > >It seems the step I was following to run the applications on BG/P with swift is different from the steps suggested by https://sites.google.com/site/exmproject/development/mosaswift. I was running applications+Swift from head node by just submitting a command below. > > >swift -config cf? -tc.file tc -sites.file sites.xml ftdock.swift -n=1 -list=pdb.list -grid=10 > > >I didnt start the coaster-service but my site file was using coaster as execution-provider (in site files). Then Swift allocated some nodes and executed the job and placed the result in my home directory. (My assumption here was coaster-service and workers will be started automatically by swift). > > > >But the above link suggests me to use persistent-coasters, changes to coaster-config files and also to start coaster-service in the head node. > > > > > >Basically I have three questions: > > >1) What is the different between Coasters and Persistent-Coasters? The mechanism is name Coaster. 
?The persistent part of the name is for the workers, they are persistent through swift executions. ?You can re-use the same workers you started with the 'start-coaster-service' script for many swift executions. ?When not running in a persistent mode(i.e. the automatic mode) they coaster service and the workers are killed before swift comes to completion. 2) How I was able to run the swift+application without starting the coaster-service, since coaster-service is expected to be started manually (according to the above link) ? Does swift use some other mechanisms to send a job if coaster-service is not started explicitly? So if you just say the execution provider is coaster, then Swift will start the coaster service and the workers automatically. ?Swift ?will then shut down the coaster service and the workers once they the swift script is done executing. ?Persistent-coasters will wait until you stop the service explicitly with the 'stop-coaster-service' script. ?This script will shutdown the service and any workers connected to it. 3) How I need to run my experiments in future with MosaStore and Swift ? Should I use Coasters / Persistent-Coasters ? > I think you want to use the persistent-coaster mode where you start the workers manually with the 'start-coaster-service' script. > >Thank you >Emalayan >_______________________________________________ >Swift-devel mailing list >Swift-devel at ci.uchicago.edu >https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... 
From wilde at mcs.anl.gov Tue Feb 28 23:07:35 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 28 Feb 2012 23:07:35 -0600 (CST)
Subject: [Swift-devel] Problems running Swift on BG/P
In-Reply-To: <969265748.43469.1330491577144.JavaMail.root@zimbra.anl.gov>
Message-ID: <1905074140.43476.1330492055348.JavaMail.root@zimbra.anl.gov>

Emalayan and I spent a considerable amount of time debugging Swift on Surveyor tonight.

As far as I can tell, after fixing a few config problems, it seems like the workers are unable to connect to the coaster service. They seem to be trying to connect on the correct address. The workers start and produce logs, but don't seem to make connections.

I noticed the following email thread:
http://lists.ci.uchicago.edu/pipermail/swift-devel/2010-December/007099.html

which talks about the sites attribute "alcfbgpnat" and states:
---
This code snippet may be of relevance:
    if (settings.getAlcfbgpnat()) {
        spec.addEnvironmentVariable("ZOID_ENABLE_NAT", "true");
    }

So you should set that env variable for the job if you want NAT.
---

Is this being done in the current start-coaster-service job? (Presumably it needs to be done in the cobalt job?)

We also noticed that Emalayan was unable to follow the standard recipe for logging into the compute nodes of a running job. He could get to the IOP, but from there, got something like "no route to host" when he tried to telnet (or ping?) to the compute nodes.

I'll check on the ZOID_ENABLE_NAT setting, but any thoughts?

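As an illustration (not part of the original message), one way to answer the "is this being done" question from inside the job is a small diagnostic like the one below. It is only a sketch: check_nat_env is a hypothetical helper, and it assumes the variable only needs to be visible in the environment of the process started by the cobalt job, as the quoted 2010 thread suggests.

```shell
#!/bin/sh
# Hypothetical diagnostic, not part of the Swift worker or start-coaster-service.
# Reports whether ZOID_ENABLE_NAT reached this process's environment.
check_nat_env() {
    if [ "${ZOID_ENABLE_NAT:-}" = "true" ]; then
        echo "ZOID_ENABLE_NAT=true"
    else
        echo "ZOID_ENABLE_NAT missing or not true: '${ZOID_ENABLE_NAT:-}'"
        return 1
    fi
}

# Log the result once at startup without aborting the job.
check_nat_env || true
```

Running this at the top of whatever command the job launches would leave a line in the job's stdout log showing what the compute-node process actually saw.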
Thanks, - Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Tue Feb 28 23:09:28 2012 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Tue, 28 Feb 2012 23:09:28 -0600 Subject: [Swift-devel] Problems running Swift on BG/P In-Reply-To: <1905074140.43476.1330492055348.JavaMail.root@zimbra.anl.gov> References: <1905074140.43476.1330492055348.JavaMail.root@zimbra.anl.gov> Message-ID: <74728418-B3CE-484A-A81D-2BBEE8199922@mcs.anl.gov> Is the internalHostname variable being set in the sites file? It should be set to the 172.*.* address returned from ifconfig On Feb 28, 2012, at 11:07 PM, Michael Wilde wrote: > Emalayan and I spent a considerable amount of time debugging Swift on surveyor tonight. > > As far as I can tell, after fixing a few config problems, it seems like the workers are unable to connect the coaster service. They seem to be trying to connect on the correct address. The workers start, and produce logs, but dont seem to make connections. > > I noticed the following email thread: > http://lists.ci.uchicago.edu/pipermail/swift-devel/2010-December/007099.html > > which talk about the sites attribute "alcfbgpnat" and state: > --- > This code snippet may be of relevance: > if (settings.getAlcfbgpnat()) { > spec.addEnvironmentVariable("ZOID_ENABLE_NAT", "true"); > } > > So you should set that env variable for the job if you want NAT. > --- > > Is this being done in the current start-coaster-service job? (Presumably needs to be done in the cobalt job?) > > We also noticed that Emalayan was unable to follow the standard recipe for logging into the compute nodes of a running job. He could get to the IOP, but from there, got something like "no route to host" when he tried to telnet (or ping?) to the compute nodes. > > I'll check on the ZOID_ENABLE_NAT setting, but any thoughts? 
>
> Thanks,
>
> - Mike
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Tue Feb 28 23:18:56 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 28 Feb 2012 23:18:56 -0600 (CST)
Subject: [Swift-devel] Problems running Swift on BG/P
In-Reply-To: <74728418-B3CE-484A-A81D-2BBEE8199922@mcs.anl.gov>
Message-ID: <597307474.43481.1330492736016.JavaMail.root@zimbra.anl.gov>

I asked Emalayan to set GLOBUS_HOSTNAME to that value. It's not being set in the sites file. But somehow that is getting through (I think) because the workers are trying to connect to that address.

The sites file was:

passive 4 1000 10000 /home/emalayan/work

I also see that start-coaster-service is trying to set ZOID_ENABLE_NAT:

ENV="WORKER_LOGGING_LEVEL=DEBUG:ZOID_ENABLE_NAT=true"
if [ -n $WORKER_ENVIRONMENT ]; then
    ENV+=:$WORKER_ENVIRONMENT
fi
set -x
cqsub -q ${QUEUE} \
      -k zeptoos \
      -t ${MAXTIME} \
      -n ${NODES} \
      -C ${PWD}/${LOG_DIR} \
      -E cobalt.${$}.stderr \
      -o cobalt.${$}.stdout \
      -e $ENV \
      $SWIFT_BIN/$WORKER $EXECUTION_URL $ID $PWD/$LOG_DIR

I'm thinking that one possibility is that without NAT enabled, the workers can't connect back to the login host's 172. network, which is a different subnet than the 172. net of the login host.

Jon, did this mechanism work for you?

Also, is it possible that somehow the ":"-separated envvars are not getting from cqsub to the job's environment? Could something have changed in cobalt in yesterday's maintenance window?
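
A side note on the start-coaster-service snippet above: the test [ -n $WORKER_ENVIRONMENT ] is unquoted, so when the variable is empty the shell evaluates [ -n ], which is true, and ENV would pick up a trailing ":". Whether that matters to cqsub -e has not been verified here. A minimal sketch of a quoted version, assuming the same ":"-separated format; build_env and env_list are hypothetical names for illustration, not part of start-coaster-service:

```shell
# Sketch only: build_env is a hypothetical helper, not part of start-coaster-service.
# It builds the ":"-separated NAME=value list that would be passed to cqsub -e.
build_env() {
    env_list="WORKER_LOGGING_LEVEL=DEBUG:ZOID_ENABLE_NAT=true"
    # Quote the argument: an unquoted [ -n $1 ] is true even when $1 is empty.
    if [ -n "$1" ]; then
        env_list="$env_list:$1"
    fi
    printf '%s\n' "$env_list"
}

# With no extra worker environment, no trailing ":" is added.
build_env ""
# With an extra setting, it is appended after a single ":".
build_env "WORKER_INIT_CMD=/home/wilde/bin/worker-init.pl"
```

The result of build_env could then be passed as the argument to -e in place of $ENV, which would at least remove the trailing separator as a possible variable in the debugging.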
- Mike ----- Original Message ----- > From: "Jonathan Monette" > To: "Michael Wilde" > Cc: "Justin M Wozniak" , "Jonathan Monette" , emalayan at ece.ubc.ca, "Matei > Ripeanu" , "Swift Devel" > Sent: Tuesday, February 28, 2012 11:09:28 PM > Subject: Re: [Swift-devel] Problems running Swift on BG/P > Is the internalHostname variable being set in the sites file? It > should be set to the 172.*.* address returned from ifconfig > > On Feb 28, 2012, at 11:07 PM, Michael Wilde wrote: > > > Emalayan and I spent a considerable amount of time debugging Swift > > on surveyor tonight. > > > > As far as I can tell, after fixing a few config problems, it seems > > like the workers are unable to connect the coaster service. They > > seem to be trying to connect on the correct address. The workers > > start, and produce logs, but dont seem to make connections. > > > > I noticed the following email thread: > > http://lists.ci.uchicago.edu/pipermail/swift-devel/2010-December/007099.html > > > > which talk about the sites attribute "alcfbgpnat" and state: > > --- > > This code snippet may be of relevance: > > if (settings.getAlcfbgpnat()) { > > spec.addEnvironmentVariable("ZOID_ENABLE_NAT", "true"); > > } > > > > So you should set that env variable for the job if you want NAT. > > --- > > > > Is this being done in the current start-coaster-service job? > > (Presumably needs to be done in the cobalt job?) > > > > We also noticed that Emalayan was unable to follow the standard > > recipe for logging into the compute nodes of a running job. He could > > get to the IOP, but from there, got something like "no route to > > host" when he tried to telnet (or ping?) to the compute nodes. > > > > I'll check on the ZOID_ENABLE_NAT setting, but any thoughts? 
> >
> > Thanks,
> >
> > - Mike
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From zhaozhang at uchicago.edu Tue Feb 28 23:21:14 2012
From: zhaozhang at uchicago.edu (ZHAO ZHANG)
Date: Tue, 28 Feb 2012 23:21:14 -0600
Subject: [Swift-devel] Problems running Swift on BG/P
In-Reply-To: <1905074140.43476.1330492055348.JavaMail.root@zimbra.anl.gov>
References: <1905074140.43476.1330492055348.JavaMail.root@zimbra.anl.gov>
Message-ID: <4F4DB5CA.8070301@uchicago.edu>

Hi, Mike, All,

Please refer to http://wiki.mcs.anl.gov/zeptoos/index.php/FAQ#How_to_open_a_socket_from_a_CN_to_the_outside_world for the NAT feature of ZeptoOS.
It can be enabled on the cqsub command line. Keep in mind that if we use this feature, we have to start a server on the login node and let the compute nodes connect to the server socket. Once the server socket has accepted the connection, it can send messages back.

To access CNs from the IO node, we need to use the tree network, which ranges from 192.168.1.1 to 192.168.1.64. There is an overlay mapping between the tree network and the torus network, but I never figured it out. We could work around the problem by logging in to one of the compute nodes, then telnetting to the torus network address.

A simple example: we could log in to 192.168.1.64. PS: at any scale, 192.168.1.68 in the first pset is always the one with Rank 0. From there, we could log in to 12.0.0.2, etc.

best
zhao

On 2/28/2012 11:07 PM, Michael Wilde wrote:
> Emalayan and I spent a considerable amount of time debugging Swift on surveyor tonight.
> > As far as I can tell, after fixing a few config problems, it seems like the workers are unable to connect the coaster service. They seem to be trying to connect on the correct address. The workers start, and produce logs, but dont seem to make connections. > > I noticed the following email thread: > http://lists.ci.uchicago.edu/pipermail/swift-devel/2010-December/007099.html > > which talk about the sites attribute "alcfbgpnat" and state: > --- > This code snippet may be of relevance: > if (settings.getAlcfbgpnat()) { > spec.addEnvironmentVariable("ZOID_ENABLE_NAT", "true"); > } > > So you should set that env variable for the job if you want NAT. > --- > > Is this being done in the current start-coaster-service job? (Presumably needs to be done in the cobalt job?) > > We also noticed that Emalayan was unable to follow the standard recipe for logging into the compute nodes of a running job. He could get to the IOP, but from there, got something like "no route to host" when he tried to telnet (or ping?) to the compute nodes. > > I'll check on the ZOID_ENABLE_NAT setting, but any thoughts? > > Thanks, > > - Mike > From davidk at ci.uchicago.edu Tue Feb 28 23:34:14 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Tue, 28 Feb 2012 23:34:14 -0600 (CST) Subject: [Swift-devel] Making Swift run on Eureka In-Reply-To: <596189083.41792.1330460630800.JavaMail.root@zimbra.anl.gov> Message-ID: <1552374827.140785.1330493654834.JavaMail.root@zimbra-mb2.anl.gov> I don't know the details of this bug, but I remember seeing this email a few months ago if it helps.. ---- Original Message ----- > From: "Paul Rich" > To: "Michael Wilde" > Cc: support at alcf.anl.gov, "Robert Jacob" , "swift-devel" , "Andrew > Cherry" > Sent: Wednesday, June 22, 2011 2:39:02 PM > Subject: Re: [Swift-devel] [alcf-support #60887] Can Cobalt command-line bug on Eureka be fixed? 
> Michael, > > I wanted to let you know that a recent patch to Cobalt on Eureka > should allow you to pass command-line arguments into the program > supplied to the Cobalt job. Let us know if you encounter any further > difficulties, and I am sorry that this took so long to deploy. > > Thank you for your patience, > > -- > Paul Rich > ALCF Operations -- AIG > richp at alcf.anl.gov > > > ----- Original Message ----- > From: "Michael Wilde" > To: "Paul M. Rich" , "Andrew Cherry" > > Cc: "swift-devel" , "Robert Jacob" > , support at alcf.anl.gov > Sent: Tuesday, January 11, 2011 7:30:30 PM > Subject: Re: [alcf-support #60887] Can Cobalt command-line bug on > Eureka be fixed? > > Paul, Andrew, > > What I think we're going to do on this from the Swift side is > temporarily try to use Eureka in a mode where we manually start Swift > workers on the cluster using a batch job. > > We'll wait on testing the Swift Cobolt interface (which is different > than the above) until we hear from you that the bug is fixed and ready > for testing. 
>
> So even though it may be many weeks or more away, we'd like to put in
> our vote for fixing this issue (realizing that you have many other
> priorities :)
>
> Thanks,
>
> Mike

From svemalayan at yahoo.com Tue Feb 28 23:44:08 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Tue, 28 Feb 2012 21:44:08 -0800 (PST)
Subject: [Swift-devel] WORKER_INIT_CMD - with log file
In-Reply-To: <1330043315.75850.YahooMailNeo@web39506.mail.mud.yahoo.com>
References: <1329174515.43886.YahooMailNeo@web39505.mail.mud.yahoo.com> <1329771862.41926.YahooMailNeo@web39501.mail.mud.yahoo.com> <1329772390.14447.YahooMailNeo@web39504.mail.mud.yahoo.com> <8742CFD8-8C54-4D67-A0D0-945BB6AF9948@mcs.anl.gov> <1329781225.86597.YahooMailNeo@web39505.mail.mud.yahoo.com> <1330041782.64906.YahooMailNeo@web39506.mail.mud.yahoo.com> <1330043315.75850.YahooMailNeo@web39506.mail.mud.yahoo.com>
Message-ID: <1330494248.2990.YahooMailNeo@web39501.mail.mud.yahoo.com>

Hi All,

Today I tried to run the 001-catsn-surveyor.swift script with Mike's help. But I am still facing some issues and it would be great if you can shed some light on this.

Brief overview of what I am doing: I am trying to run a simple swift script (001-catsn-surveyor.swift) with persistent-coasters. The goal is to try out more complex applications such as Montage and ModFTDock and then ultimately integrate MosaStore+Swift+Applications.

Current Problem: Workers could not connect to coaster-service with 001-catsn-surveyor.swift.

Steps taken: I was doing the steps below (in Surveyor with Swift available in ~/wozniak/Public/swift/bin/swift).

1) Set the port-numbers, IP address and Nodes in coaster-service.conf
    (LOCAL_PORT=22346, SERVICE_PORT=22356, IPADDR=172.17.3.12 and NODES=3)
2) Set environment variable GLOBUS_HOSTNAME=172.17.3.12
3) Launched coaster-service from 172.17.3.12 (inside ~/emalayan/app/swift-test folder)
4) Launched 001-catsn-surveyor.swift using run.sh from 172.17.3.12 (inside ~/emalayan/app/swift-test folder)

But the 001-catsn-surveyor.swift did not make any progress. I observed via cqstat that the nodes were allocated and running. When I checked the worker log files I found that the workers were unable to connect to coaster-service. Then I tried to connect to the compute-nodes to see whether workers are actually running there. But I could not connect to compute-nodes from the IO node.

I repeated the same steps again with NODES=64 and 1024 to see whether this problem (inability to connect to coaster-service) is coupled with the number of nodes setting in coaster-service.conf (which was initially 3). But I observed the same behavior.

In order to find out whether this is because of some network configuration issues in Surveyor, I tried to run ModFTDock+Swift (available in ~/emalayan/app/forEmalayan_ccGrdid) with coasters. It ran successfully and I was also able to connect to compute nodes without any issues during the application run.

You can find the 001-catsn-surveyor.swift script, config files and log files inside ~/emalayan/app/swift-test and ~/emalayan/app/swift-test/log folders.

I highly appreciate your input. Please let me know if you have questions.

Thank you
Emalayan

From: Emalayan Vairavanathan
To: Justin M Wozniak
Cc: Jonathan Monette ; swift user ; matei
Sent: Thursday, 23 February 2012 4:28 PM
Subject: Re: WORKER_INIT_CMD - with log file

Thank you. I was not aware of that. Now I am getting the error below. Am I missing some configuration?

Thank you
Emalayan

Swift trunk swift-r5662 (swift modified locally) cog-r3361 (cog modified locally)

RunID: 20120224-0021-jtfvfc90
Progress:
time: Fri, 24 Feb 2012 00:21:32 +0000
Find: http://172.17.3.12:12346
Find:  keepalive(120), reconnect - http://172.17.3.12:12346
Failed to transfer wrapper log for job cat-leiykink
Failed to transfer wrapper log for job cat-qeiykink
Failed to transfer wrapper log for job cat-jeiykink
Failed to transfer wrapper log for job cat-meiykink
Failed to transfer wrapper log for job cat-oeiykink
Failed to transfer wrapper log for job cat-peiykink
Failed to transfer wrapper log for job cat-keiykink
Failed to transfer wrapper log for job cat-seiykink
Failed to transfer wrapper log for job cat-neiykink
Progress:  time: Fri, 24 Feb 2012 00:21:33 +0000  Stage in:3  Submitting:5 Failed but can retry:2
Failed to transfer wrapper log for job cat-reiykink
Failed to transfer wrapper log for job cat-3fiykink
Failed to transfer wrapper log for job cat-7fiykink
Failed to transfer wrapper log for job cat-8fiykink
Failed to transfer wrapper log for job cat-bfiykink
Failed to transfer wrapper log for job cat-zeiykink
Failed to transfer wrapper log for job cat-0fiykink
Failed to transfer wrapper log for job cat-afiykink
Failed to transfer wrapper log for job cat-yeiykink
Failed to transfer wrapper log for job cat-1fiykink
Failed to transfer wrapper log for job cat-dfiykink
Failed to transfer wrapper log for job cat-ffiykink
EXCEPTION Exception in cat:
Arguments: [data.txt]
Host: persistent-coasters
Directory: 001-catsn-surveyor-20120224-0021-jtfvfc90/jobs/f/cat-ffiykink
stderr.txt:
stdout.txt:
----
Caused by: Task failed: null
org.globus.cog.karajan.workflow.service.channels.ChannelException: Channel died and no contact available
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:234)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:256)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226)
    at org.globus.cog.abstraction.coaster.service.job.manager.Node.getChannel(Node.java:132)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.submit(Cpu.java:258)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launchSequential(Cpu.java:213)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launch(Cpu.java:199)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.pull(Cpu.java:169)
    at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:114)
Execution failed:
    Failed to transfer wrapper log for job cat-mfiykink
EXCEPTION Exception in cat:
Arguments: [data.txt]
Host: persistent-coasters
Directory: 001-catsn-surveyor-20120224-0021-jtfvfc90/jobs/m/cat-mfiykink
stderr.txt:
stdout.txt:
----

________________________________
From: Justin M Wozniak
To: Emalayan Vairavanathan
Cc: Jonathan Monette ; swift user ; matei
Sent: Thursday, 23 February 2012 4:08 PM
Subject: Re: WORKER_INIT_CMD - with log file

Do the pool names agree?  You may want to check that you are using the right tc file.  You might need tc.persistent.data .

On Thu, 23 Feb 2012, Emalayan Vairavanathan wrote:

> Hi Justin,
>
> I copied swift-test from ~wozniak/Public/swift-test and tried to run it. I followed the steps below.
>
> - Modified the coaster-service-conf
>
> - Started the coaster service.
> - Started swift
>
>
> The tc.data has entries for cat but I am getting the error below. Do you have any ideas?
>
>
> Thank you
> Emalayan
>
> emalayan at login2.surveyor:~/swift-test> run.sh
>
> Swift trunk swift-r5662 (swift modified locally) cog-r3361 (cog modified locally)
>
> RunID: 20120224-0000-rec5hd4a
> Progress:
time: Fri, 24 Feb 2012 00:00:45 +0000 > EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog > Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException > EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog > Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException > EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog > Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException > EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog > Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException > EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog > Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException > EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog > Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException > EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog > Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException > EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog > Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException > EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog > Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException > Execution failed: > ?? ?EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog > Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException > The application "cat" is not available in the given site/pool in your tc.data catalog? 
> > > > > > ________________________________ > From: Justin M Wozniak > To: Jonathan Monette > Cc: Emalayan Vairavanathan ; matei > Sent: Tuesday, 21 February 2012 12:40 PM > Subject: Re: WORKER_INIT_CMD - with log file > > Hi guys > > Yes, that version is based on trunk and is up-to-date with > WORKER_INIT_CMD.? The recent bug fix for the BG/P is in there, I just > tested it. > > I moved the location to ~wozniak/Public/swift .? The test case I used is > in ~wozniak/Public/swift-test .? Both should be readable. > > ??? Justin > > On Mon, 20 Feb 2012, Jonathan Monette wrote: > >> It might....that is a question that Justin can answer.? If it doesn't I am sure the feature can be quickly added. >> >> On Feb 20, 2012, at 5:40 PM, Emalayan Vairavanathan wrote: >> >>> Hi Jon, >>> >>> I didn't try with the swift-version available in Justin's home directory. I can try and tell it now. >>> >>> But just a quick question: Does this version has WORKER_INIT_CMD ? >>> >>> Thank you >>> Emalayan >>> >>> From: Jonathan Monette >>> To: Emalayan Vairavanathan >>> Cc: Justin Wozniak ; matei >>> Sent: Monday, 20 February 2012 3:35 PM >>> Subject: Re: WORKER_INIT_CMD - with log file >>> >>> I suggest to use the swift version in Justin's directory. This is the stable version for the bg/p. If you already are using it, then let me debug further. >>> >>> What login host where you running on, login1 or login2? >>> >>> On Feb 20, 2012, at 3:13 PM, Emalayan Vairavanathan wrote: >>> >>>> >>>> Hi Jon and Justin, >>>> >>>> I checkout the swift-code from trunk and try to see whether ModFTDock+Swift works on Surveyor (without MosaStore). But the job did not complete for a long time. >>>> >>>> Could you please have a look ? >>>> >>>> (As I can remember, last time there was a bug when I tried to launch swift on Surveyor. Justin fixed the bug and asked me to use the swift executables from his home directory. May be this fix is not available in the trunk ?) 
>>>> >>>> Thank you >>>> Emalayan >>>> >>>> From: Emalayan Vairavanathan >>>> To: Jonathan Monette ; "emalayan at ece.ubc.ca" >>>> Cc: Justin Wozniak ; matei ; MosaStore >>>> Sent: Monday, 13 February 2012 3:08 PM >>>> Subject: Re: WORKER_INIT_CMD >>>> >>>> Thank you very much Jon. I will ask you if I have questions. >>>> >>>> Regards >>>> Emalayan >>>> >>>> From: Jonathan Monette >>>> To: emalayan at ece.ubc.ca >>>> Cc: Justin Wozniak >>>> Sent: Sunday, 12 February 2012 4:50 PM >>>> Subject: WORKER_INIT_CMD >>>> >>>> Emalayan, >>>> ?? We have now added an environment variable to the worker script.? The variable is called WORKER_INIT_CMD and works like so: >>>> >>>> export WORKER_INIT_CMD= >>>> >>>> The worker will then run this script before entering it's main loop that waits for Swift apps to run.? You have to use manual coasters to use this variable, which I believe you already are doing. >>>> >>>> Let me know if you have any questions about this env variable. >>>> >>>> >>>> >>>> >>>> >>>> >>> >>> >> > > -- > Justin M Wozniak -- Justin M Wozniak -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Tue Feb 28 23:53:21 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 28 Feb 2012 23:53:21 -0600 (CST) Subject: [Swift-devel] Making Swift run on Eureka In-Reply-To: <1552374827.140785.1330493654834.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1663252148.43503.1330494801381.JavaMail.root@zimbra.anl.gov> Thanks, David - that is indeed the mail we were looking for. I think Jon confirmed today that 0.93 now works on Eureka with no changes. Eureka! :) Mike ----- Original Message ----- > From: "David Kelly" > To: "Michael Wilde" > Cc: "Swift Devel" > Sent: Tuesday, February 28, 2012 11:34:14 PM > Subject: Re: [Swift-devel] Making Swift run on Eureka > I don't know the details of this bug, but I remember seeing this email > a few months ago if it helps.. 
> > ---- Original Message ----- > > From: "Paul Rich" > > To: "Michael Wilde" > > Cc: support at alcf.anl.gov, "Robert Jacob" , > > "swift-devel" , "Andrew > > Cherry" > > Sent: Wednesday, June 22, 2011 2:39:02 PM > > Subject: Re: [Swift-devel] [alcf-support #60887] Can Cobalt > > command-line bug on Eureka be fixed? > > Michael, > > > > I wanted to let you know that a recent patch to Cobalt on Eureka > > should allow you to pass command-line arguments into the program > > supplied to the Cobalt job. Let us know if you encounter any further > > difficulties, and I am sorry that this took so long to deploy. > > > > Thank you for your patience, > > > > -- > > Paul Rich > > ALCF Operations -- AIG > > richp at alcf.anl.gov > > > > > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "Paul M. Rich" , "Andrew Cherry" > > > > Cc: "swift-devel" , "Robert Jacob" > > , support at alcf.anl.gov > > Sent: Tuesday, January 11, 2011 7:30:30 PM > > Subject: Re: [alcf-support #60887] Can Cobalt command-line bug on > > Eureka be fixed? > > > > Paul, Andrew, > > > > What I think we're going to do on this from the Swift side is > > temporarily try to use Eureka in a mode where we manually start > > Swift > > workers on the cluster using a batch job. > > > > We'll wait on testing the Swift Cobolt interface (which is different > > than the above) until we hear from you that the bug is fixed and > > ready > > for testing. 
> >
> > So even though it may be many weeks or more away, we'd like to put in
> > our vote for fixing this issue (realizing that you have many other
> > priorities :)
> >
> > Thanks,
> >
> > Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Wed Feb 29 00:43:04 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 29 Feb 2012 00:43:04 -0600 (CST)
Subject: [Swift-devel] Problems running Swift on BG/P
In-Reply-To: <4F4DB5CA.8070301@uchicago.edu>
Message-ID: <1291594982.43559.1330497784865.JavaMail.root@zimbra.anl.gov>

Thanks, Zhao. In this case we are using start-coaster-service, which does start a service on the login nodes. It's a procedure that has been tested and has worked for Justin. But it's failing for Emalayan and I think Jon just verified that it is failing for him as well.

This script does set ZOID_ENABLE_NAT via the cqsub -e option.

I've just verified, with at least a simple cqsub command modeled on what start-coaster-service uses, that with ZOID_ENABLE_NAT=true I am able to ping the login host, and with that variable not set, I cannot.
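
To rule out a missing setting before submitting, the string handed to cqsub -e can be checked directly. This is a sketch only, assuming the ":"-separated NAME=value format that start-coaster-service builds; has_env_setting and env_string are hypothetical names for illustration, not existing Swift or Cobalt commands:

```shell
# Sketch only: has_env_setting is a hypothetical helper, not a Swift or Cobalt tool.
# $1: ":"-separated list of NAME=value pairs; $2: exact NAME=value pair to look for.
has_env_setting() {
    # Wrap the list in ":" so the first and last entries match like the others.
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

env_string="WORKER_LOGGING_LEVEL=DEBUG:ZOID_ENABLE_NAT=true"
if has_env_setting "$env_string" "ZOID_ENABLE_NAT=true"; then
    echo "NAT setting present"
else
    echo "NAT setting missing"
fi
```

This only checks what is handed to cqsub; whether Cobalt forwards it to the job's environment still has to be confirmed from the job side, for example with the ping test.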
I also tested with that variable set in between two other var settings, sandwiched between :'s, as it is in start-coaster-service, and NAT still works:

/usr/bin/cqsub.py -q default -p MTCScienceApps -k zeptoos -t 60 -n 1 -C /home/wilde -E cobalt.17074.stderr -o cobalt.17074.stdout -e WORKER_LOGGING_LEVEL=debug:ZOID_ENABLE_NAT=true:WORKER_INIT_CMD=/home/wilde/bin/worker-init.pl /bin/ping -c 5 172.17.3.12

Command: '/bgsys/drivers/ppcfloor/bin/mpirun' '-host' '172.17.3.1' '-np' '1' '-partition' 'ANL-R00-M1-N02-64' '-mode' 'smp' '-cwd' '/home/wilde' '-exe' '/bin/ping' '-args' '-c 5 172.17.3.12' '-env' 'COBALT_JOBID=273236 WORKER_LOGGING_LEVEL=debug WORKER_INIT_CMD=/home/wilde/bin/worker-init.pl ZOID_ENABLE_NAT=true'

So the behavior we are seeing suggests that somehow in Emalayan's tests, the ZOID_ENABLE_NAT setting is not getting through.

Next I think we need to re-create the problem using the exact scripts and environment, conf, etc that Emalayan is using, and then debug it from there, ideally snapping the cqsub it uses and testing with just that to start with.

Jon said he will do this in the morning, and I think we can nail the problem then.

- Mike

----- Original Message -----
> From: "ZHAO ZHANG"
> To: "Michael Wilde"
> Cc: "Justin M Wozniak" , "Jonathan Monette" , emalayan at ece.ubc.ca, "Matei
> Ripeanu" , "Swift Devel"
> Sent: Tuesday, February 28, 2012 11:21:14 PM
> Subject: Re: [Swift-devel] Problems running Swift on BG/P
> Hi, Mike, All,
>
> Please refer to
> http://wiki.mcs.anl.gov/zeptoos/index.php/FAQ#How_to_open_a_socket_from_a_CN_to_the_outside_world
> for the NAT feature of ZeptoOS.
> It could be enabled in the cqsub command line. Keep in mind that, if we
> use this feature, we have to start a server at the login node, and let
> compute nodes
> connect the server socket. Once the server socket got the connection, it
> can send message back.
> > To access CNs from IO Node, we need to use the tree network, which > range > from 192.168.1.1 to 192.168.1.64. There is a overlay mapping of the > tree > network > and the torus network. But I never figured it out. We could work > around > the problem by login one of the compute nodes, then telnet the torus > network > address. > > An simple example is we could login 192.168.1.64. PS: in any scale, > 192.168.1.68 in the first pset is always the one with Rank 0. From > there, we could login > 12.0.0.2 and etc.. > > best > zhao > > On 2/28/2012 11:07 PM, Michael Wilde wrote: > > Emalayan and I spent a considerable amount of time debugging Swift > > on surveyor tonight. > > > > As far as I can tell, after fixing a few config problems, it seems > > like the workers are unable to connect the coaster service. They > > seem to be trying to connect on the correct address. The workers > > start, and produce logs, but dont seem to make connections. > > > > I noticed the following email thread: > > http://lists.ci.uchicago.edu/pipermail/swift-devel/2010-December/007099.html > > > > which talk about the sites attribute "alcfbgpnat" and state: > > --- > > This code snippet may be of relevance: > > if (settings.getAlcfbgpnat()) { > > spec.addEnvironmentVariable("ZOID_ENABLE_NAT", "true"); > > } > > > > So you should set that env variable for the job if you want NAT. > > --- > > > > Is this being done in the current start-coaster-service job? > > (Presumably needs to be done in the cobalt job?) > > > > We also noticed that Emalayan was unable to follow the standard > > recipe for logging into the compute nodes of a running job. He could > > get to the IOP, but from there, got something like "no route to > > host" when he tried to telnet (or ping?) to the compute nodes. > > > > I'll check on the ZOID_ENABLE_NAT setting, but any thoughts? 
> >
> > Thanks,
> >
> > - Mike
> >

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From ketancmaheshwari at gmail.com Wed Feb 29 22:19:51 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Wed, 29 Feb 2012 22:19:51 -0600
Subject: [Swift-devel] visualize your code as it executes
Message-ID:

This is a nice page that visualizes code as it runs:

http://people.csail.mit.edu/pgbovine/python/tutor.html#mode=edit

Relevant to the try Swift online venture.

(from google+ python stream)

--
Ketan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wilde at mcs.anl.gov Wed Feb 29 22:54:45 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 29 Feb 2012 22:54:45 -0600 (CST)
Subject: [Swift-devel] visualize your code as it executes
In-Reply-To:
Message-ID: <611892001.47880.1330577685077.JavaMail.root@zimbra.anl.gov>

Very nice! I think it's also relevant to Swift documentation and to understanding ExM Swift/Turbine semantics.

- Mike

----- Original Message -----
> From: "Ketan Maheshwari"
> To: "Swift Devel"
> Sent: Wednesday, February 29, 2012 10:19:51 PM
> Subject: [Swift-devel] visualize your code as it executes
> This is a nice page that visualizes code as it runs:
>
>
> http://people.csail.mit.edu/pgbovine/python/tutor.html#mode=edit
>
>
> Relevant to the try Swift online venture.
>
>
> (from google+ python stream)
>
>
>
> --
> Ketan
>
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory