From iraicu at cs.iit.edu  Wed Feb  1 17:43:48 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Wed, 01 Feb 2012 17:43:48 -0600
Subject: [Swift-devel] Call for Workshops: The 9th Int. Conf. on Autonomic
 Computing (ICAC 2012)
Message-ID: <4F29CE34.8020706@cs.iit.edu>

CALL FOR WORKSHOP PROPOSALS

The 9th International Conference on Autonomic Computing (ICAC 2012)

September 17-21, 2012. San Jose, CA, USA
http://icac2012.cs.fiu.edu/

-----------------------------------------------------------------
IMPORTANT DATES
Workshop Proposal Submission: February 10, 2012

-----------------------------------------------------------------
OVERVIEW
ICAC is the leading conference on autonomic computing techniques,
foundations, and applications. Autonomic computing refers to
methods and means for automated management of performance, fault,
security, and configuration with little involvement of users or
administrators. Systems introducing new autonomic features are
becoming increasingly prevalent, motivating research that spans
a variety of areas, from computer systems, networking, software
engineering, and data management to machine learning, control
theory, and bio-inspired computing. ICAC brings together
researchers and practitioners across these disciplines to
address multiple facets of adaptation and self-management in
computing systems and applications from different perspectives.
Autonomic computing solutions are sought for clouds, grids,
data centers, enterprise software, internet services, data
services, smart phones, embedded systems, and sensor networks.
In these environments, resources and applications must be managed
to maximize performance and minimize cost, while maintaining
predictable and reliable behavior in the face of varying
workloads, failures, and malicious threats.

ICAC'12 welcomes proposals for co-located workshops on topics of
interest to the autonomic computing community. Workshop proposals
should be submitted to the Workshop Chair, Fred Douglis
(f.douglis at computer.org) by February 10, 2012. Workshops are
expected to publish proceedings, and should cover areas that
complement the main program.

------------------------------------------------------------------
ORGANIZERS
GENERAL CHAIR: Dejan Milojicic, HP Labs
WORKSHOPS CHAIR: Fred Douglis, EMC

-- 
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel:    1-847-722-0876
Office: 1-312-567-5704
Email:  iraicu at cs.iit.edu
Web:    http://www.cs.iit.edu/~iraicu/
Web:    http://datasys.cs.iit.edu/
=================================================================
=================================================================


From turam at mcs.anl.gov  Fri Feb  3 13:20:05 2012
From: turam at mcs.anl.gov (Thomas Uram)
Date: Fri, 3 Feb 2012 13:20:05 -0600
Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk, coasters,
	ssh-cl:pbs)
Message-ID: <E16BD5F3-6A68-4D7F-9C89-AF9B2F66D8C8@mcs.anl.gov>


I'm encountering a problem using coasters with ssh-cl:pbs in trunk. The first error is as follows:

2012-02-03 13:05:29,823-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=hostname-k2urbkmk - Application exception: null
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
Caused by: org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://206.12.24.2:35836(2)[1544213635: {}]
Caused by: java.net.NoRouteToHostException: No route to host
2012-02-03 13:05:29,875-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=hostname-j2urbkmk - Application exception: null
Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
Caused by: org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://206.12.24.2:35836(3)[1544213635: {}]
Caused by: java.net.NoRouteToHostException: No route to host
2012-02-03 13:05:32,585-0600 WARN  vdl:transferwrapperlog Failed to transfer wrapper log for job hostname-k2urbkmk
2012-02-03 13:05:32,586-0600 DEBUG vdl:transferwrapperlog Exception for wrapper log failure from hostname-20120203-1305-3q1m7jg3/info/k on Bugaboo: null
Caused by: org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: Exception in getFile
Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Failed to retrieve file information about /home/turam/tmp/hostname-20120203-1305-3q1m7jg3/info/k/hostname-k2urbkmk-info
Caused by: org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server refused MLST command (error code 1) [Nested exception message:  Custom message: Unexpected reply: 500-Command failed : System error in stat: No such file or directory
500-A system call failed: No such file or directory
500 End.] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom message: Unexpected reply: 500-Command failed : System error in stat: No such file or directory
500-A system call failed: No such file or directory
500 End.]


The full log file (with embedded sites and tc files) is here:

http://www.mcs.anl.gov/~turam/20120203-1308/hostname-20120203-1305-3q1m7jg3.log

This same scenario worked with Swift 0.93, using ssh:pbs instead (ssh-cl is only available in trunk).

Any help understanding and working around this problem would be great.

Thanks,
Tom Uram


From jonmon at mcs.anl.gov  Fri Feb  3 13:27:32 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 3 Feb 2012 13:27:32 -0600
Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk,
	coasters, ssh-cl:pbs)
In-Reply-To: <E16BD5F3-6A68-4D7F-9C89-AF9B2F66D8C8@mcs.anl.gov>
References: <E16BD5F3-6A68-4D7F-9C89-AF9B2F66D8C8@mcs.anl.gov>
Message-ID: <D3AA8015-2E19-4584-94E4-87673B8585B2@mcs.anl.gov>

I am not sure whether this is the same problem I ran into, but I had to change the X509_USER_PROXY variable.  Normally it is set to /tmp/x509up_u<uid>; I changed it to $HOME/.globus/<proxy_file>.  For some reason, when issuing a command over ssh (for example: ssh jonmon at login.pads.ci.uchicago.edu ls /tmp/), my proxy file was not there, but when I logged into the machine before issuing the ls command, the proxy file was there.  I assumed (not verified) that the /tmp/ directory is not fully configured or mounted when a command is issued over ssh.  Changing the X509_USER_PROXY variable fixed the issue.
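
A minimal sketch (Python) of the lookup described above, assuming only what the message states: X509_USER_PROXY wins when it is set, otherwise the default is /tmp/x509up_u<uid>. The function names resolve_proxy_path and proxy_is_readable are made up for illustration and are not part of Swift, CoG, or Globus.

```python
import os


def resolve_proxy_path(env=None, uid=None):
    """Return X509_USER_PROXY if set, else the default /tmp/x509up_u<uid>."""
    env = os.environ if env is None else env
    uid = os.getuid() if uid is None else uid
    return env.get("X509_USER_PROXY") or f"/tmp/x509up_u{uid}"


def proxy_is_readable(path):
    """True only if the proxy file exists and the current user can read it."""
    return os.path.isfile(path) and os.access(path, os.R_OK)
```

Running both functions once in a login shell and once through a non-interactive ssh command would show whether the two sessions see the same proxy file.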



From turam at mcs.anl.gov  Fri Feb  3 13:35:10 2012
From: turam at mcs.anl.gov (Thomas Uram)
Date: Fri, 3 Feb 2012 13:35:10 -0600
Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk,
	coasters, ssh-cl:pbs)
In-Reply-To: <D3AA8015-2E19-4584-94E4-87673B8585B2@mcs.anl.gov>
References: <E16BD5F3-6A68-4D7F-9C89-AF9B2F66D8C8@mcs.anl.gov>
	<D3AA8015-2E19-4584-94E4-87673B8585B2@mcs.anl.gov>
Message-ID: <7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov>

That doesn't appear to help in my case.

Should the hostname in the URL here concern me?

>>> Caused by: org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://206.12.24.2:35836(2)[1544213635: {}]

>> 




From jonmon at mcs.anl.gov  Fri Feb  3 13:36:53 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 3 Feb 2012 13:36:53 -0600
Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk,
	coasters, ssh-cl:pbs)
In-Reply-To: <7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov>
References: <E16BD5F3-6A68-4D7F-9C89-AF9B2F66D8C8@mcs.anl.gov>
	<D3AA8015-2E19-4584-94E4-87673B8585B2@mcs.anl.gov>
	<7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov>
Message-ID: <32EB9F52-B542-4793-B2AF-C65D0803C9B2@mcs.anl.gov>

What machine are you executing on?  bridled to where?



From hategan at mcs.anl.gov  Fri Feb  3 13:37:07 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 03 Feb 2012 11:37:07 -0800
Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk,
 coasters, ssh-cl:pbs)
In-Reply-To: <7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov>
References: <E16BD5F3-6A68-4D7F-9C89-AF9B2F66D8C8@mcs.anl.gov>
	<D3AA8015-2E19-4584-94E4-87673B8585B2@mcs.anl.gov>
	<7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov>
Message-ID: <1328297827.22991.0.camel@blabla>

On Fri, 2012-02-03 at 13:35 -0600, Thomas Uram wrote:
> That doesn't appear to help in my case.
> 
> Should the hostname in the URL here concern me?

It should. Or rather, the "no route to host" should. Did you set
GLOBUS_HOSTNAME on the client side to the public IP of the client
machine?

Mihael
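
The "no route to host" condition can be checked outside Swift with a plain TCP connect. A minimal sketch (Python), assuming it is run on the machine that has to open the connection and is pointed at the address and port reported in the log; the function name probe and the returned labels are made up for illustration and are not part of Swift or CoG.

```python
import socket


def probe(host, port, timeout=5.0):
    """Attempt a TCP connection and classify the outcome as a short label."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No answer within the timeout (often a firewall silently dropping packets).
        return "timed out"
    except ConnectionRefusedError:
        # Host reachable, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Covers "No route to host" and name-resolution failures, among others.
        return f"failed: {exc}"
```

A result of "refused" would mean the host is reachable and only the port is closed, which is a different problem from the one in the log above.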


From turam at mcs.anl.gov  Fri Feb  3 13:44:16 2012
From: turam at mcs.anl.gov (Thomas Uram)
Date: Fri, 3 Feb 2012 13:44:16 -0600
Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk,
	coasters, ssh-cl:pbs)
In-Reply-To: <1328297827.22991.0.camel@blabla>
References: <E16BD5F3-6A68-4D7F-9C89-AF9B2F66D8C8@mcs.anl.gov>
	<D3AA8015-2E19-4584-94E4-87673B8585B2@mcs.anl.gov>
	<7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov>
	<1328297827.22991.0.camel@blabla>
Message-ID: <25EDAA74-EA67-4D4D-B1CB-3ECF30088186@mcs.anl.gov>

No, I didn't set GLOBUS_HOSTNAME. The address it complains about (206.12.24.2) is publicly reachable. So is the hostname of the machine on which I'm running Swift (fl.ci.uchicago.edu).

I was wondering about the jumble that follows the hostname:port in that URL:

>> Failed to start channel GSSCChannel-https://206.12.24.2:35836(2)[1544213635: {}]




From hategan at mcs.anl.gov  Fri Feb  3 13:54:03 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 03 Feb 2012 11:54:03 -0800
Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk,
 coasters, ssh-cl:pbs)
In-Reply-To: <25EDAA74-EA67-4D4D-B1CB-3ECF30088186@mcs.anl.gov>
References: <E16BD5F3-6A68-4D7F-9C89-AF9B2F66D8C8@mcs.anl.gov>
	<D3AA8015-2E19-4584-94E4-87673B8585B2@mcs.anl.gov>
	<7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov>
	<1328297827.22991.0.camel@blabla>
	<25EDAA74-EA67-4D4D-B1CB-3ECF30088186@mcs.anl.gov>
Message-ID: <1328298843.3200.2.camel@blabla>

On Fri, 2012-02-03 at 13:44 -0600, Thomas Uram wrote:
> No I didn't set GLOBUS_HOSTNAME. The address it complains about
> (206.12.24.2) is publicly reachable. So is the hostname of the machine
> on which I'm running Swift (fl.ci.uchicago.edu).

They should be the same! (i.e. the coaster service tries to connect back
to the machine you're running Swift on).

Can you try setting GLOBUS_HOSTNAME and see what happens?

> 
> 
> I was wondering about the jumble that follows the hostname:port in
> that URL:
> 
> 
> > > Failed to start channel
> > > GSSCChannel-https://206.12.24.2:35836(2)[1544213635: {}]

(2) is the channel ID
[15...] is the channel context
They are not part of the IP address, but part of GSSChannel.toString().
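
The breakdown above can be applied mechanically when scanning logs. A minimal sketch (Python), assuming the channel string always has the shape seen in these log lines (name, dash, URL with host and port, channel ID in parentheses, context in square brackets); the regular expression is inferred from the log output only, not taken from the CoG source, and parse_channel is a made-up name.

```python
import re

# Pattern inferred from log lines such as:
#   GSSCChannel-https://206.12.24.2:35836(2)[1544213635: {}]
CHANNEL_RE = re.compile(
    r"(?P<name>\w+)-(?P<url>\w+://(?P<host>[^:/\s]+):(?P<port>\d+))"
    r"\((?P<channel_id>[^)]*)\)\[(?P<context>[^\]]*)\]"
)


def parse_channel(text):
    """Split a logged channel description into its parts, or return None."""
    match = CHANNEL_RE.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["port"] = int(fields["port"])
    return fields
```

Only the host and port fields describe the network endpoint; the channel ID and context are bookkeeping added by the channel's string representation.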


From turam at mcs.anl.gov  Fri Feb  3 14:02:53 2012
From: turam at mcs.anl.gov (Thomas Uram)
Date: Fri, 3 Feb 2012 14:02:53 -0600
Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk,
	coasters, ssh-cl:pbs)
In-Reply-To: <1328298843.3200.2.camel@blabla>
References: <E16BD5F3-6A68-4D7F-9C89-AF9B2F66D8C8@mcs.anl.gov>
	<D3AA8015-2E19-4584-94E4-87673B8585B2@mcs.anl.gov>
	<7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov>
	<1328297827.22991.0.camel@blabla>
	<25EDAA74-EA67-4D4D-B1CB-3ECF30088186@mcs.anl.gov>
	<1328298843.3200.2.camel@blabla>
Message-ID: <E9B76CEE-CD4F-4C20-BCFA-D1E270AF66AB@mcs.anl.gov>

I have done this without success:

GLOBUS_HOSTNAME=fl.ci.uchicago.edu
GLOBUS_TCP_PORT_RANGE=50000,50100
swiftt -sites.file sites.coasters.xml -tc.file tc.data hostname.swift
Swift trunk swift-r5501 (swift modified locally) cog-r3350 (cog modified locally)

RunID: 20120203-1357-8tekc3f7
Progress:  time: Fri, 03 Feb 2012 13:57:24 -0600
Progress:  time: Fri, 03 Feb 2012 13:57:31 -0600  Selecting site:4  Initializing site shared directory:1  Stage in:1
ssh not set, setting to 'gsissh'
ssh=gsissh
Find: https://206.12.24.2:38675
Find:  keepalive(120), reconnect - https://206.12.24.2:38675
Progress:  time: Fri, 03 Feb 2012 13:57:35 -0600  Selecting site:4  Submitting:1  Submitted:1
Failed to transfer wrapper log for job hostname-1jnudkmk
Progress:  time: Fri, 03 Feb 2012 13:57:38 -0600  Selecting site:3  Stage in:1 Failed but can retry:2
Failed to transfer wrapper log for job hostname-2jnudkmk
Failed to transfer wrapper log for job hostname-4jnudkmk
Progress:  time: Fri, 03 Feb 2012 13:57:54 -0600  Selecting site:3 Failed but can retry:3
Progress:  time: Fri, 03 Feb 2012 13:57:57 -0600  Selecting site:2  Stage in:1 Failed but can retry:3
Failed to transfer wrapper log for job hostname-7jnudkmk
No events in 10s.

Registered futures:
----

Waiting threads:
----

No events in 10s.

Registered futures:
----

Waiting threads:
----

** Ctrl-C here **

Progress:  time: Fri, 03 Feb 2012 13:58:24 -0600  Selecting site:2 Failed but can retry:4
Failed to shut down service https://206.12.24.2:38675
org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://206.12.24.2:38675(6)[69518356: {}]
	at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:103)
	at org.globus.cog.karajan.workflow.service.channels.GSSChannel.start(GSSChannel.java:62)
	at org.globus.cog.karajan.workflow.service.ChannelFactory.newChannel(ChannelFactory.java:55)
	at org.globus.cog.karajan.workflow.service.Client.connect(Client.java:116)
	at org.globus.cog.karajan.workflow.service.Client.newClient(Client.java:72)
	at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:236)
	at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:256)
	at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:217)
	at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager$ServiceReaper.run(ServiceManager.java:430)
Caused by: java.net.NoRouteToHostException: No route to host
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
	at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
	at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
	at java.net.Socket.connect(Socket.java:529)
	at java.net.Socket.connect(Socket.java:478)
	at java.net.Socket.<init>(Socket.java:375)
	at java.net.Socket.<init>(Socket.java:276)
	at org.globus.net.SocketFactory.createSocket(SocketFactory.java:74)
	at org.globus.net.SocketFactory.createSocket(SocketFactory.java:53)
	at org.globus.gsi.gssapi.net.GssSocket.<init>(GssSocket.java:56)
	at org.globus.gsi.gssapi.net.impl.GSIGssSocket.<init>(GSIGssSocket.java:29)
	at org.globus.gsi.gssapi.net.impl.GSIGssSocketFactory.createSocket(GSIGssSocketFactory.java:38)
	at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:89)
	... 8 more


Full log here:
http://www.mcs.anl.gov/~turam/20120203-1401/hostname-20120203-1357-8tekc3f7.log




From hategan at mcs.anl.gov  Fri Feb  3 14:29:52 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 03 Feb 2012 12:29:52 -0800
Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk,
 coasters, ssh-cl:pbs)
In-Reply-To: <E9B76CEE-CD4F-4C20-BCFA-D1E270AF66AB@mcs.anl.gov>
References: <E16BD5F3-6A68-4D7F-9C89-AF9B2F66D8C8@mcs.anl.gov>
	<D3AA8015-2E19-4584-94E4-87673B8585B2@mcs.anl.gov>
	<7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov>
	<1328297827.22991.0.camel@blabla>
	<25EDAA74-EA67-4D4D-B1CB-3ECF30088186@mcs.anl.gov>
	<1328298843.3200.2.camel@blabla>
	<E9B76CEE-CD4F-4C20-BCFA-D1E270AF66AB@mcs.anl.gov>
Message-ID: <1328300992.4145.0.camel@blabla>

Ok, so maybe the ssh-cl provider doesn't properly forward environment
variables. I'll double check that.
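
For context, OpenSSH does not pass arbitrary client-side environment variables to the remote command unless both sides are configured for it (SendEnv/AcceptEnv), so a launcher that needs them remotely usually writes them into the command line itself. A minimal sketch (Python) of that idea; it is not how the ssh-cl provider is implemented, and remote_command and the start-service command are hypothetical names used only for illustration.

```python
import shlex


def remote_command(argv, env):
    """Build a shell command string that sets env vars explicitly via `env`."""
    assignments = [
        f"{name}={shlex.quote(value)}" for name, value in sorted(env.items())
    ]
    quoted_argv = [shlex.quote(arg) for arg in argv]
    return " ".join(["env", *assignments, *quoted_argv])


# Hypothetical usage: the resulting string would be handed to ssh as the
# remote command, e.g. ssh host "<string>".
command = remote_command(
    ["start-service"],  # placeholder, not a real Swift/CoG command
    {"GLOBUS_HOSTNAME": "fl.ci.uchicago.edu", "GLOBUS_TCP_PORT_RANGE": "50000,50100"},
)
```

Quoting each value with shlex.quote keeps values containing spaces or shell metacharacters intact on the remote side.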



From wilde at mcs.anl.gov  Sat Feb  4 11:04:48 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 4 Feb 2012 11:04:48 -0600 (CST)
Subject: [Swift-devel] Fwd: Google Summer of Code 2012 Announced
In-Reply-To: <CADtyVT5muXBMDcV+-7zE9ZGgOi_YkRdYbzm1sFzZcjqL_hBkxQ@mail.gmail.com>
Message-ID: <1201124034.213374.1328375088630.JavaMail.root@zimbra.anl.gov>

----- Forwarded Message -----
From: "Borja Sotomayor" <borja at cs.uchicago.edu>
To: "globus-dev" <globus-dev at ci.uchicago.edu>
Cc: "Michael Wilde" <wilde at mcs.anl.gov>, bresnaha at mcs.anl.gov
Sent: Saturday, February 4, 2012 10:46:08 AM
Subject: Fwd: Google Summer of Code 2012 Announced

Hi all,

fyi, Google Summer of Code 2012 has just been announced. Applications
to become a Mentoring Organization are due on March 9th.


---------- Forwarded message ----------
From: Carol Smith <carols at google.com>
Date: Sat, Feb 4, 2012 at 10:43 AM
Subject: Google Summer of Code 2012 Announced
To: Google Summer of Code Announce
<google-summer-of-code-announce at googlegroups.com>


Hi all,

We're pleased to announce that Google Summer of Code will be happening
for its eighth year this year. Please check out the blog post [1]
about the program and read the FAQs [2] and Timeline [3] on Melange
for more information.

[1] - http://google-opensource.blogspot.com/2012/02/google-summer-of-code-2012-is-on.html
[2] - http://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2012/faqs
[3] - http://www.google-melange.com/gsoc/events/google/gsoc2012

Cheers,
Carol

--
You received this message because you are subscribed to the Google
Groups "Google Summer of Code Announce" group.
To post to this group, send email to
google-summer-of-code-announce at googlegroups.com.
To unsubscribe from this group, send email to
google-summer-of-code-announce+unsubscribe at googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/google-summer-of-code-announce?hl=en.


-- 
Borja Sotomayor

 Researcher, Computation Institute
 Lecturer, Department of Computer Science
 University of Chicago
 http://people.cs.uchicago.edu/~borja/

 Community Manager, OpenNebula project
 http://www.opennebula.org/

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Sat Feb  4 20:16:37 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 04 Feb 2012 18:16:37 -0800
Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk,
 coasters, ssh-cl:pbs)
In-Reply-To: <1328300992.4145.0.camel@blabla>
References: <E16BD5F3-6A68-4D7F-9C89-AF9B2F66D8C8@mcs.anl.gov>
	<D3AA8015-2E19-4584-94E4-87673B8585B2@mcs.anl.gov>
	<7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov>
	<1328297827.22991.0.camel@blabla>
	<25EDAA74-EA67-4D4D-B1CB-3ECF30088186@mcs.anl.gov>
	<1328298843.3200.2.camel@blabla>
	<E9B76CEE-CD4F-4C20-BCFA-D1E270AF66AB@mcs.anl.gov>
	<1328300992.4145.0.camel@blabla>
Message-ID: <1328408197.14297.0.camel@blabla>

Yep, it didn't. Fixed in latest trunk. Let me know if the problem
persists.

On Fri, 2012-02-03 at 12:29 -0800, Mihael Hategan wrote:
> Ok, so maybe the ssh-cl provider doesn't properly forward environment
> variables. I'll double check that.
> 
> On Fri, 2012-02-03 at 14:02 -0600, Thomas Uram wrote:
> > I have done this without success:
> > 
> > GLOBUS_HOSTNAME=fl.ci.uchicago.edu
> > GLOBUS_TCP_PORT_RANGE=50000,50100
> > swiftt -sites.file sites.coasters.xml -tc.file tc.data hostname.swift
> > Swift trunk swift-r5501 (swift modified locally) cog-r3350 (cog modified locally)


From davidk at ci.uchicago.edu  Mon Feb  6 07:56:07 2012
From: davidk at ci.uchicago.edu (David Kelly)
Date: Mon, 6 Feb 2012 07:56:07 -0600 (CST)
Subject: [Swift-devel] merge 0.93 -> trunk
In-Reply-To: <E9900ED0-158C-4489-9D4B-07C8F4CEC218@mcs.anl.gov>
Message-ID: <667869279.105106.1328536567589.JavaMail.root@zimbra-mb2.anl.gov>


For the most part, the tests seem to be going pretty well.

There's a group of tests called language-behaviour/cleanup in which the post-test cleanup scripts are failing. These tests are not in 0.93 for comparison, so I'm not sure whether the problem is with some expected cleanup behavior or with the tests themselves. Does anyone know more about these?

The other failure is related to the sequential iteration script. I believe this is related to some language behavior changes in this release. The script below fails to compile:

---
type counterfile;  
  
app (counterfile t) echo(string m) {   
    echo m stdout=@filename(t);  
}  
  
app (counterfile t) countstep(counterfile i) {  
    wcl @filename(i) @filename(t);  
}  
  
counterfile a[]  <simple_mapper;prefix="sequential_iteration.foldout">;  
  
a[0] = echo("793578934574893");  
  
iterate v {  
  a[v+1] = countstep(a[v]);  
  trace("extract int value ", @extractint(a[v+1]));  
} until (@extractint(a[v+1]) <= 1);  
---

Could not start execution:
	Failed to convert .xml to .kml for sequential_iteration.swift:
	null

Other than those two issues, things look pretty good. All other tests have been passing consistently for the last few days.

David
 
----- Original Message -----
> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> Cc: "David Kelly" <davidk at ci.uchicago.edu>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Monday, January 30, 2012 8:45:06 AM
> Subject: Re: [Swift-devel] merge 0.93 -> trunk
> I am seeing the same error when trying to compile trunk.
> 
> On Jan 29, 2012, at 6:15 PM, Mihael Hategan wrote:
> 
> > Maybe the checkout happened in the middle of a commit?
> >
> > Is anybody seeing this with a clean checkout?
> >
> > On Sun, 2012-01-29 at 16:07 -0600, David Kelly wrote:
> >> It looks like the compile failed and the test did not run last
> >> night. Here is the error I am getting:
> >>
> >>    [javac]
> >>    /swift/swift-trunk/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/LocalTCPService.java:29:
> >>    org.globus.cog.abstraction.coaster.service.LocalTCPService is
> >>    not abstract and does not override abstract method
> >>    registrationReceived(java.lang.String,java.lang.String,org.globus.cog.karajan.workflow.service.channels.KarajanChannel,java.util.Map<java.lang.String,java.lang.String>)
> >>    in org.globus.cog.abstraction.coaster.service.Registering
> >>    [javac] public class LocalTCPService extends GSSService
> >>    implements Registering {
> >>    [javac] ^
> >>    [javac]
> >>    /swift/swift-trunk/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/LocalTCPService.java:64:
> >>    registrationReceived(java.lang.String,java.lang.String,java.lang.String,org.globus.cog.karajan.workflow.service.channels.ChannelContext,java.util.Map<java.lang.String,java.lang.String>)
> >>    in
> >>    org.globus.cog.abstraction.coaster.service.RegistrationManager
> >>    cannot be applied to
> >>    (java.lang.String,java.lang.String,java.lang.String,org.globus.cog.karajan.workflow.service.channels.ChannelContext)
> >>    [javac] registrationManager.registrationReceived(blockid, wid,
> >>    url, cc);
> >>    [javac] ^
> >>    [javac] Note:
> >>    /swift/swift-trunk/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Block.java
> >>    uses or overrides a deprecated API.
> >>    [javac] Note: Recompile with -Xlint:deprecation for details.
> >>    [javac] Note:
> >>    /swift/swift-trunk/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BQPStatusHandler.java
> >>    uses unchecked or unsafe operations.
> >>    [javac] Note: Recompile with -Xlint:unchecked for details.
> >>    [javac] 2 errors
> >>
> >> BUILD FAILED
> >> /swift/swift-trunk/cog/modules/swift/build.xml:73: The following
> >> error occurred while executing this line:
> >> /swift/swift-trunk/cog/mbuild.xml:445: The following error occurred
> >> while executing this line:
> >> /swift/swift-trunk/cog/mbuild.xml:79: The following error occurred
> >> while executing this line:
> >> /swift/swift-trunk/cog/mbuild.xml:52: The following error occurred
> >> while executing this line:
> >> /swift/swift-trunk/cog/modules/swift/dependencies.xml:13: The
> >> following error occurred while executing this line:
> >> /swift/swift-trunk/cog/mbuild.xml:163: The following error occurred
> >> while executing this line:
> >> /swift/swift-trunk/cog/mbuild.xml:168: The following error occurred
> >> while executing this line:
> >> /swift/swift-trunk/cog/modules/provider-coaster/build.xml:59: The
> >> following error occurred while executing this line:
> >> /swift/swift-trunk/cog/mbuild.xml:466: The following error occurred
> >> while executing this line:
> >> /swift/swift-trunk/cog/mbuild.xml:229: Compile failed; see the
> >> compiler error output for details.
> >>
> >>
> >>
> >> ----- Original Message -----
> >>> From: "Michael Wilde" <wilde at mcs.anl.gov>
> >>> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> >>> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> >>> Sent: Sunday, January 29, 2012 10:27:18 AM
> >>> Subject: Re: [Swift-devel] merge 0.93 -> trunk
> >>> Excellent - thanks! David, can you tell us how the nightly tests
> >>> in
> >>> trunk were affected by the integration?
> >>>
> >>> - Mike
> >>>
> >>> ----- Original Message -----
> >>>> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> >>>> To: "Swift Devel" <swift-devel at ci.uchicago.edu>
> >>>> Sent: Saturday, January 28, 2012 11:01:54 PM
> >>>> Subject: [Swift-devel] merge 0.93 -> trunk
> >>>> Did the merge. I still need to do some sanity checks, so it may
> >>>> be
> >>>> shaky
> >>>> at the moment.
> >>>>
> >>>> Mihael
> >>>>
> >>>> _______________________________________________
> >>>> Swift-devel mailing list
> >>>> Swift-devel at ci.uchicago.edu
> >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >>>
> >>> --
> >>> Michael Wilde
> >>> Computation Institute, University of Chicago
> >>> Mathematics and Computer Science Division
> >>> Argonne National Laboratory
> >>>
> >>> _______________________________________________
> >>> Swift-devel mailing list
> >>> Swift-devel at ci.uchicago.edu
> >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel


From hategan at mcs.anl.gov  Mon Feb  6 12:23:28 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Feb 2012 10:23:28 -0800
Subject: [Swift-devel] merge 0.93 -> trunk
In-Reply-To: <667869279.105106.1328536567589.JavaMail.root@zimbra-mb2.anl.gov>
References: <667869279.105106.1328536567589.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1328552608.26929.1.camel@blabla>

Cool. I didn't mess up too much then :)

On Mon, 2012-02-06 at 07:56 -0600, David Kelly wrote:
> For the most part, the tests seem to be going pretty well.
> 
> There's a group of tests called language-behaviour/cleanup in which the post-test cleanup scripts are failing. These tests are not in 0.93 for comparison, so I'm not sure whether the problem is with some expected cleanup behavior or with the tests themselves. Does anyone know more about these?
> 
> The other failure is related to the sequential iteration script. I believe this is related to some language behavior changes in this release. The script below fails to compile:
> 
> ---
> type counterfile;  
>   
> app (counterfile t) echo(string m) {   
>     echo m stdout=@filename(t);  
> }  
>   
> app (counterfile t) countstep(counterfile i) {  
>     wcl @filename(i) @filename(t);  
> }  
>   
> counterfile a[]  <simple_mapper;prefix="sequential_iteration.foldout">;  
>   
> a[0] = echo("793578934574893");  
>   
> iterate v {  
>   a[v+1] = countstep(a[v]);  
>   trace("extract int value ", @extractint(a[v+1]));  
> } until (@extractint(a[v+1]) <= 1);  
> ---
> 
> Could not start execution:
> 	Failed to convert .xml to .kml for sequential_iteration.swift:
> 	null
> 
> Other than those two issues, things look pretty good. All other tests have been passing consistently for the last few days.
> 
> David
>  
> ----- Original Message -----
> > From: "Jonathan Monette" <jonmon at mcs.anl.gov>
> > To: "Mihael Hategan" <hategan at mcs.anl.gov>
> > Cc: "David Kelly" <davidk at ci.uchicago.edu>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> > Sent: Monday, January 30, 2012 8:45:06 AM
> > Subject: Re: [Swift-devel] merge 0.93 -> trunk
> > I am seeing the same error when trying to compile trunk.
> > 
> > On Jan 29, 2012, at 6:15 PM, Mihael Hategan wrote:
> > 
> > > Maybe the checkout happened in the middle of a commit?
> > >
> > > Is anybody seeing this with a clean checkout?
> > >
> > > On Sun, 2012-01-29 at 16:07 -0600, David Kelly wrote:
> > >> It looks like the compile failed and the test did not run last
> > >> night. Here is the error I am getting:
> > >>
> > >>    [javac]
> > >>    /swift/swift-trunk/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/LocalTCPService.java:29:
> > >>    org.globus.cog.abstraction.coaster.service.LocalTCPService is
> > >>    not abstract and does not override abstract method
> > >>    registrationReceived(java.lang.String,java.lang.String,org.globus.cog.karajan.workflow.service.channels.KarajanChannel,java.util.Map<java.lang.String,java.lang.String>)
> > >>    in org.globus.cog.abstraction.coaster.service.Registering
> > >>    [javac] public class LocalTCPService extends GSSService
> > >>    implements Registering {
> > >>    [javac] ^
> > >>    [javac]
> > >>    /swift/swift-trunk/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/LocalTCPService.java:64:
> > >>    registrationReceived(java.lang.String,java.lang.String,java.lang.String,org.globus.cog.karajan.workflow.service.channels.ChannelContext,java.util.Map<java.lang.String,java.lang.String>)
> > >>    in
> > >>    org.globus.cog.abstraction.coaster.service.RegistrationManager
> > >>    cannot be applied to
> > >>    (java.lang.String,java.lang.String,java.lang.String,org.globus.cog.karajan.workflow.service.channels.ChannelContext)
> > >>    [javac] registrationManager.registrationReceived(blockid, wid,
> > >>    url, cc);
> > >>    [javac] ^
> > >>    [javac] Note:
> > >>    /swift/swift-trunk/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Block.java
> > >>    uses or overrides a deprecated API.
> > >>    [javac] Note: Recompile with -Xlint:deprecation for details.
> > >>    [javac] Note:
> > >>    /swift/swift-trunk/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BQPStatusHandler.java
> > >>    uses unchecked or unsafe operations.
> > >>    [javac] Note: Recompile with -Xlint:unchecked for details.
> > >>    [javac] 2 errors
> > >>
> > >> BUILD FAILED
> > >> /swift/swift-trunk/cog/modules/swift/build.xml:73: The following
> > >> error occurred while executing this line:
> > >> /swift/swift-trunk/cog/mbuild.xml:445: The following error occurred
> > >> while executing this line:
> > >> /swift/swift-trunk/cog/mbuild.xml:79: The following error occurred
> > >> while executing this line:
> > >> /swift/swift-trunk/cog/mbuild.xml:52: The following error occurred
> > >> while executing this line:
> > >> /swift/swift-trunk/cog/modules/swift/dependencies.xml:13: The
> > >> following error occurred while executing this line:
> > >> /swift/swift-trunk/cog/mbuild.xml:163: The following error occurred
> > >> while executing this line:
> > >> /swift/swift-trunk/cog/mbuild.xml:168: The following error occurred
> > >> while executing this line:
> > >> /swift/swift-trunk/cog/modules/provider-coaster/build.xml:59: The
> > >> following error occurred while executing this line:
> > >> /swift/swift-trunk/cog/mbuild.xml:466: The following error occurred
> > >> while executing this line:
> > >> /swift/swift-trunk/cog/mbuild.xml:229: Compile failed; see the
> > >> compiler error output for details.
> > >>
> > >>
> > >>
> > >> ----- Original Message -----
> > >>> From: "Michael Wilde" <wilde at mcs.anl.gov>
> > >>> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> > >>> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > >>> Sent: Sunday, January 29, 2012 10:27:18 AM
> > >>> Subject: Re: [Swift-devel] merge 0.93 -> trunk
> > >>> Excellent - thanks! David, can you tell us how the nightly tests
> > >>> in
> > >>> trunk were affected by the integration?
> > >>>
> > >>> - Mike
> > >>>
> > >>> ----- Original Message -----
> > >>>> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > >>>> To: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > >>>> Sent: Saturday, January 28, 2012 11:01:54 PM
> > >>>> Subject: [Swift-devel] merge 0.93 -> trunk
> > >>>> Did the merge. I still need to do some sanity checks, so it may
> > >>>> be
> > >>>> shaky
> > >>>> at the moment.
> > >>>>
> > >>>> Mihael
> > >>>>
> > >>>> _______________________________________________
> > >>>> Swift-devel mailing list
> > >>>> Swift-devel at ci.uchicago.edu
> > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > >>>
> > >>> --
> > >>> Michael Wilde
> > >>> Computation Institute, University of Chicago
> > >>> Mathematics and Computer Science Division
> > >>> Argonne National Laboratory
> > >>>
> > >>> _______________________________________________
> > >>> Swift-devel mailing list
> > >>> Swift-devel at ci.uchicago.edu
> > >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > >
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel


From turam at mcs.anl.gov  Mon Feb  6 17:38:13 2012
From: turam at mcs.anl.gov (Thomas Uram)
Date: Mon, 6 Feb 2012 17:38:13 -0600
Subject: [Swift-devel] Failed to start channel GSSCChannel (trunk,
	coasters, ssh-cl:pbs)
In-Reply-To: <1328408197.14297.0.camel@blabla>
References: <E16BD5F3-6A68-4D7F-9C89-AF9B2F66D8C8@mcs.anl.gov>
	<D3AA8015-2E19-4584-94E4-87673B8585B2@mcs.anl.gov>
	<7A1C49AF-3405-49D1-B114-4355224BAA7D@mcs.anl.gov>
	<1328297827.22991.0.camel@blabla>
	<25EDAA74-EA67-4D4D-B1CB-3ECF30088186@mcs.anl.gov>
	<1328298843.3200.2.camel@blabla>
	<E9B76CEE-CD4F-4C20-BCFA-D1E270AF66AB@mcs.anl.gov>
	<1328300992.4145.0.camel@blabla> <1328408197.14297.0.camel@blabla>
Message-ID: <08E19F6E-C455-4C62-A6A8-7AD85D2F98EE@mcs.anl.gov>

Okay, with this version, my job succeeded:

http://www.mcs.anl.gov/~turam/20120206-1731/hostname-20120206-1603-am03uzbb.log

This requires that GLOBUS_TCP_PORT_RANGE be set properly so the bootstrap service is started where it can be reached.
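
As a minimal sketch of that setup (assuming a POSIX shell on the submit host, and reusing the hostname and port range quoted earlier in this thread; both are site-specific and should be replaced with whatever is actually reachable at your site), the variables can be exported so that child processes inherit them:

```shell
# Values below are the ones quoted earlier in this thread; they are
# site-specific assumptions, not defaults.
export GLOBUS_HOSTNAME=fl.ci.uchicago.edu
export GLOBUS_TCP_PORT_RANGE=50000,50100

# Split "low,high" into its two bounds and sanity-check the order.
port_low=${GLOBUS_TCP_PORT_RANGE%,*}
port_high=${GLOBUS_TCP_PORT_RANGE#*,}
if [ "$port_low" -le "$port_high" ]; then
    range_ok=yes
else
    range_ok=no
fi
echo "callback host $GLOBUS_HOSTNAME, ports $port_low-$port_high, ok=$range_ok"

# Then launch Swift as before, for example:
# swift -sites.file sites.coasters.xml -tc.file tc.data hostname.swift
```

Exporting matters here because a plain assignment on its own line is not passed on to the processes the shell starts.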

I do get the original error message a number of times:

Caused by: org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://206.12.24.2:34724(2)[1625488363: {}]
Caused by: java.net.NoRouteToHostException: No route to host

It seems to start and stop the coaster service on a variety of ports, one of which eventually succeeds. I don't have documentation to tell me the open port range on the target cluster (I'll get it), but in the meantime, I've discovered some ports that work. Can I specify the port range to be used for the coaster service? I've seen some discussion on the mailing lists about doing so in the context of "coaster-service". At the moment, I'm just running Swift with the configuration you see in the log above. Can I specify the port in my case, or should I use the "coaster-service" script instead?

Thanks!

Tom


On Feb 4, 2012, at 8:16 PM, Mihael Hategan wrote:

> Yep, it didn't. Fixed in latest trunk. Let me know if the problem
> persists.
> 
> On Fri, 2012-02-03 at 12:29 -0800, Mihael Hategan wrote:
>> Ok, so maybe the ssh-cl provider doesn't properly forward environment
>> variables. I'll double check that.
>> 
>> On Fri, 2012-02-03 at 14:02 -0600, Thomas Uram wrote:
>>> I have done this without success:
>>> 
>>> GLOBUS_HOSTNAME=fl.ci.uchicago.edu
>>> GLOBUS_TCP_PORT_RANGE=50000,50100
>>> swiftt -sites.file sites.coasters.xml -tc.file tc.data hostname.swift
>>> Swift trunk swift-r5501 (swift modified locally) cog-r3350 (cog modified locally)
> 
> 


From jonmon at mcs.anl.gov  Fri Feb 10 15:03:53 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 10 Feb 2012 15:03:53 -0600
Subject: [Swift-devel] tc and sites file debugging
Message-ID: <F3562164-FDA0-43A0-AB3C-32CF2EBD8966@mcs.anl.gov>

What log4j properties do I have to turn on to see what the path is to the tc and sites file I am using in a Swift run?  I keep getting an error saying that the application is not in my tc file for any of the site pool entries.  I just want to see if Swift is grabbing the right files.


From wozniak at mcs.anl.gov  Fri Feb 10 15:14:09 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Fri, 10 Feb 2012 15:14:09 -0600 (Central Standard Time)
Subject: [Swift-devel] tc and sites file debugging
In-Reply-To: <F3562164-FDA0-43A0-AB3C-32CF2EBD8966@mcs.anl.gov>
References: <F3562164-FDA0-43A0-AB3C-32CF2EBD8966@mcs.anl.gov>
Message-ID: <alpine.WNT.2.00.1202101513540.3488@JWOZNIAK-DESK>


Just set:

log4j.logger.swift.textfiles=DEBUG
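
A minimal sketch of applying that setting from the shell. The properties file location is an assumption: a scratch file stands in for it here, and LOG4J_PROPS is a hypothetical variable to point at the real file (in a Swift install this is probably etc/log4j.properties under the Swift directory). The log file name in the last comment is a placeholder as well.

```shell
# LOG4J_PROPS is a hypothetical variable; point it at your real file.
LOG4J_PROPS=${LOG4J_PROPS:-$(mktemp)}
setting='log4j.logger.swift.textfiles=DEBUG'

# Append the setting only if that exact line is not already present.
grep -qxF "$setting" "$LOG4J_PROPS" || echo "$setting" >> "$LOG4J_PROPS"

# Show how many times the line now appears in the file.
grep -cxF "$setting" "$LOG4J_PROPS"

# After a run, search the run log for the files that were read, e.g.:
# grep -i -e sites -e tc.data yourscript-*.log
```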

On Fri, 10 Feb 2012, Jonathan Monette wrote:

> What log4j properties do I have to turn on to see what the path is to 
> the tc and sites file I am using in a Swift run?  I keep getting an 
> error saying that the application is not in my tc file for any of the 
> site pool entries.  I just want to see if Swift is grabbing the right 
> files.
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

-- 
Justin M Wozniak


From jonmon at mcs.anl.gov  Fri Feb 10 15:16:33 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 10 Feb 2012 15:16:33 -0600
Subject: [Swift-devel] tc and sites file debugging
In-Reply-To: <alpine.WNT.2.00.1202101513540.3488@JWOZNIAK-DESK>
References: <F3562164-FDA0-43A0-AB3C-32CF2EBD8966@mcs.anl.gov>
	<alpine.WNT.2.00.1202101513540.3488@JWOZNIAK-DESK>
Message-ID: <F3889308-BDFC-4753-98E9-4F2E2D9832F5@mcs.anl.gov>

It is set.  So, if no sites file or tc file shows up in the logs, that is a pretty good indication that they defaulted to the ones provided by Swift, correct?

On Feb 10, 2012, at 3:14 PM, Justin M Wozniak wrote:

> 
> Just set:
> 
> log4j.logger.swift.textfiles=DEBUG
> 
> On Fri, 10 Feb 2012, Jonathan Monette wrote:
> 
>> What log4j properties do I have to turn on to see what the path is to the tc and sites file I am using in a Swift run?  I keep getting an error saying that the application is not in my tc file for any of the site pool entries.  I just want to see if Swift is grabbing the right files.
>> 
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> 
> -- 
> Justin M Wozniak


From wozniak at mcs.anl.gov  Fri Feb 10 15:19:53 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Fri, 10 Feb 2012 15:19:53 -0600 (Central Standard Time)
Subject: [Swift-devel] tc and sites file debugging
In-Reply-To: <F3889308-BDFC-4753-98E9-4F2E2D9832F5@mcs.anl.gov>
References: <F3562164-FDA0-43A0-AB3C-32CF2EBD8966@mcs.anl.gov>
	<alpine.WNT.2.00.1202101513540.3488@JWOZNIAK-DESK>
	<F3889308-BDFC-4753-98E9-4F2E2D9832F5@mcs.anl.gov>
Message-ID: <alpine.WNT.2.00.1202101519320.3488@JWOZNIAK-DESK>


There should be a message for that case as well.  Which branch are you 
using?

On Fri, 10 Feb 2012, Jonathan Monette wrote:

> It is set.  So, if no sites file or tc file shows up in the logs, that is a pretty good indication that they defaulted to the ones provided by Swift, correct?
>
> On Feb 10, 2012, at 3:14 PM, Justin M Wozniak wrote:
>
>>
>> Just set:
>>
>> log4j.logger.swift.textfiles=DEBUG
>>
>> On Fri, 10 Feb 2012, Jonathan Monette wrote:
>>
>>> What log4j properties do I have to turn on to see what the path is to the tc and sites file I am using in a Swift run?  I keep getting an error saying that the application is not in my tc file for any of the site pool entries.  I just want to see if Swift is grabbing the right files.
>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>>
>> --
>> Justin M Wozniak
>
>

-- 
Justin M Wozniak


From jonmon at mcs.anl.gov  Fri Feb 10 15:20:17 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 10 Feb 2012 15:20:17 -0600
Subject: [Swift-devel] tc and sites file debugging
In-Reply-To: <alpine.WNT.2.00.1202101519320.3488@JWOZNIAK-DESK>
References: <F3562164-FDA0-43A0-AB3C-32CF2EBD8966@mcs.anl.gov>
	<alpine.WNT.2.00.1202101513540.3488@JWOZNIAK-DESK>
	<F3889308-BDFC-4753-98E9-4F2E2D9832F5@mcs.anl.gov>
	<alpine.WNT.2.00.1202101519320.3488@JWOZNIAK-DESK>
Message-ID: <DF57F048-0F16-444E-B784-97730FC375BB@mcs.anl.gov>

0.93

On Feb 10, 2012, at 3:19 PM, Justin M Wozniak wrote:

> 
> There should be a message for that case as well.  Which branch are you using?
> 
> On Fri, 10 Feb 2012, Jonathan Monette wrote:
> 
>> It is set.  So, if no sites file or tc file shows up in the logs, that is a pretty good indication that they defaulted to the ones provided by Swift, correct?
>> 
>> On Feb 10, 2012, at 3:14 PM, Justin M Wozniak wrote:
>> 
>>> 
>>> Just set:
>>> 
>>> log4j.logger.swift.textfiles=DEBUG
>>> 
>>> On Fri, 10 Feb 2012, Jonathan Monette wrote:
>>> 
>>>> What log4j properties do I have to turn on to see what the path is to the tc and sites file I am using in a Swift run?  I keep getting an error saying that the application is not in my tc file for any of the site pool entries.  I just want to see if Swift is grabbing the right files.
>>>> 
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>>> 
>>> --
>>> Justin M Wozniak
>> 
>> 
> 
> -- 
> Justin M Wozniak


From wozniak at mcs.anl.gov  Fri Feb 10 15:41:45 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Fri, 10 Feb 2012 15:41:45 -0600 (Central Standard Time)
Subject: [Swift-devel] tc and sites file debugging
In-Reply-To: <DF57F048-0F16-444E-B784-97730FC375BB@mcs.anl.gov>
References: <F3562164-FDA0-43A0-AB3C-32CF2EBD8966@mcs.anl.gov>
	<alpine.WNT.2.00.1202101513540.3488@JWOZNIAK-DESK>
	<F3889308-BDFC-4753-98E9-4F2E2D9832F5@mcs.anl.gov>
	<alpine.WNT.2.00.1202101519320.3488@JWOZNIAK-DESK>
	<DF57F048-0F16-444E-B784-97730FC375BB@mcs.anl.gov>
Message-ID: <alpine.WNT.2.00.1202101540210.3488@JWOZNIAK-DESK>


Using branches/release-0.93, I find that the sites and tc files are 
included in the log.  If you use the default location you just get the 
path name.

On Fri, 10 Feb 2012, Jonathan Monette wrote:

> 0.93
>
> On Feb 10, 2012, at 3:19 PM, Justin M Wozniak wrote:
>
>>
>> There should be a message for that case as well.  Which branch are you using?
>>
>> On Fri, 10 Feb 2012, Jonathan Monette wrote:
>>
>>> It is set.  So, if no sites file or tc file shows up in the logs, that is a pretty good indication that they defaulted to the ones provided by Swift, correct?
>>>
>>> On Feb 10, 2012, at 3:14 PM, Justin M Wozniak wrote:
>>>
>>>>
>>>> Just set:
>>>>
>>>> log4j.logger.swift.textfiles=DEBUG
>>>>
>>>> On Fri, 10 Feb 2012, Jonathan Monette wrote:
>>>>
>>>>> What log4j properties do I have to turn on to see what the path is to the tc and sites file I am using in a Swift run?  I keep getting an error saying that the application is not in my tc file for any of the site pool entries.  I just want to see if Swift is grabbing the right files.
>>>>>
>>>>> _______________________________________________
>>>>> Swift-devel mailing list
>>>>> Swift-devel at ci.uchicago.edu
>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>>>>
>>>> --
>>>> Justin M Wozniak
>>>
>>>
>>
>> --
>> Justin M Wozniak
>
>

-- 
Justin M Wozniak


From jonmon at mcs.anl.gov  Fri Feb 10 16:08:10 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 10 Feb 2012 16:08:10 -0600
Subject: [Swift-devel] tc and sites file debugging
In-Reply-To: <alpine.WNT.2.00.1202101540210.3488@JWOZNIAK-DESK>
References: <F3562164-FDA0-43A0-AB3C-32CF2EBD8966@mcs.anl.gov>
	<alpine.WNT.2.00.1202101513540.3488@JWOZNIAK-DESK>
	<F3889308-BDFC-4753-98E9-4F2E2D9832F5@mcs.anl.gov>
	<alpine.WNT.2.00.1202101519320.3488@JWOZNIAK-DESK>
	<DF57F048-0F16-444E-B784-97730FC375BB@mcs.anl.gov>
	<alpine.WNT.2.00.1202101540210.3488@JWOZNIAK-DESK>
Message-ID: <AFB51689-11E9-4F66-9278-B7460268E2FE@mcs.anl.gov>

So this turned out to be a sites file xml formatting issue.  Filed as bug 732.

On Feb 10, 2012, at 3:41 PM, Justin M Wozniak wrote:

> 
> Using branches/release-0.93, I find that the sites and tc files are included in the log.  If you use the default location you just get the path name.
> 
> On Fri, 10 Feb 2012, Jonathan Monette wrote:
> 
>> 0.93
>> 
>> On Feb 10, 2012, at 3:19 PM, Justin M Wozniak wrote:
>> 
>>> 
>>> There should be a message for that case as well.  Which branch are you using?
>>> 
>>> On Fri, 10 Feb 2012, Jonathan Monette wrote:
>>> 
>>>> It is set.  So, if no sites file or tc file shows up in the logs, that is a pretty good indication that they defaulted to the ones provided by Swift, correct?
>>>> 
>>>> On Feb 10, 2012, at 3:14 PM, Justin M Wozniak wrote:
>>>> 
>>>>> 
>>>>> Just set:
>>>>> 
>>>>> log4j.logger.swift.textfiles=DEBUG
>>>>> 
>>>>> On Fri, 10 Feb 2012, Jonathan Monette wrote:
>>>>> 
>>>>>> What log4j properties do I have to turn on to see what the path is to the tc and sites file I am using in a Swift run?  I keep getting an error saying that the application is not in my tc file for any of the site pool entries.  I just want to see if Swift is grabbing the right files.
>>>>>> 
>>>>>> _______________________________________________
>>>>>> Swift-devel mailing list
>>>>>> Swift-devel at ci.uchicago.edu
>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>>>>> 
>>>>> --
>>>>> Justin M Wozniak
>>>> 
>>>> 
>>> 
>>> --
>>> Justin M Wozniak
>> 
>> 
> 
> -- 
> Justin M Wozniak


From wilde at mcs.anl.gov  Fri Feb 10 16:08:54 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 10 Feb 2012 16:08:54 -0600 (CST)
Subject: [Swift-devel] Useful guide to Cray PBS submit files
Message-ID: <1867549921.236830.1328911734948.JavaMail.root@zimbra.anl.gov>

http://www.nersc.gov/users/computational-systems/hopper/running-jobs/example-batch-scripts/


From wilde at mcs.anl.gov  Mon Feb 13 10:02:57 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 13 Feb 2012 10:02:57 -0600 (CST)
Subject: [Swift-devel] Does anyone have a newer/working svn on Beagle?
In-Reply-To: <2143442693.240566.1329148680202.JavaMail.root@zimbra.anl.gov>
Message-ID: <520551894.240585.1329148977478.JavaMail.root@zimbra.anl.gov>

Hi All,

Is there a more recent (1.6++) version of svn available on Beagle than the default 1.5.7? If not, can anyone install one?

If not, I'll file this as a Beagle ticket.

Thanks,

- Mike

I get this when trying to use svn on a dir checked out with 1.6:

login2$ svn up

svn: This client is too old to work with working copy '.'.  You need
to get a newer Subversion client, or to downgrade this working copy.
See http://subversion.tigris.org/faq.html#working-copy-format-change
for details.

login2$ svn --version

svn, version 1.5.7 (r36142)
   compiled Jun  7 2011, 12:23:36

login2$ which svn
/usr/bin/svn
login2$ 
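
The mismatch above can be detected up front. The following is only a sketch, assuming the `svn --version` output format shown in the session above; the helper names and the 1.6 threshold are illustrative and not part of any Swift or Subversion tooling.

```python
import re


def parse_svn_version(version_output):
    """Extract (major, minor, patch) from `svn --version` output."""
    match = re.search(r"version (\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no svn version found in output")
    return tuple(int(part) for part in match.groups())


def client_too_old(version_output, required=(1, 6)):
    """True if the client is older than the version that made the checkout.

    `required` is an assumption: the version of the client that created
    the working copy (1.6 in the case described above).
    """
    return parse_svn_version(version_output)[:len(required)] < required


# Output copied from the login2 session above.
BEAGLE_OUTPUT = "svn, version 1.5.7 (r36142)\n   compiled Jun  7 2011, 12:23:36\n"
```

Feeding it the captured output of `svn --version` on each machine would show which hosts can safely touch a given checkout.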


From benc at hawaga.org.uk  Mon Feb 13 14:43:05 2012
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 13 Feb 2012 20:43:05 +0000
Subject: [Swift-devel] Does anyone have a newer/working svn on Beagle?
In-Reply-To: <520551894.240585.1329148977478.JavaMail.root@zimbra.anl.gov>
References: <520551894.240585.1329148977478.JavaMail.root@zimbra.anl.gov>
Message-ID: <5C8731F4-5D8F-455D-8834-99F867500E69@hawaga.org.uk>


On Feb 13, 2012, at 4:02 PM, Michael Wilde wrote:

> Hi All,
> 
> Is there a more recent (1.6++) version of svn available on Beagle than the default 1.5.7? If not, can anyone install one?
> 
> If not, I'll file this as a Beagle ticket.
> 

I think you can work around this by making the original checkout with that (older) version of SVN. It's bugged me in the past a few times when I've moved svn checkouts from one machine to another with NFS or rsync.

Ben



From wilde at mcs.anl.gov  Wed Feb 15 22:30:56 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 15 Feb 2012 22:30:56 -0600 (CST)
Subject: [Swift-devel] Beagle swift module out of date
Message-ID: <2043049884.10276.1329366656393.JavaMail.root@zimbra.anl.gov>

Why is the Beagle swift module loading RC5?

login2$ module load swift
Swift version swift-0.93RC5 loaded
login2$ which swift
/soft/swift/swift-0.93RC5/bin/swift
login2$ 


- Mike

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From davidk at ci.uchicago.edu  Thu Feb 16 01:38:38 2012
From: davidk at ci.uchicago.edu (David Kelly)
Date: Thu, 16 Feb 2012 01:38:38 -0600 (CST)
Subject: [Swift-devel] Beagle swift module out of date
In-Reply-To: <2043049884.10276.1329366656393.JavaMail.root@zimbra.anl.gov>
Message-ID: <1121747705.121490.1329377918828.JavaMail.root@zimbra-mb2.anl.gov>

Beagle should be using the 0.93 release now. I'll try to update the other CI/ANL machines tomorrow.



From wilde at mcs.anl.gov  Fri Feb 17 08:23:48 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 17 Feb 2012 08:23:48 -0600 (CST)
Subject: [Swift-devel] Agenda for Swift devel meeting today
In-Reply-To: <605162052.2791.1329487757438.JavaMail.root@zimbra.anl.gov>
Message-ID: <1247532943.2827.1329488628397.JavaMail.root@zimbra.anl.gov>

Here's what I have so far for discussion today. Please add info, points, or more topics.

- coaster provider staging timeouts
  -- reproduce on the same topology where the SCEC bugs occurred
  -- discuss and test: do we need TCP window control?
  -- longer term: test how gridftp works in the same topology

- coaster timeouts
  -- execution doesn't continue and recover on coaster worker walltime expiration
  -- subtler bug: failing retryable jobs have a strange interaction with the hang checker
     (still need to reproduce this in a test case; lower prio)

- hang checker: can we help users diagnose these faster/easier?
  -- what's in the current log for this?

- IO strategy improvements
  -- CDM as a default
  -- provider staging selectable
  -- staging via worker-side transfer client (esp. globus-url-copy)

- BG/P
  -- what are known issues?
  -- test problems with _concurrent mapping

- gensites
  -- also allow cmd line setting of params
  -- SciColSim suggests we should generalize its run script into "swiftrun"
  -- next steps on tc.data -> apps
     typically find apps in path
     more wildcards to reduce need to set this file
     interaction with sites file

- tryswift
  -- report from David on FutureGrid execution environment for this
  -- obstacles?

- Please suggest additional topics!

Thanks,

- Mike


From hategan at mcs.anl.gov  Sat Feb 18 18:07:03 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 18 Feb 2012 16:07:03 -0800
Subject: [Swift-devel] emails
Message-ID: <1329610023.25129.1.camel@blabla>

Hmm, so my otherwise very reliable (until now) email notifier
stopped working yesterday or so. It took me a while to start wondering
why I wasn't seeing any new emails. So sorry for not replying to things
yesterday and today so far.

Mihael


From jonmon at mcs.anl.gov  Sun Feb 19 18:08:10 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Sun, 19 Feb 2012 18:08:10 -0600
Subject: [Swift-devel] Walltime exceeded error
Message-ID: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>

Hello,
   I have spent the better part of today trying to reproduce the maxwalltime issue we have been seeing.  My most recent run is at /home/jonmon/PADS/Swift/tests/catsnsleep

This run does not reproduce the issue.  In fact, it shows that the workers shut down and the restart takes over.  It also shows 120 failed jobs, but I believe that is because the retries were exceeded on those jobs.

The run where the issue was seen was on Beagle and is located at /home/jonmon/public_html/Swift/bugs/SciColSim/run002.  There is a log file in that directory that you should be able to view to see the issue, and perhaps to clarify why the execution just hung and made no progress.  We thought that the job would be killed and then retried once its run time exceeded the wall time we provided.  It looks like the job was killed but was not restarted.  The script is very complicated, but it does reproduce the issue when run long enough.

Maybe Mihael can provide some insight into what was going on in the code when it hung on Beagle.  The hang checker never kicked in, so Swift thought it was making progress when in fact it was not.  Perhaps this issue is Beagle-specific (not sure what that would mean).  I am going to try a run of the same scale on PADS and see if it completes (although it may take longer, as PADS does not have the computing power that Beagle does).

From hategan at mcs.anl.gov  Sun Feb 19 18:14:10 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 19 Feb 2012 16:14:10 -0800
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
References: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
Message-ID: <1329696850.31828.0.camel@blabla>

Thanks. I'll take a look at the logs and see if anything pops up.



From wilde at mcs.anl.gov  Mon Feb 20 09:35:33 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 20 Feb 2012 09:35:33 -0600 (CST)
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
Message-ID: <2037160806.8873.1329752133523.JavaMail.root@zimbra.anl.gov>

Jon, can you try another run on PADS with these changes:

- 1 slot instead of 192 to keep the log much smaller
- n=20 instead of 1000 (ditto)
- t=70 to make sure that the app() runtime exceeds the specified maxwalltime by a clear margin
- local:pbs instead of ssh:pbs to stay closer to the config where the problem occurred
- Beagle if possible (one node in the scalability or development queue) and the same Java as used in the failing case

Mike

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From jonmon at mcs.anl.gov  Mon Feb 20 10:11:24 2012
From: jonmon at mcs.anl.gov (Jonathan)
Date: Mon, 20 Feb 2012 10:11:24 -0600
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <2037160806.8873.1329752133523.JavaMail.root@zimbra.anl.gov>
References: <2037160806.8873.1329752133523.JavaMail.root@zimbra.anl.gov>
Message-ID: <5A90CDC9-7516-4D4D-A407-B347E2CB17CB@mcs.anl.gov>

Yes.  I will. 




From hategan at mcs.anl.gov  Mon Feb 20 16:11:16 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 20 Feb 2012 14:11:16 -0800
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
References: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
Message-ID: <1329775876.6072.1.camel@blabla>

I can't log in to beagle. Can you move them to some place where I can
access them?



From jonmon at mcs.anl.gov  Mon Feb 20 16:14:19 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 20 Feb 2012 16:14:19 -0600
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <1329775876.6072.1.camel@blabla>
References: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
	<1329775876.6072.1.camel@blabla>
Message-ID: <A33523AD-B805-4D38-A3DD-4E419922E6B8@mcs.anl.gov>

/gpfs/pads/swift/jonmon/Swift/tests/catsnsleep        <----- on /gpfs/pads
/home/jonmon/public_html/Swift/bugs/SciColSim/run002  <----- on any CI machine



From hategan at mcs.anl.gov  Mon Feb 20 16:16:34 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 20 Feb 2012 14:16:34 -0800
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <A33523AD-B805-4D38-A3DD-4E419922E6B8@mcs.anl.gov>
References: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
	<1329775876.6072.1.camel@blabla>
	<A33523AD-B805-4D38-A3DD-4E419922E6B8@mcs.anl.gov>
Message-ID: <1329776194.6072.2.camel@blabla>

On Mon, 2012-02-20 at 16:14 -0600, Jonathan Monette wrote:
> /gpfs/pads/swift/jonmon/Swift/tests/catsnsleep                                 <----- on /gpfs/pads
> /home/jonmon/public_html/Swift/bugs/SciColSim/run002             <----- on any CI machine

Ok. Sorry. I thought the last one was on beagle.


From jonmon at mcs.anl.gov  Mon Feb 20 16:19:45 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 20 Feb 2012 16:19:45 -0600
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <1329776194.6072.2.camel@blabla>
References: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
	<1329775876.6072.1.camel@blabla>
	<A33523AD-B805-4D38-A3DD-4E419922E6B8@mcs.anl.gov>
	<1329776194.6072.2.camel@blabla>
Message-ID: <BE11201C-E18A-4C99-89F4-A5EFC5467A80@mcs.anl.gov>

No.  The last run was run using Beagle, and it is the more interesting one.  It shows that jobs failed, but the "Failed but can retry" count was not printed very often; you can see that in the swift.out file.  Eventually the workflow just hung and the hang checker kicked in.  You can also see that Swift got stuck in the initializing state with a count of 61.



From hategan at mcs.anl.gov  Mon Feb 20 16:24:12 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 20 Feb 2012 14:24:12 -0800
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <BE11201C-E18A-4C99-89F4-A5EFC5467A80@mcs.anl.gov>
References: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
	<1329775876.6072.1.camel@blabla>
	<A33523AD-B805-4D38-A3DD-4E419922E6B8@mcs.anl.gov>
	<1329776194.6072.2.camel@blabla>
	<BE11201C-E18A-4C99-89F4-A5EFC5467A80@mcs.anl.gov>
Message-ID: <1329776652.6072.3.camel@blabla>

I'm not sure if I asked this, but did you happen to get a jstack of the
hanging swift?
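
For whoever hits the hang next, here is a minimal sketch of capturing such a thread dump. It assumes the JDK's `jstack` tool is on the PATH and that the process ID of the Swift JVM is already known (for example from `jps -l`). The function names and the default output file name are made up for illustration.

```python
import shutil
import subprocess


def jstack_command(pid):
    """Build the command line for dumping the threads of one JVM."""
    return ["jstack", str(pid)]


def dump_threads(pid, out_path="swift.jstack"):
    """Write a thread dump of the JVM with the given PID to out_path.

    out_path is a hypothetical default; any writable path works.
    """
    if shutil.which("jstack") is None:
        raise RuntimeError("jstack not found on PATH; a full JDK is needed")
    result = subprocess.run(jstack_command(pid), capture_output=True,
                            text=True, check=True)
    with open(out_path, "w") as handle:
        handle.write(result.stdout)
    return out_path
```

Taking one dump while the run is stuck and a second one a minute or so later gives two snapshots to compare, which can help tell a true deadlock from slow progress.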



From jonmon at mcs.anl.gov  Mon Feb 20 16:26:49 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 20 Feb 2012 16:26:49 -0600
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <1329776652.6072.3.camel@blabla>
References: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
	<1329775876.6072.1.camel@blabla>
	<A33523AD-B805-4D38-A3DD-4E419922E6B8@mcs.anl.gov>
	<1329776194.6072.2.camel@blabla>
	<BE11201C-E18A-4C99-89F4-A5EFC5467A80@mcs.anl.gov>
	<1329776652.6072.3.camel@blabla>
Message-ID: <C937F201-2F8A-4A4F-9CB6-2769899F198B@mcs.anl.gov>

No.  This was a run Ketan did a while back.  I have been using this as a reference when trying to re-create the issue with a simple catsnsleep job.

This run was also done on Beagle using the pre-installed java package, which does not have jstack.



From jonmon at mcs.anl.gov  Mon Feb 20 16:27:30 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 20 Feb 2012 16:27:30 -0600
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <C937F201-2F8A-4A4F-9CB6-2769899F198B@mcs.anl.gov>
References: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
	<1329775876.6072.1.camel@blabla>
	<A33523AD-B805-4D38-A3DD-4E419922E6B8@mcs.anl.gov>
	<1329776194.6072.2.camel@blabla>
	<BE11201C-E18A-4C99-89F4-A5EFC5467A80@mcs.anl.gov>
	<1329776652.6072.3.camel@blabla>
	<C937F201-2F8A-4A4F-9CB6-2769899F198B@mcs.anl.gov>
Message-ID: <2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov>

Correction: Beagle does have jstack.  I do not know why I thought it did not.



From ketancmaheshwari at gmail.com  Mon Feb 20 16:42:10 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 20 Feb 2012 16:42:10 -0600
Subject: [Swift-devel] cybershake hangs in the latest Swift 0.93 does not
	hang in a previous rel
Message-ID: <CAMUuviqtghL8rvTarPDeE3fdW1tTLtVymic7Ju0uShwCYcTaAQ@mail.gmail.com>

Mihael,

Reporting a case of deadlock/hang occurring in the recent swift 0.93 update:

I've been working on the cybershake script with David today, and the
script hangs at around the same point on David's Swift installation,
which is:
Swift 0.93 swift-r5658 cog-r3361

I successfully tested the same configuration with my Swift installation,
which is a bit older release:
Swift 0.93 swift-r5609 (swift modified locally) cog-r3361 (cog modified
locally)

The log for the hung run is:
http://ci.uchicago.edu/~ketan/postproc-20120220-1617-hvcmjs71.log

The jstack for the hung run is:
http://ci.uchicago.edu/~ketan/cybershake.jstack

The log for the successful run is:
http://ci.uchicago.edu/~ketan/postproc-20120220-1454-lfog5xu1.log

Regards,
-- 
Ketan

From jonmon at utexas.edu  Tue Feb 21 12:56:23 2012
From: jonmon at utexas.edu (Jonathan Monette)
Date: Tue, 21 Feb 2012 12:56:23 -0600
Subject: [Swift-devel] Command Reply Timeout
Message-ID: <DA61DB2E-5C15-4EFD-BEBC-097440245B9A@utexas.edu>

What does this mean?
   Command Command(54, HEARTBEAT): handling reply timeout; sendReqTime=120221-185033.459, sendTime=700101-000000.000, now=120221-185233.649, channel=SC-0221-330346-000016-000001

I see these lines sprinkled throughout my swift run, and in the logs they are at log4j level WARN.  What is it trying to tell me?  Should I be worried?  I cannot tell if my workflow is making progress or not.  It looked like it was, even with these messages popping up, but now I am not sure.  What is the above line saying?

From hategan at mcs.anl.gov  Tue Feb 21 13:00:57 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 21 Feb 2012 11:00:57 -0800
Subject: [Swift-devel] Command Reply Timeout
In-Reply-To: <DA61DB2E-5C15-4EFD-BEBC-097440245B9A@utexas.edu>
References: <DA61DB2E-5C15-4EFD-BEBC-097440245B9A@utexas.edu>
Message-ID: <1329850857.17237.0.camel@blabla>

It's saying that a connection between the coaster service and a worker
isn't going quite right.
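
For reading the timestamps in that message, here is a rough sketch. It assumes the fields are formatted as yyMMdd-HHmmss.SSS, which is inferred from the sample line rather than taken from the coaster source, and the function names are hypothetical. Under that assumption, sendTime=700101-000000.000 is the Unix epoch, which may mean the send time was never recorded.

```python
import re
from datetime import datetime

STAMP_FORMAT = "%y%m%d-%H%M%S.%f"  # assumed layout: yyMMdd-HHmmss.SSS


def parse_fields(line):
    """Return the key=value timestamp fields of a reply-timeout line."""
    fields = dict(re.findall(r"(\w+)=(\d{6}-\d{6}\.\d{3})", line))
    return {key: datetime.strptime(value, STAMP_FORMAT)
            for key, value in fields.items()}


def seconds_waited(line):
    """Seconds between the send request and the timeout being handled."""
    stamps = parse_fields(line)
    return (stamps["now"] - stamps["sendReqTime"]).total_seconds()


# Log line copied from the question in this thread.
SAMPLE = ("Command Command(54, HEARTBEAT): handling reply timeout; "
          "sendReqTime=120221-185033.459, sendTime=700101-000000.000, "
          "now=120221-185233.649, channel=SC-0221-330346-000016-000001")
```

For the sample line this gives a wait of about 120 seconds between sendReqTime and now, which looks like a two-minute reply timeout, though the actual timeout setting is not confirmed here.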



From jonmon at mcs.anl.gov  Tue Feb 21 13:08:41 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Tue, 21 Feb 2012 13:08:41 -0600
Subject: [Swift-devel] Command Reply Timeout
In-Reply-To: <1329850857.17237.0.camel@blabla>
References: <DA61DB2E-5C15-4EFD-BEBC-097440245B9A@utexas.edu>
	<1329850857.17237.0.camel@blabla>
Message-ID: <798DD6A5-30AA-4273-A792-263D9D792E4C@mcs.anl.gov>

I see... thanks.  I will figure out what happened.



From davidk at ci.uchicago.edu  Wed Feb 22 09:12:14 2012
From: davidk at ci.uchicago.edu (David Kelly)
Date: Wed, 22 Feb 2012 09:12:14 -0600 (CST)
Subject: [Swift-devel] cybershake hangs in the latest Swift 0.93 does
 not	hang in a previous rel
In-Reply-To: <CAMUuviqtghL8rvTarPDeE3fdW1tTLtVymic7Ju0uShwCYcTaAQ@mail.gmail.com>
Message-ID: <1865835731.130174.1329923534838.JavaMail.root@zimbra-mb2.anl.gov>

I changed to r5609 to match Ketan's working version, but am still getting the same errors.

[davidk@communicado run]$ swift -version
no sites file specified, setting to default: /home/davidk/swift-0.93/cog/modules/swift/dist/swift-svn/etc/sites.xml
Swift 0.93 swift-r5609 cog-r3361

I'll dig through the logs a bit and see if I can narrow it down. Here is the message I get via stdout.
 
No events in 10s.

Registered futures:
string[] var_str  Closed, 242 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 242 elements, no listeners
string[] var_str  Closed, 18 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 32 elements, no listeners
string[] var_str  Closed, 32 elements, no listeners
string[] var_str  Closed, 2 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 18 elements, no listeners
string[] var_str  Closed, 50 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 200 elements, no listeners
string[] var_str  Closed, 2 elements, no listeners
string[] var_str  Closed, 242 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 18 elements, no listeners
string[] var_str  Closed, 50 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 18 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 18 elements, no listeners
string[] var_str  Closed, 18 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 50 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 2 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 18 elements, no listeners
SgtDim sgt_var - F/sgt_var..y:SgtDim - Open
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 128 elements, no listeners
string[] var_str  Closed, 18 elements, no listeners
string[] var_str  Closed, 72 elements, no listeners
string[] var_str  Closed, 2 elements, no listeners
string[] var_str  Closed, 50 elements, no listeners
string[] var_str  Closed, 32 elements, no listeners
string[] var_str  Closed, 50 elements, no listeners
string[] var_str  Closed, 128 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 32 elements, no listeners
string[] var_str  Closed, 2 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 72 elements, no listeners
string[] var_str  Closed, 200 elements, no listeners
string[] var_str  Closed, 32 elements, no listeners
string[] var_str  Closed, 18 elements, no listeners
string[] var_str  Closed, 32 elements, no listeners
string[] var_str  Closed, 200 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
string[] var_str  Closed, 50 elements, no listeners
string[] var_str  Closed, 50 elements, no listeners
string[] var_str  Closed, 50 elements, no listeners
string[] var_str  Closed, 32 elements, no listeners
string[] var_str  Closed, 8 elements, no listeners
----

Waiting threads:
0-13-118-6
0-13-52-6
0-13-30-6
0-13-139-6
0-13-194-6
0-13-24-6
0-13-127-6
0-13-149-6
0-13-42-6
0-13-138-6
0-13-174-6
0-13-89-6
0-13-36-6
0-13-156-6
0-13-53-6
0-13-152-6
0-13-90-6
0-13-158-6
0-13-132-6
0-13-136-6
0-13-87-6
0-13-92-6
0-13-182-6
0-13-9-6
0-13-29-6
0-13-60-6
0-13-70-6
0-13-12-6
0-13-81-6
0-13-178-6
0-13-49-6
0-13-97-6
0-13-65-6
0-13-145-6
0-13-135-6
0-13-190-6
0-13-11-6
0-13-163-6
0-13-155-6
0-13-16-6
0-13-154-6
0-13-167-6
0-13-173-6
0-13-166-6
0-13-0-6
0-13-191-6
0-13-37-6
0-13-17-6
0-13-85-6
0-13-79-6
0-13-134-6
0-13-176-6
0-13-125-6
0-13-38-6
0-13-187-6
0-13-35-6
0-13-171-6
0-13-88-6
0-13-131-6
0-13-106-6
0-13-55-6
0-13-168-6
0-13-147-6
0-13-148-6
0-13-99-6
0-13-34-6
0-13-2-6
0-13-100-6
0-13-48-6
0-13-5-6
0-13-69-6
0-13-80-6
0-13-153-6
0-13-122-6
0-13-105-6
0-13-113-6
0-13-26-6
0-13-124-6
0-13-32-6
0-13-123-6
0-13-98-6
0-13-170-6
0-13-28-6
0-13-22-6
0-13-162-6
0-13-15-6
0-13-64-6
0-13-13-6
0-13-111-6
0-13-66-6
0-13-43-6
0-13-19-6
0-13-78-6
0-13-157-6
0-13-57-6
0-13-142-6
0-13-151-6
0-13-3-6
0-13-140-6
0-13-76-6
0-13-188-6
0-13-91-6
0-13-75-6
0-13-47-6
0-13-50-6
0-13-41-6
0-13-40-6
0-13-21-6
0-13-193-6
0-13-102-6
0-13-59-6
0-13-189-6
0-13-31-6
0-13-197-6
0-13-110-6
0-13-4-6
0-13-20-6
0-13-185-6
0-13-137-6
0-13-121-6
0-13-180-6
0-13-169-6
0-13-58-6
0-13-116-6
0-13-45-6
0-13-93-6
0-13-146-6
0-13-164-6
0-13-101-6
0-13-179-6
0-13-115-6
0-13-23-6
0-13-94-6
0-13-44-6
0-13-177-6
0-13-10-6
0-13-84-6
0-13-186-6
0-13-150-6
0-13-198-6
0-13-195-6
0-13-14-6
0-13-143-6
0-13-63-6
0-13-77-6
0-13-51-6
0-13-25-6
0-13-172-6
0-13-18-6
0-13-68-6
0-13-159-6
0-13-128-6
0-13-104-6
0-13-141-6
0-13-6-6
0-13-126-6
0-13-108-6
0-13-1-6
0-13-199-6
0-13-175-6
0-13-120-6
0-13-119-6
0-13-192-6
0-13-183-6
0-13-103-6
0-13-133-6
0-13-184-6
0-13-161-6
0-13-196-6
0-13-112-6
0-13-129-6
0-13-33-6
0-13-72-6
0-13-74-6
0-13-39-6
0-13-160-6
0-13-54-6
0-13-117-6
0-13-114-6
0-13-95-6
0-13-165-6
0-13-181-6
0-13-46-6
0-13-27-6
0-13-109-6
0-13-130-6
0-13-144-6
0-13-82-6
0-13-67-6
0-13-8-6
0-13-62-6
0-13-73-6
0-13-56-6
0-13-86-6
0-13-7-6
0-13-107-6
0-13-83-6
0-13-71-6
0-13-96-6
0-13-61-6
----
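
With a dump this long, the entries that are not yet closed are easy to miss (here only the SgtDim sgt_var future is Open). Below is a minimal sketch of a filter for them, assuming the hang checker output has been saved as plain text in the layout shown above. The function name open_futures is hypothetical and not part of Swift.

```python
def open_futures(dump_text):
    """Return entries of the 'Registered futures:' section that are not Closed.

    Assumes the layout printed above: a 'Registered futures:' header,
    one future per line, and a '----' line that ends the section.
    """
    result = []
    in_section = False
    for line in dump_text.splitlines():
        stripped = line.strip()
        if stripped == "Registered futures:":
            in_section = True
            continue
        if in_section and stripped == "----":
            break
        # Keep every non-empty entry that does not report itself as Closed.
        if in_section and stripped and "Closed" not in stripped:
            result.append(stripped)
    return result


sample = """Registered futures:
string[] var_str  Closed, 242 elements, no listeners
SgtDim sgt_var - F/sgt_var..y:SgtDim - Open
string[] var_str  Closed, 8 elements, no listeners
----
"""
print(open_futures(sample))
```

Run against the full dump, this should leave only the sgt_var line, which is the natural place to start looking for what the waiting threads are blocked on.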


----- Original Message -----
> From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> Cc: "David Kelly" <davidkelly999 at gmail.com>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Monday, February 20, 2012 4:42:10 PM
> Subject: [Swift-devel] cybershake hangs in the latest Swift 0.93 does not hang in a previous rel
> Mihael,
> 
> 
> Reporting a case of deadlock/hang occurring in the recent swift 0.93
> update:
> 
> 
> I've been working on the cybershake script with David today and it
> seems that the script hangs at around the same point on David's Swift
> installation which is:
> Swift 0.93 swift-r5658 cog-r3361
> 
> 
> I successfully tested the same configuration with my swift
> installation which is a bit older release:
> Swift 0.93 swift-r5609 (swift modified locally) cog-r3361 (cog
> modified locally)
> 
> 
> 
> The log for the hung version is:
> http://ci.uchicago.edu/~ketan/postproc-20120220-1617-hvcmjs71.log
> 
> 
> The jstack for the hung version is:
> http://ci.uchicago.edu/~ketan/cybershake.jstack
> 
> 
> The log for the successful run is:
> http://ci.uchicago.edu/~ketan/postproc-20120220-1454-lfog5xu1.log
> 
> 
> Regards, --
> Ketan
> 
> 
> 


From jonmon at mcs.anl.gov  Wed Feb 22 15:45:53 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Wed, 22 Feb 2012 15:45:53 -0600
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov>
References: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
	<1329775876.6072.1.camel@blabla>
	<A33523AD-B805-4D38-A3DD-4E419922E6B8@mcs.anl.gov>
	<1329776194.6072.2.camel@blabla>
	<BE11201C-E18A-4C99-89F4-A5EFC5467A80@mcs.anl.gov>
	<1329776652.6072.3.camel@blabla>
	<C937F201-2F8A-4A4F-9CB6-2769899F198B@mcs.anl.gov>
	<2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov>
Message-ID: <62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov>

Mihael,
   I have a hung Java process showing this error right now, 2 jobs are stuck in the initializing state.  I have a jstack -l <pid> of this hung java process.  Is there anything else you need before I kill it?  Do you need any other probing information from this process other than this jstack output?

On Feb 20, 2012, at 4:27 PM, Jonathan Monette wrote:

> Correction, Beagle does have jstack.  Do not know why I thought it did not have it.
> 
> On Feb 20, 2012, at 4:26 PM, Jonathan Monette wrote:
> 
>> No.  This was a run Ketan did a while back.  I have been using this as a reference when trying to re-create the issue with a simple catsnsleep job.
>> 
>> This run was also done on Beagle using the pre-installed java package, which does not have jstack.
>> 
>> On Feb 20, 2012, at 4:24 PM, Mihael Hategan wrote:
>> 
>>> I'm not sure if I asked this, but did you happen to get a jstack of the
>>> hanging swift?
>>> 
>>> On Mon, 2012-02-20 at 16:19 -0600, Jonathan Monette wrote:
>>>> No.  The last run was run using Beagle.  That is the more interesting one.  That shows jobs failed but the "Failed but can retry" count was not printed very often.  You can see that in the swift.out file.  Eventually the workflow just hung and the hang checker kicked in.  You can also see that Swift got stuck in the initializing state with a count of 61.
>>>> 
>>>> On Feb 20, 2012, at 4:16 PM, Mihael Hategan wrote:
>>>> 
>>>>> On Mon, 2012-02-20 at 16:14 -0600, Jonathan Monette wrote:
>>>>>> /gpfs/pads/swift/jonmon/Swift/tests/catsnsleep                                 <----- on /gpfs/pads
>>>>>> /home/jonmon/public_html/Swift/bugs/SciColSim/run002             <----- on any CI machine
>>>>> 
>>>>> Ok. Sorry. I thought the last one was on beagle.
>>>>> 
>>>> 
>>> 
>>> 
>> 


From wilde at mcs.anl.gov  Wed Feb 22 15:56:24 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 22 Feb 2012 15:56:24 -0600 (CST)
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov>
Message-ID: <73065862.21539.1329947784250.JavaMail.root@zimbra.anl.gov>

Hi Jon, I think Mondays Mihael is pretty swamped with school commitments.

The only other thing I can think of grabbing is worker logs, but I doubt that any provision was made to request worker logging for this run.

I'd go ahead and terminate the run.

- Mike

----- Original Message -----
> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Wednesday, February 22, 2012 3:45:53 PM
> Subject: Re: [Swift-devel] Walltime exceeded error
> Mihael,
> I have a hung Java process showing this error right now, 2 jobs are
> stuck in the initializing state. I have a jstack -l <pid> of this hung
> java process. Is there anything else you need before I kill it? Do you
> need any other probing information from this process other than this
> jstack output?
> 
> On Feb 20, 2012, at 4:27 PM, Jonathan Monette wrote:
> 
> > Correction, Beagle does have jstack. Do not know why I thought it
> > did not have it.
> >
> > On Feb 20, 2012, at 4:26 PM, Jonathan Monette wrote:
> >
> >> No. This was a run Ketan did a while back. I have been using this
> >> as a reference when trying to re-create the issue with a simple
> >> catsnsleep job.
> >>
> >> This run was also done on Beagle using the pre-installed java
> >> package, which does not have jstack.
> >>
> >> On Feb 20, 2012, at 4:24 PM, Mihael Hategan wrote:
> >>
> >>> I'm not sure if I asked this, but did you happen to get a jstack
> >>> of the
> >>> hanging swift?
> >>>
> >>> On Mon, 2012-02-20 at 16:19 -0600, Jonathan Monette wrote:
> >>>> No. The last run was run using Beagle. That is the more
> >>>> interesting one. That shows jobs failed but the "Failed but can
> >>>> retry" count was not printed very often. You can see that in the
> >>>> swift.out file. Eventually the workflow just hung and the hang
> >>>> checker kicked in. You can also see that Swift got stuck in the
> >>>> initializing state with a count of 61.
> >>>>
> >>>> On Feb 20, 2012, at 4:16 PM, Mihael Hategan wrote:
> >>>>
> >>>>> On Mon, 2012-02-20 at 16:14 -0600, Jonathan Monette wrote:
> >>>>>> /gpfs/pads/swift/jonmon/Swift/tests/catsnsleep <----- on
> >>>>>> /gpfs/pads
> >>>>>> /home/jonmon/public_html/Swift/bugs/SciColSim/run002 <----- on
> >>>>>> any CI machine
> >>>>>
> >>>>> Ok. Sorry. I thought the last one was on beagle.
> >>>>>
> >>>>
> >>>
> >>>
> >>

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From jonmon at mcs.anl.gov  Wed Feb 22 16:00:34 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Wed, 22 Feb 2012 16:00:34 -0600
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <73065862.21539.1329947784250.JavaMail.root@zimbra.anl.gov>
References: <73065862.21539.1329947784250.JavaMail.root@zimbra.anl.gov>
Message-ID: <5E5C647A-E37F-41D2-8AD4-B5C4135BC609@mcs.anl.gov>

Ok.  I shall kill it.

> Hi Jon, I think Mondays Mihael is pretty swamped with school commitments.
> 
> The only other thing I can think of grabbing is worker logs, but I doubt that any provision was made to request worker logging for this run.
> 
> I'd go ahead and terminate the run.
> 
> - Mike
> 
> ----- Original Message -----
>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>> To: "Mihael Hategan" <hategan at mcs.anl.gov>
>> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
>> Sent: Wednesday, February 22, 2012 3:45:53 PM
>> Subject: Re: [Swift-devel] Walltime exceeded error
>> Mihael,
>> I have a hung Java process showing this error right now, 2 jobs are
>> stuck in the initializing state. I have a jstack -l <pid> of this hung
>> java process. Is there anything else you need before I kill it? Do you
>> need any other probing information from this process other than this
>> jstack output?
>> 
>> On Feb 20, 2012, at 4:27 PM, Jonathan Monette wrote:
>> 
>>> Correction, Beagle does have jstack. Do not know why I thought it
>>> did not have it.
>>> 
>>> On Feb 20, 2012, at 4:26 PM, Jonathan Monette wrote:
>>> 
>>>> No. This was a run Ketan did a while back. I have been using this
>>>> as a reference when trying to re-create the issue with a simple
>>>> catsnsleep job.
>>>> 
>>>> This run was also done on Beagle using the pre-installed java
>>>> package, which does not have jstack.
>>>> 
>>>> On Feb 20, 2012, at 4:24 PM, Mihael Hategan wrote:
>>>> 
>>>>> I'm not sure if I asked this, but did you happen to get a jstack
>>>>> of the
>>>>> hanging swift?
>>>>> 
>>>>> On Mon, 2012-02-20 at 16:19 -0600, Jonathan Monette wrote:
>>>>>> No. The last run was run using Beagle. That is the more
>>>>>> interesting one. That shows jobs failed but the "Failed but can
>>>>>> retry" count was not printed very often. You can see that in the
>>>>>> swift.out file. Eventually the workflow just hung and the hang
>>>>>> checker kicked in. You can also see that Swift got stuck in the
>>>>>> initializing state with a count of 61.
>>>>>> 
>>>>>> On Feb 20, 2012, at 4:16 PM, Mihael Hategan wrote:
>>>>>> 
>>>>>>> On Mon, 2012-02-20 at 16:14 -0600, Jonathan Monette wrote:
>>>>>>>> /gpfs/pads/swift/jonmon/Swift/tests/catsnsleep <----- on
>>>>>>>> /gpfs/pads
>>>>>>>> /home/jonmon/public_html/Swift/bugs/SciColSim/run002 <----- on
>>>>>>>> any CI machine
>>>>>>> 
>>>>>>> Ok. Sorry. I thought the last one was on beagle.
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
> 
> -- 
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 


From hategan at mcs.anl.gov  Wed Feb 22 16:28:22 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 22 Feb 2012 14:28:22 -0800
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov>
References: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
	<1329775876.6072.1.camel@blabla>
	<A33523AD-B805-4D38-A3DD-4E419922E6B8@mcs.anl.gov>
	<1329776194.6072.2.camel@blabla>
	<BE11201C-E18A-4C99-89F4-A5EFC5467A80@mcs.anl.gov>
	<1329776652.6072.3.camel@blabla>
	<C937F201-2F8A-4A4F-9CB6-2769899F198B@mcs.anl.gov>
	<2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov>
	<62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov>
Message-ID: <1329949702.23375.0.camel@blabla>

On Wed, 2012-02-22 at 15:45 -0600, Jonathan Monette wrote:
> Mihael,
>    I have a hung Java process showing this error right now, 2 jobs are
> stuck in the initializing state.  I have a jstack -l <pid> of this
> hung java process.  Is there anything else you need before I kill it?
> Do you need any other probing information from this process other than
> this jstack output?

I don't think so.


From jonmon at mcs.anl.gov  Wed Feb 22 16:33:03 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Wed, 22 Feb 2012 16:33:03 -0600
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <1329949702.23375.0.camel@blabla>
References: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
	<1329775876.6072.1.camel@blabla>
	<A33523AD-B805-4D38-A3DD-4E419922E6B8@mcs.anl.gov>
	<1329776194.6072.2.camel@blabla>
	<BE11201C-E18A-4C99-89F4-A5EFC5467A80@mcs.anl.gov>
	<1329776652.6072.3.camel@blabla>
	<C937F201-2F8A-4A4F-9CB6-2769899F198B@mcs.anl.gov>
	<2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov>
	<62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov>
	<1329949702.23375.0.camel@blabla>
Message-ID: <5B2D062C-8E9B-4DA3-A317-7FE5EAC2DABB@mcs.anl.gov>

Ok.  I have killed the process and I am in the process of copying the run directory from the lustre file system on Beagle to /gpfs/pads

On Feb 22, 2012, at 4:28 PM, Mihael Hategan wrote:

> On Wed, 2012-02-22 at 15:45 -0600, Jonathan Monette wrote:
>> Mihael,
>>   I have a hung Java process showing this error right now, 2 jobs are
>> stuck in the initializing state.  I have a jstack -l <pid> of this
>> hung java process.  Is there anything else you need before I kill it?
>> Do you need any other probing information from this process other than
>> this jstack output?
> 
> I don't think so.
> 
> 


From jonmon at mcs.anl.gov  Wed Feb 22 17:05:40 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Wed, 22 Feb 2012 17:05:40 -0600
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <5B2D062C-8E9B-4DA3-A317-7FE5EAC2DABB@mcs.anl.gov>
References: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
	<1329775876.6072.1.camel@blabla>
	<A33523AD-B805-4D38-A3DD-4E419922E6B8@mcs.anl.gov>
	<1329776194.6072.2.camel@blabla>
	<BE11201C-E18A-4C99-89F4-A5EFC5467A80@mcs.anl.gov>
	<1329776652.6072.3.camel@blabla>
	<C937F201-2F8A-4A4F-9CB6-2769899F198B@mcs.anl.gov>
	<2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov>
	<62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov>
	<1329949702.23375.0.camel@blabla>
	<5B2D062C-8E9B-4DA3-A317-7FE5EAC2DABB@mcs.anl.gov>
Message-ID: <22E47E80-C5C5-46DA-80CA-1A6063E90727@mcs.anl.gov>

This has been done.  I have also moved the run that Ketan had produced to PADS.

/gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run002    <-----Ketan's run
/gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run047    <-----My run(has a jstack.log file, also more recent)

On Feb 22, 2012, at 4:33 PM, Jonathan Monette wrote:

> Ok.  I have killed the process and I am in the process of copying the run directory from the lustre file system on Beagle to /gpfs/pads
> 
> On Feb 22, 2012, at 4:28 PM, Mihael Hategan wrote:
> 
>> On Wed, 2012-02-22 at 15:45 -0600, Jonathan Monette wrote:
>>> Mihael,
>>>  I have a hung Java process showing this error right now, 2 jobs are
>>> stuck in the initializing state.  I have a jstack -l <pid> of this
>>> hung java process.  Is there anything else you need before I kill it?
>>> Do you need any other probing information from this process other than
>>> this jstack output?
>> 
>> I don't think so.
>> 
>> 
> 


From wilde at mcs.anl.gov  Fri Feb 24 08:38:33 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 24 Feb 2012 08:38:33 -0600 (CST)
Subject: [Swift-devel] Questions on coaster behavior
In-Reply-To: <308182225.23832.1330012227581.JavaMail.root@zimbra.anl.gov>
Message-ID: <1155840334.27995.1330094313609.JavaMail.root@zimbra.anl.gov>

Hi Mihael, All,

I wanted to confirm some aspects of Coaster behavior that are still unclear to me after re-reading the UCC paper:

Scheduling: the coaster provider scheduler starts a number of worker blocks that are sized (in time and nodes) based on the size of its queue when it computes a schedule. This queue consists of jobs that were emitted by Swift to the provider based on the site throttle.

(note that by "job" here I mean the app() execution, not the LRM job).

But the coaster provider does not actually launch a job on a free coaster slot until the slot is available, right? I.e., there is no tight connection between the coaster slot that a job's time estimate contributed to, and the worker that the job is actually run on, right? Jobs are placed on workers at the last possible moment, and thus when a worker can take a job, it can get *any* job that is queued for that site. Is all this correct?  The key point behind this question being "is the coaster scheduler dynamic enough to handle cases where the job runtime estimates were conservative, and make best use of all available worker cores"?

Staging: There are no cases in which the coaster provider staging mechanism pre-stages input data, right?

Thanks,

- Mike


From hategan at mcs.anl.gov  Fri Feb 24 12:30:50 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 24 Feb 2012 10:30:50 -0800
Subject: [Swift-devel] Questions on coaster behavior
In-Reply-To: <1155840334.27995.1330094313609.JavaMail.root@zimbra.anl.gov>
References: <1155840334.27995.1330094313609.JavaMail.root@zimbra.anl.gov>
Message-ID: <1330108250.4574.7.camel@blabla>

On Fri, 2012-02-24 at 08:38 -0600, Michael Wilde wrote:
> Hi Mihael, All,
> 
> I wanted to confirm some aspects of Coaster behavior that are still unclear to me after re-reading the UCC paper:
> 
> Scheduling: the coaster provider scheduler starts a number of worker blocks that are sized (in time and nodes) based on the size of its queue when it computes a schedule. This queue consists of jobs that were emitted by Swift to the provider based on the site throttle.
> 
> (note that by "job" here I mean the app() execution, not the LRM job).
> 
> But the coaster provider does not actually launch a job on a free
> coaster slot until the slot is available, right?

That is correct. Jobs are queued by the coaster service, blocks are
submitted and killed based on the shape of the queued jobs, and once the
blocks are running, jobs are sent to them.

>  I.e., there is no tight connection between the coaster slot that a
> job's time estimate contributed to, and the worker that the job is
> actually run on, right?

That's right. Only the totals are tightly connected.

>  Jobs are placed on workers at the last possible moment, and thus when
> a worker can take a job, it can get *any* job that is queued for that
> site.

Jobs are placed on workers when workers don't have anything else to do.
Each worker will get the longest job that it can fit.

>  Is all this correct?  The key point behind this question being "is
> the coaster scheduler dynamic enough to handle cases where the job
> runtime estimates were conservative, and make best use of all
> available worker cores"? 

Yes. That's the basic idea.

> 
> Staging: There are no cases in which the coaster provider staging mechanism pre-stages input data, right?

If by pre-staging you mean staging before the job makes it to the
worker, then no. The worker initiates staging.
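
To make the placement rule above concrete ("each worker will get the longest job that it can fit"), here is a minimal sketch. It only illustrates the rule as stated in this message and is not the coaster service's actual code. The name pick_job, the (name, walltime) tuples, and the use of seconds are all assumptions made for the example.

```python
def pick_job(queue, remaining_seconds):
    """Remove and return the longest queued job that fits, or None.

    queue is a list of (name, walltime_seconds) tuples (hypothetical layout).
    remaining_seconds is the time the idle worker has left in its block.
    """
    fitting = [job for job in queue if job[1] <= remaining_seconds]
    if not fitting:
        return None
    # Longest job whose walltime estimate still fits on this worker.
    best = max(fitting, key=lambda job: job[1])
    queue.remove(best)
    return best


queue = [("a", 300), ("b", 1200), ("c", 900)]
print(pick_job(queue, 1000))  # ('c', 900)
print(queue)                  # [('a', 300), ('b', 1200)]
```

Because the choice is made only when a worker becomes idle, any queued job can end up on any worker, which matches the point that only the totals are tightly connected to the blocks.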


From jonmon at mcs.anl.gov  Fri Feb 24 16:09:54 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 24 Feb 2012 16:09:54 -0600
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <22E47E80-C5C5-46DA-80CA-1A6063E90727@mcs.anl.gov>
References: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
	<1329775876.6072.1.camel@blabla>
	<A33523AD-B805-4D38-A3DD-4E419922E6B8@mcs.anl.gov>
	<1329776194.6072.2.camel@blabla>
	<BE11201C-E18A-4C99-89F4-A5EFC5467A80@mcs.anl.gov>
	<1329776652.6072.3.camel@blabla>
	<C937F201-2F8A-4A4F-9CB6-2769899F198B@mcs.anl.gov>
	<2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov>
	<62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov>
	<1329949702.23375.0.camel@blabla>
	<5B2D062C-8E9B-4DA3-A317-7FE5EAC2DABB@mcs.anl.gov>
	<22E47E80-C5C5-46DA-80CA-1A6063E90727@mcs.anl.gov>
Message-ID: <91121EC0-D9A6-4C65-8AE3-29C1F37ED18E@mcs.anl.gov>

I have updated the bugzilla bug with the below directories: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=720

I have also added another directory showing the same behavior
/gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run054

There is a jstack.log file in that directory.  All three of the run directories show that jobs get stuck in the initialized state and the hang checker kicks in.

On Feb 22, 2012, at 5:05 PM, Jonathan Monette wrote:

> This has been done.  I have also moved the run that Ketan had produced to PADS.
> 
> /gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run002    <-----Ketan's run
> /gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run047    <-----My run(has a jstack.log file, also more recent)
> 
> On Feb 22, 2012, at 4:33 PM, Jonathan Monette wrote:
> 
>> Ok.  I have killed the process and I am in the process of copying the run directory from the lustre file system on Beagle to /gpfs/pads
>> 
>> On Feb 22, 2012, at 4:28 PM, Mihael Hategan wrote:
>> 
>>> On Wed, 2012-02-22 at 15:45 -0600, Jonathan Monette wrote:
>>>> Mihael,
>>>> I have a hung Java process showing this error right now, 2 jobs are
>>>> stuck in the initializing state.  I have a jstack -l <pid> of this
>>>> hung java process.  Is there anything else you need before I kill it?
>>>> Do you need any other probing information from this process other than
>>>> this jstack output?
>>> 
>>> I don't think so.
>>> 
>>> 
>> 


From iraicu at cs.iit.edu  Sun Feb 26 07:57:48 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Sun, 26 Feb 2012 07:57:48 -0600
Subject: [Swift-devel] CFP: The 9th Int. Conf. on Autonomic Computing (ICAC)
	2012
Message-ID: <4F4A3A5C.7010304@cs.iit.edu>

CALL FOR PAPERS

The 9th International Conference on Autonomic Computing (ICAC 2012)

September 16-20, 2012. San Jose, CA, USA
http://icac2012.cs.fiu.edu/
-----------------------------------------------------------------

IMPORTANT DATES

Paper and Poster Submission: March 9, 2012, 11:59pm PST
Notification: May 18, 2012
Camera-ready Due: June 8, 2012
-----------------------------------------------------------------

OVERVIEW
ICAC is the leading conference on autonomic computing techniques,
foundations, and applications. Autonomic computing refers to
methods and means for automated management of performance, fault,
security, and configuration with little involvement of users or
administrators. Systems introducing new autonomic features are
becoming increasingly prevalent, motivating research that spans
a variety of areas, from computer systems, networking, software
engineering, and data management to machine learning, control
theory, and bio-inspired computing. ICAC brings together
researchers and practitioners across these disciplines to
address multiple facets of adaptation and self-management in
computing systems and applications from different perspectives.
Autonomic computing solutions are sought for clouds, grids,
data centers, enterprise software, internet services, data
services, smart phones, embedded systems, and sensor networks.
In these environments, resources and applications must be managed
to maximize performance and minimize cost, while maintaining
predictable and reliable behavior in the face of varying
workloads, failures, and malicious threats. Papers are solicited
from all areas of autonomic computing, including (but not limited
to):

* End-to-end techniques for management of resources, workloads,
   performance, faults, power/cooling, security, and others.

* Self-managing components, such as server, storage, network
   protocols, or specific application elements, and embedded and
   mobile end systems such as smart phones.

* Decision and analysis techniques and their use, such as machine
   learning, control theory, predictive methods, probability and
   stochastic processes, queuing theory methodologies, emergent
   behavior, rule-based systems, and bio-inspired techniques.

* Monitoring systems for autonomic computing.

* Hypervisor, operating systems, hardware, or application support
   for autonomic computing.

* Novel human interfaces for monitoring and controlling autonomic
   systems.

* Management topics, such as specification and modeling of
   service-level agreements, behavior enforcement and tie-in with
   IT governance.

* Toolkits, frameworks, principles and architectures, from
   software engineering practices and experimental methodologies
   to agent-based techniques and virtualization.

* Fundamental science and theory of self-managing systems:
   understanding, controlling or exploiting system behaviors to
   enforce autonomic properties.

* Applications of autonomic computing and experiences with
   prototyped or deployed systems solving real-world problems in
   science, engineering, business and society.

Papers will be judged on originality, significance, interest,
correctness, clarity and relevance to the broader community.
Papers should report on experiences, measurements, user studies,
or other evaluations, as appropriate. Evaluation of a prototype
or large-scale deployment of systems and applications is expected.

PAPER AND POSTER SUBMISSIONS
Full papers (a maximum of 10 pages in the two-column ACM proceedings
format) and posters (2 pages) are invited on a wide variety of
topics relating to autonomic computing. Submitted papers must be
original work, and may not be under consideration for another
conference or journal. Complete formatting and submission
instructions can be found on the conference web site. Accepted
papers and posters will appear in proceedings distributed at the
conference and available electronically. Relevant top ICAC'12
papers will be invited for "fast-track" submissions to the
ACM Transactions on Autonomous and Adaptive Systems (TAAS).

WORKSHOPS, DEMONSTRATIONS AND EXHIBITION
ICAC'12 welcomes proposals for co-located workshops on topics of
interest to the autonomic computing community. Workshop proposals
should be submitted to the Workshop Chair, Fred Douglis
(f.douglis at computer.org) by February 10, 2012. Workshops are
expected to publish proceedings, and should cover areas that
complement the main program. ICAC'12 will also feature a
demonstration and exhibition session consisting of prototypes and
technology artifacts, such as demonstrations of autonomic software or
autonomic computing principles. Entries will be judged by a
separate committee led by the demo/exhibit chair.

INDUSTRY SESSION
One of ICAC's important roles is to bring together researchers
and practitioners from academia and industry. In its industry
session, ICAC helps fulfill this role by presenting an industry
viewpoint on technologies, products, and market needs. The
industry session also addresses current challenges, and
opportunities for academic and corporate research collaborations.
We encourage industry leaders, including entrepreneurs, product
developers, architects, managers, marketers and end users,
to submit their papers and posters reflecting such industry
perspectives as part of the regular submission process.
------------------------------------------------------------------

ORGANIZERS

GENERAL CHAIR
Dejan Milojicic, HP Labs

PROGRAM CHAIRS
Dongyan Xu, Purdue University
Vanish Talwar, HP Labs

INDUSTRY CHAIR
Xiaoyun Zhu, VMware

WORKSHOPS CHAIR
Fred Douglis, EMC

POSTERS/DEMO/EXHIBITS CHAIR
Eno Thereska, Microsoft Research

FINANCE CHAIR
Michael Kozuch, Intel

LOCAL ARRANGEMENT CHAIR
Jessica Blaine

PUBLICITY CHAIRS
Daniel Batista, University of São Paulo
Vartan Padaryan, ISP/Russian Academy of Sci.
Ioan Raicu, Illinois Inst. of Technology
Jianfeng Zhan, ICT/Chinese Academy of Sci.
Ming Zhao, Florida Intl. University

PROGRAM COMMITTEE
Tarek Abdelzaher, UIUC
Umesh Bellur, IIT, Bombay
Ken Birman, Cornell University
Rajkumar Buyya, Univ. of Melbourne
Rocky Chang, Hong Kong Polytechnic University
Yuan Chen, HP Labs
Alva Couch, Tufts University
Peter Dinda, Northwestern University
Fred Douglis, EMC
Renato Figueiredo, University of Florida
Mohamed Hefeeda, Qatar Computing Research Institute
Joe Hellerstein, Google
Geoff Jiang, NEC Labs
Jeff Kephart, IBM Research
Emre Kiciman, Microsoft Research
Fabio Kon, University of São Paulo
Michael Kozuch, Intel
Dejan Milojicic, HP Labs
Klara Nahrstedt, UIUC
Priya Narasimhan, CMU
Manish Parashar, Rutgers University
Ioan Raicu, Illinois Inst. of Technology
Omer Rana, Cardiff University
Masoud Sadjadi, Florida Intl. University
Rick Schlichting, AT&T Labs
Hartmut Schmeck, KIT
Karsten Schwan, Georgia Tech
Onn Shehory, IBM Research
Eno Thereska, Microsoft Research
Xiaoyun Zhu, VMware

-- 
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel:    1-847-722-0876
Office: 1-312-567-5704
Email:  iraicu at cs.iit.edu
Web:    http://www.cs.iit.edu/~iraicu/
Web:    http://datasys.cs.iit.edu/
=================================================================
=================================================================


From iraicu at cs.iit.edu  Sun Feb 26 08:28:38 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Sun, 26 Feb 2012 08:28:38 -0600
Subject: [Swift-devel] CFP: IEEE Int. Scalable Computing Challenge (SCALE)
	at CCGrid 2012
Message-ID: <4F4A4196.1090701@cs.iit.edu>

CALL FOR PAPERS

The Fifth IEEE International Scalable Computing Challenge (SCALE)
Co-located with the 11th CCGrid Conference in Ottawa, Canada
Sponsored by the IEEE Computer Society Technical Committee on
Scalable Computing (TCSC)

May 13-16, 2012

http://www.cloudbus.org/ccgrid2012/cfp-scale.html

---------------------------------------------------------------------
Objective and Focus: The objective of the Fifth IEEE International
Scalable Computing Challenge (SCALE 2012), sponsored by the IEEE
Computer Society Technical Committee on Scalable Computing (TCSC), is
to highlight and showcase real-world problem solving using computing
that scales.

Effective solutions to many scientific problems require applications
that can scale. There are different dimensions to application scaling:
for example, applications can scale up to a large number of cores or
compute units, scale out to utilize multiple distinct compute units,
or scale down to release resources that are no longer needed. In order
to scale, applications need the support of tools, middleware,
infrastructure, programming systems, etc. SCALE is concerned with
advances in application development and supporting infrastructure that
enable scaling.

Call for Proposals: The Fifth IEEE International Scalable Computing
Challenge (SCALE 2012) contest will focus on end-to-end problem
solving using concepts, technologies and architectures (including
Clusters, Grids and Clouds) that facilitate scaling. Participants in
the challenge will be expected to identify significant current
real-world problems where scalable computing techniques can be
effectively used, and design, implement, evaluate and demonstrate
solutions. SCALE 2012 will be held in conjunction with the 11th CCGrid
Conference in Ottawa, Canada on 13-16 May, 2012.

We invite teams to submit white papers outlining the problem addressed
and the technologies employed to enable applications to scale. White
papers should be up to 4 pages long, 12-pt. font and single column,
and in addition to listing team members and contact information,
should clearly outline:

1. The problem being solved and the technology employed

2. The application scenario and its requirements

3. Performance data and a qualitative description of how the
application scales -- scale-up, scale-out or any other type of scaling

4. The solution -- architecture, underlying concepts and technologies
used -- highlighting the innovative aspects of the solution

5. Impact of the solution, including extensibility and uniqueness of
results, and the extent to which the presented solution pushes the
envelope in scalable computing

6. Analysis of solution and technology employed compared to related
approaches

In addition to the above, finalists will be judged on the quality of
their presentation, which shall include a 5-minute demonstration, as
well as their responses to questions by a technical committee.


Papers will be shortlisted using the above 6 points as merit criteria,
and up to 6 papers will be invited to compete in a final round at
CCGrid 2012.  Selected teams will receive an award of up to $1000 to
help with travel to the conference. At least one member from each
selected team will be expected to present and demonstrate their
project at CCGrid 2012.  Participation from students and young
researchers, especially in leadership roles, is strongly encouraged.

Awards:
First  prize:   Plaque + $1000
Second prize:   Plaque + $500

Tentative timeline:
The deadline for submitting proposals is 15 March, 2012.
Decisions:  01 April 2012.
Final presentation/demo: 13-16 May, 2012.

Coordinator:
Shantenu Jha, Rutgers University, USA, shantenu dot jha at rutgers dot edu

-- 
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel:    1-847-722-0876
Office: 1-312-567-5704
Email:  iraicu at cs.iit.edu
Web:    http://www.cs.iit.edu/~iraicu/
Web:    http://datasys.cs.iit.edu/
=================================================================
=================================================================


From wilde at mcs.anl.gov  Sun Feb 26 12:43:30 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 26 Feb 2012 12:43:30 -0600 (CST)
Subject: [Swift-devel] Example files for MATLAB parameter sweep
In-Reply-To: <1915879243.33439.1330281166171.JavaMail.root@zimbra.anl.gov>
Message-ID: <1065251823.33445.1330281810169.JavaMail.root@zimbra.anl.gov>

Hi Lorenzo and Albert,

You can find a new tutorial example of a parameter sweep by following the README at:

  https://svn.ci.uchicago.edu/svn/vdl2/trunk/examples/tutorial/ParameterSweep/README

(which is also pasted below).

This is a simple example which you can run on any local host (e.g. sandbox.beagle) after you do "module load swift".

Over time we will add this to the Swift tutorial document, test it, etc.

Lorenzo: this is meant to give you and Albert a base example (non-MATLAB) from which you can create the MATLAB example(s).  Ideally we will grow this into a tutorial sequence that shows a few useful variations of organizing a parameter sweep or ensemble of simulations, including passing parameters only via files, or via a combination of Swift variables and files. We welcome your help in developing this, starting with the MATLAB version of it.  The first thing for that would be to develop the MATLAB replacements for gensweep.sh and simulate.sh.  These two "apps" are meant to be stand-ins for the equivalent MATLAB programs. They use a simple two-column "name value" file format to simulate a .mat file.
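
A minimal sketch of the two-column "name value" file format described above, assuming one whitespace-separated pair per line. The parameter names (alpha, steps) and the helper get_param are hypothetical illustrations, not taken from gensweep.sh or simulate.sh.

```shell
#!/bin/sh
# Hypothetical stand-in for a .mat file: one "name value" pair per line.
params=$(mktemp)
cat > "$params" <<'EOF'
alpha 0.5
steps 100
EOF

# get_param FILE NAME: print the value stored under NAME (first match wins).
get_param() {
  awk -v name="$2" '$1 == name { print $2; exit }' "$1"
}

alpha=$(get_param "$params" alpha)
steps=$(get_param "$params" steps)
echo "alpha=$alpha steps=$steps"
```

A MATLAB replacement would presumably read and write a real .mat file instead; the flat text format only keeps the stand-in apps easy to inspect.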

I'll add you to the Swift committers list so you can place anything you add in SVN.

David: this doesn't yet use gensites. Can we add gensites without adding any complexity to the sweep.sh script?  Or do we want a version with and without?  Hopefully only with.

We should extend to use PADS, Fusion, Beagle, MCS servers, FutureGrid, TrySwift, and more.

Can you start adding this to the tutorial asciidoc?

Jon: I simplified the handling of run dir creation. Maybe we can refit this into swiftopt.sh?

We can do this as a collaborative exercise because the result will be of great benefit to all new Swift users, MATLAB and non-MATLAB alike. In fact we should do a version of it for Python, Octave, MATLAB, and R.

Regards,

- Mike

$ cat README

This directory contains an example of running a "parameter sweep" or
"ensemble" of N simulations or "members".

To run:

  # make sure Swift 0.93 or trunk is in your $PATH

  svn co https://svn.ci.uchicago.edu/svn/vdl2/trunk/examples/tutorial/ParameterSweep
  cd ParameterSweep

  ./sweep.sh    # Runs default sweep of 5 members with 3 common data/parameter files

  ./sweep.sh -nMembers=20 -nCommon=2 # 20  members with 2 common data/parameter files

  # Each run is executed in a new unique runNNN directory: run001, run002, ...

  # tc, sites file (local.xml), and Swift properties files (cf) are generated by sweep.sh
$ 
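
The runNNN convention mentioned in the README could be implemented along the following lines. This is a sketch under assumptions, not the logic that sweep.sh actually uses; next_rundir is a hypothetical helper name.

```shell
#!/bin/sh
# next_rundir BASE: create and print the first unused BASE/runNNN directory.
next_rundir() {
  base=$1
  [ -d "$base" ] || return 1
  n=1
  while [ "$n" -le 999 ]; do
    dir=$(printf '%s/run%03d' "$base" "$n")
    # mkdir without -p fails if the directory already exists,
    # so a successful mkdir claims the name for this run.
    if mkdir "$dir" 2>/dev/null; then
      echo "$dir"
      return 0
    fi
    n=$((n + 1))
  done
  return 1
}

work=$(mktemp -d)
first=$(next_rundir "$work")
second=$(next_rundir "$work")
echo "$first"
echo "$second"
```

Relying on mkdir to fail on an existing directory avoids a separate existence check, which matters if two sweeps are started at the same time.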


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From jonmon at mcs.anl.gov  Sun Feb 26 13:13:41 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Sun, 26 Feb 2012 13:13:41 -0600
Subject: [Swift-devel] Example files for MATLAB parameter sweep
In-Reply-To: <1065251823.33445.1330281810169.JavaMail.root@zimbra.anl.gov>
References: <1065251823.33445.1330281810169.JavaMail.root@zimbra.anl.gov>
Message-ID: <0D14FE27-7B46-440B-B371-A43A49D7F119@mcs.anl.gov>


On Feb 26, 2012, at 12:43 PM, Michael Wilde wrote:

> Jon: I simplified the handling of run dir creation. Maybe we can refit this into swiftopt.sh?

This has been done.

From jonmon at mcs.anl.gov  Sun Feb 26 17:12:18 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Sun, 26 Feb 2012 17:12:18 -0600
Subject: [Swift-devel] Walltime exceeded error
In-Reply-To: <91121EC0-D9A6-4C65-8AE3-29C1F37ED18E@mcs.anl.gov>
References: <F23E582A-C2C9-4D4A-B142-589BBA526E6B@mcs.anl.gov>
	<1329775876.6072.1.camel@blabla>
	<A33523AD-B805-4D38-A3DD-4E419922E6B8@mcs.anl.gov>
	<1329776194.6072.2.camel@blabla>
	<BE11201C-E18A-4C99-89F4-A5EFC5467A80@mcs.anl.gov>
	<1329776652.6072.3.camel@blabla>
	<C937F201-2F8A-4A4F-9CB6-2769899F198B@mcs.anl.gov>
	<2E6ADE45-898F-4B0A-BD52-F7EAF00654E0@mcs.anl.gov>
	<62FED0FE-F7AD-4305-9496-CDFEBC798764@mcs.anl.gov>
	<1329949702.23375.0.camel@blabla>
	<5B2D062C-8E9B-4DA3-A317-7FE5EAC2DABB@mcs.anl.gov>
	<22E47E80-C5C5-46DA-80CA-1A6063E90727@mcs.anl.gov>
	<91121EC0-D9A6-4C65-8AE3-29C1F37ED18E@mcs.anl.gov>
Message-ID: <77EC683C-7A68-4FB0-AD6D-C229196126E3@mcs.anl.gov>

I have again updated the bug: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=720
It now includes steps to reproduce the problem with a small test taken from the application we are running.  The steps set up the application to run on whatever machine you are testing on.

This turns out to be a Swift bug, not a coaster bug.  The test in /gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run014 is a local test: it did not use coasters at all, and the hang checker still kicked in.

On Feb 24, 2012, at 4:09 PM, Jonathan Monette wrote:

> I have updated the bugzilla bug with the below directories: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=720
> 
> I have also added another directory showing the same behavior
> /gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run054
> 
> There is a jstack.log file in that directory.  All three of the run directories show that jobs get stuck in the initialized state and the hang checker kicks in.
> 
> On Feb 22, 2012, at 5:05 PM, Jonathan Monette wrote:
> 
>> This has been done.  I have also moved the run that Ketan had produced to PADS.
>> 
>> /gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run002    <-----Ketan's run
>> /gpfs/pads/swift/jonmon/Swift/bugs/SciColSim/run047    <-----My run(has a jstack.log file, also more recent)
>> 
>> On Feb 22, 2012, at 4:33 PM, Jonathan Monette wrote:
>> 
>>> Ok.  I have killed the process and I am in the process of copying the run directory from the lustre file system on Beagle to /gpfs/pads
>>> 
>>> On Feb 22, 2012, at 4:28 PM, Mihael Hategan wrote:
>>> 
>>>> On Wed, 2012-02-22 at 15:45 -0600, Jonathan Monette wrote:
>>>>> Mihael,
>>>>> I have a hung Java process showing this error right now, 2 jobs are
>>>>> stuck in the initializing state.  I have a jstack -l <pid> of this
>>>>> hung java process.  Is there anything else you need before I kill it?
>>>>> Do you need any other probing information from this process other than
>>>>> this jstack output?
>>>> 
>>>> I don't think so.
>>>> 
>>>> 
>>> 
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>> 
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel


From wilde at mcs.anl.gov  Mon Feb 27 12:58:05 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 27 Feb 2012 12:58:05 -0600 (CST)
Subject: [Swift-devel] Did we add:
Message-ID: <1963723478.36503.1330369085034.JavaMail.root@zimbra.anl.gov>

Hi All,

Does anyone know if the enhancement described in bug 359 ("Add ability to set ENV vars, maxwalltime, and RAM requirements on app invocation") was ever done?

  https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=359

I thought some form of it *was*, but I can't find any discussion of that feature in the devel archive, my email, or bugzilla.  Was this just wishful thinking, or does some form of the ability to set profile values on a per-app-call basis actually exist?

Thanks,

- Mike


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From ketancmaheshwari at gmail.com  Mon Feb 27 13:03:55 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Mon, 27 Feb 2012 13:03:55 -0600
Subject: [Swift-devel] Did we add:
In-Reply-To: <1963723478.36503.1330369085034.JavaMail.root@zimbra.anl.gov>
References: <1963723478.36503.1330369085034.JavaMail.root@zimbra.anl.gov>
Message-ID: <CAMUuviopSE2gZvkM1eR_V_G9Nfoex1CsdeuMvxpZzF5vYJbUKg@mail.gmail.com>

I think this entry in sites.xml:

<profile namespace="env" key="CLASSPATH"><val></profile>

is intended to set environment variables for an application.


On Mon, Feb 27, 2012 at 12:58 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:

> Hi All,
>
> Does anyone know if the enhancement described in bug 359 ("Add ability to
> set ENV vars, maxwalltime, and RAM requirements on app invocation") was
> ever done?
>
>  https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=359
>
> I thought some form of it *was*, but I cant find any discussion of that
> feature in the devel archive, my email, or bugzilla.  Was this just wishful
> thinking, or does some form of the ability to set profile values on a
> per-app-call basis actually exist?
>
> Thanks,
>
> - Mike
>
>
>
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>


-- 
Ketan

From wilde at mcs.anl.gov  Mon Feb 27 13:16:21 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 27 Feb 2012 13:16:21 -0600 (CST)
Subject: [Swift-devel] Did we add: dynamic profile entries?
In-Reply-To: <CAMUuviopSE2gZvkM1eR_V_G9Nfoex1CsdeuMvxpZzF5vYJbUKg@mail.gmail.com>
Message-ID: <1797820808.36665.1330370181668.JavaMail.root@zimbra.anl.gov>

Ketan,

That <profile> element sets a profile entry for all jobs on a site.

Setting the profile on a tc entry sets it for all calls of the given app.

What bug 359 is asking for is the ability to set profile entries dynamically on a per-app-invocation basis, e.g. to give each invocation a specific time or memory limit or env var value, dynamically calculated in the Swift script.
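
For contrast with the dynamic case, here is a small sketch that emits the static, site-wide form of the entry quoted in this thread. The helper profile_entry and the value /path/to/classes are hypothetical (the quoted message left the value as a placeholder), and the sketch does not show the per-invocation syntax that bug 359 asks for.

```shell
#!/bin/sh
# profile_entry NAMESPACE KEY VALUE: print one sites.xml-style <profile> element.
# No XML escaping is done, so VALUE must not contain <, > or &.
profile_entry() {
  printf '<profile namespace="%s" key="%s">%s</profile>\n' "$1" "$2" "$3"
}

# Hypothetical value, for illustration only.
profile_entry env CLASSPATH /path/to/classes
```

An entry written this way applies to every job sent to the site, which is why it cannot express a limit or variable computed separately for each app call.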

- Mike


----- Original Message -----
> From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Monday, February 27, 2012 1:03:55 PM
> Subject: Re: [Swift-devel] Did we add:
> I think this on sites.xml:
> 
> 
> <profile namespace="env" key="CLASSPATH"><val></profile>
> 
> is intended to do env for application.
> 
> 
> 
> On Mon, Feb 27, 2012 at 12:58 PM, Michael Wilde < wilde at mcs.anl.gov >
> wrote:
> 
> 
> Hi All,
> 
> Does anyone know if the enhancement described in bug 359 ("Add ability
> to set ENV vars, maxwalltime, and RAM requirements on app invocation")
> was ever done?
> 
> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=359
> 
> I thought some form of it *was*, but I cant find any discussion of
> that feature in the devel archive, my email, or bugzilla. Was this
> just wishful thinking, or does some form of the ability to set profile
> values on a per-app-call basis actually exist?
> 
> Thanks,
> 
> - Mike
> 
> 
> 
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> 
> 
> 
> 
> --
> Ketan

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wozniak at mcs.anl.gov  Mon Feb 27 13:42:00 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 27 Feb 2012 13:42:00 -0600 (CST)
Subject: [Swift-devel] Did we add:
In-Reply-To: <CAMUuviopSE2gZvkM1eR_V_G9Nfoex1CsdeuMvxpZzF5vYJbUKg@mail.gmail.com>
References: <1963723478.36503.1330369085034.JavaMail.root@zimbra.anl.gov>
	<CAMUuviopSE2gZvkM1eR_V_G9Nfoex1CsdeuMvxpZzF5vYJbUKg@mail.gmail.com>
Message-ID: <alpine.DEB.2.02.1202271340180.2396@wozniak-laptop-u>


It's in there:

http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_dynamic_profiles

 	Justin

On Mon, 27 Feb 2012, Ketan Maheshwari wrote:

> I think this on sites.xml:
>
> <profile namespace="env" key="CLASSPATH"><val></profile>
>
> is intended to do env for application.
>
>
> On Mon, Feb 27, 2012 at 12:58 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:
>
>> Hi All,
>>
>> Does anyone know if the enhancement described in bug 359 ("Add ability to
>> set ENV vars, maxwalltime, and RAM requirements on app invocation") was
>> ever done?
>>
>>  https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=359
>>
>> I thought some form of it *was*, but I cant find any discussion of that
>> feature in the devel archive, my email, or bugzilla.  Was this just wishful
>> thinking, or does some form of the ability to set profile values on a
>> per-app-call basis actually exist?
>>
>> Thanks,
>>
>> - Mike
>>
>>
>>
>>
>> --
>> Michael Wilde
>> Computation Institute, University of Chicago
>> Mathematics and Computer Science Division
>> Argonne National Laboratory
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>>
>
>
>
>

-- 
Justin M Wozniak


From wilde at mcs.anl.gov  Mon Feb 27 14:16:58 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 27 Feb 2012 14:16:58 -0600 (CST)
Subject: [Swift-devel] Did we add:
In-Reply-To: <alpine.DEB.2.02.1202271340180.2396@wozniak-laptop-u>
Message-ID: <808444397.37136.1330373818039.JavaMail.root@zimbra.anl.gov>

Awesome - thanks, I recall now!

- Mike

----- Original Message -----
> From: "Justin M Wozniak" <wozniak at mcs.anl.gov>
> To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Monday, February 27, 2012 1:42:00 PM
> Subject: Re: [Swift-devel] Did we add:
> It's in there:
> 
> http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_dynamic_profiles
> 
> Justin
> 
> On Mon, 27 Feb 2012, Ketan Maheshwari wrote:
> 
> > I think this on sites.xml:
> >
> > <profile namespace="env" key="CLASSPATH"><val></profile>
> >
> > is intended to do env for application.
> >
> >
> > On Mon, Feb 27, 2012 at 12:58 PM, Michael Wilde <wilde at mcs.anl.gov>
> > wrote:
> >
> >> Hi All,
> >>
> >> Does anyone know if the enhancement described in bug 359 ("Add
> >> ability to
> >> set ENV vars, maxwalltime, and RAM requirements on app invocation")
> >> was
> >> ever done?
> >>
> >>  https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=359
> >>
> >> I thought some form of it *was*, but I cant find any discussion of
> >> that
> >> feature in the devel archive, my email, or bugzilla. Was this just
> >> wishful
> >> thinking, or does some form of the ability to set profile values on
> >> a
> >> per-app-call basis actually exist?
> >>
> >> Thanks,
> >>
> >> - Mike
> >>
> >>
> >>
> >>
> >> --
> >> Michael Wilde
> >> Computation Institute, University of Chicago
> >> Mathematics and Computer Science Division
> >> Argonne National Laboratory
> >>
> >> _______________________________________________
> >> Swift-devel mailing list
> >> Swift-devel at ci.uchicago.edu
> >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >>
> >
> >
> >
> >
> 
> --
> Justin M Wozniak
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From svemalayan at yahoo.com  Mon Feb 27 19:33:27 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Mon, 27 Feb 2012 17:33:27 -0800 (PST)
Subject: [Swift-devel] coaster-service.conf
In-Reply-To: <1330042098.422.YahooMailNeo@web39507.mail.mud.yahoo.com>
References: <1330042098.422.YahooMailNeo@web39507.mail.mud.yahoo.com>
Message-ID: <1330392807.45327.YahooMailNeo@web39506.mail.mud.yahoo.com>

Hi All,

When I launch coaster-service, it allocates the job with the default kernel profile "zeptoos". I want to change the profile to "zepto-vn-eval/mosatest".
How can I do this? Is there a configuration parameter for changing the qsub command?


If so, how can I specify or pass this parameter (via a command-line parameter to start-coaster-service, or via a setting in coaster-service.conf)?
Also, could you please tell me the exact format? Please point me to a document if there is one.


Thank you
Emalayan

From jonmon at mcs.anl.gov  Mon Feb 27 20:13:18 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 27 Feb 2012 20:13:18 -0600
Subject: [Swift-devel] [Swift-user] coaster-service.conf
In-Reply-To: <1330392807.45327.YahooMailNeo@web39506.mail.mud.yahoo.com>
References: <1330042098.422.YahooMailNeo@web39507.mail.mud.yahoo.com>
	<1330392807.45327.YahooMailNeo@web39506.mail.mud.yahoo.com>
Message-ID: <53278513-44C1-48AF-9026-92F9A34487F1@mcs.anl.gov>

I assume you mean the "-k" option in the cqsub command.  Currently this is hard-coded in the start-coaster-service script, which always uses zeptoos.  I have made a quick change that uses a KERNEL variable, which needs to be set in the coaster-service.conf file.  Are you using your own checkout of trunk or the one in Justin's home directory?
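
A rough sketch of how a KERNEL setting might be picked up, assuming coaster-service.conf is a file of shell variable assignments that gets sourced. Only the KERNEL name, the zeptoos default, and the cqsub "-k" option come from this thread; the helper kernel_args is hypothetical, and the actual change in start-coaster-service may look different.

```shell
#!/bin/sh
# Hypothetical coaster-service.conf fragment; only the KERNEL name is from the thread.
conf=$(mktemp)
cat > "$conf" <<'EOF'
KERNEL=zepto-vn-eval/mosatest
EOF

# kernel_args CONF: print the cqsub kernel option, falling back to zeptoos
# when CONF is unreadable or does not set KERNEL.
kernel_args() {
  (
    # Subshell keeps the sourced settings out of the caller's environment.
    unset KERNEL
    [ -r "$1" ] && . "$1"
    echo "-k ${KERNEL:-zeptoos}"
  )
}

kernel_args "$conf"
kernel_args /nonexistent
```

Falling back to zeptoos keeps existing configurations working unchanged when the variable is absent.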

On Feb 27, 2012, at 7:33 PM, Emalayan Vairavanathan wrote:

> Hi All,
> 
> When I launch coaster-service, it allocates job with default kernel profile "zeptoos". I want to change the profile to "zepto-vn-eval/mosatest".
> How can I do this ? Is there any configuration parameter available to change qsub command?
> 
> If so how I can specify / pass this parameter?  (via start-coaster-service 's command-line parameter or via a setting in coaster-service.conf)
> 
> Also could you please tell me the exact format ? Please point to me a document if there is any.
> 
> 
> Thank you
> Emalayan
> 
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user


From svemalayan at yahoo.com  Mon Feb 27 21:06:33 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Mon, 27 Feb 2012 19:06:33 -0800 (PST)
Subject: [Swift-devel] [Swift-user] coaster-service.conf
In-Reply-To: <53278513-44C1-48AF-9026-92F9A34487F1@mcs.anl.gov>
References: <1330042098.422.YahooMailNeo@web39507.mail.mud.yahoo.com>
	<1330392807.45327.YahooMailNeo@web39506.mail.mud.yahoo.com>
	<53278513-44C1-48AF-9026-92F9A34487F1@mcs.anl.gov>
Message-ID: <1330398393.5937.YahooMailNeo@web39503.mail.mud.yahoo.com>

Hi Jon

Thank you very much. Please find my answers below.

> I assume you mean the "-k" option in the cqsub command.

What I meant was the "--kernel" option in qsub.

> Are you using your own checkout of trunk or the one in Justin's home directory?

The code from Justin's home directory.


Thank you very much.

Emalayan


________________________________
 From: Jonathan Monette <jonmon at mcs.anl.gov>
To: Emalayan Vairavanathan <svemalayan at yahoo.com> 
Cc: swift user <swift-user at ci.uchicago.edu>; "swift-devel at ci.uchicago.edu" <swift-devel at ci.uchicago.edu>; MosaStore <mosastore at googlegroups.com> 
Sent: Monday, 27 February 2012 6:13 PM
Subject: Re: [Swift-user] coaster-service.conf
 

I assume you mean the "-k" option in the cqsub command.  So currently this is hard coded into the start-coaster-service script, it always uses zeptoos.  I have made a quick change that uses a KERNEL variable that needs to be set in the coaster-service.conf file.  Are you using your own checkout of trunk or the one in Justin's home directory?


On Feb 27, 2012, at 7:33 PM, Emalayan Vairavanathan wrote:

Hi All,
>
>
>When I launch coaster-service, it allocates job with default kernel profile "zeptoos". I want to change the profile to "zepto-vn-eval/mosatest".
>How can I do this ? Is there any configuration parameter available to change qsub command?
>
>
>If so how I can specify / pass this parameter? (via start-coaster-service's command-line parameter or via a setting in coaster-service.conf)
>
>Also could you please tell me the exact format ? Please point to me a document if there is any.
>
>
>
>
>Thank you
>Emalayan
>
>
>_______________________________________________
>Swift-user mailing list
>Swift-user at ci.uchicago.edu
>https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user

From jonmon at mcs.anl.gov  Mon Feb 27 21:39:14 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Mon, 27 Feb 2012 21:39:14 -0600
Subject: [Swift-devel] [Swift-user] coaster-service.conf
In-Reply-To: <1330398393.5937.YahooMailNeo@web39503.mail.mud.yahoo.com>
References: <1330042098.422.YahooMailNeo@web39507.mail.mud.yahoo.com>
	<1330392807.45327.YahooMailNeo@web39506.mail.mud.yahoo.com>
	<53278513-44C1-48AF-9026-92F9A34487F1@mcs.anl.gov>
	<1330398393.5937.YahooMailNeo@web39503.mail.mud.yahoo.com>
Message-ID: <37C4069A-300B-4FB8-BCFD-C8E97143FDBA@mcs.anl.gov>


On Feb 27, 2012, at 9:06 PM, Emalayan Vairavanathan wrote:

> Hi Jon
> 
> Thank you very much. Please find my ans below.
> 
> I assume you mean the "-k" option in the cqsub command. 
> 
> What I meant was "--kernel" option in qsub. 

The start-coaster-service script for Cobalt uses cqsub, not qsub.  qsub has the --kernel option, while cqsub has the -k option.  Looking at the man pages, they seem to be the same, but I am not sure.  Justin can probably provide more information on the difference, or point to a document that explains when one should be used over the other.
> 
> 
> Are you using your own checkout of trunk or the one in Justin's home directory?
> 
> Code from Justin's home directory

Then Justin will have to do an svn up tomorrow for you.  We should probably figure out a way for you to have your own stable copy to make changes to, so that waiting for us to update that copy does not block your progress.
> 
> 
> Thank you very much.
> Emalayan
> From: Jonathan Monette <jonmon at mcs.anl.gov>
> To: Emalayan Vairavanathan <svemalayan at yahoo.com> 
> Cc: swift user <swift-user at ci.uchicago.edu>; "swift-devel at ci.uchicago.edu" <swift-devel at ci.uchicago.edu>; MosaStore <mosastore at googlegroups.com> 
> Sent: Monday, 27 February 2012 6:13 PM
> Subject: Re: [Swift-user] coaster-service.conf
> 
> I assume you mean the "-k" option in the cqsub command.  So currently this is hard coded into the start-coaster-service script, it always uses zeptoos.  I have made a quick change that uses a KERNEL variable that needs to be set in the coaster-service.conf file.  Are you using your own checkout of trunk or the one in Justin's home directory?
> 
> On Feb 27, 2012, at 7:33 PM, Emalayan Vairavanathan wrote:
> 
>> Hi All,
>> 
>> When I launch coaster-service, it allocates job with default kernel profile "zeptoos". I want to change the profile to "zepto-vn-eval/mosatest".
>> How can I do this ? Is there any configuration parameter available to change qsub command?
>> 
>> If so how I can specify / pass this parameter?  (via start-coaster-service 's command-line parameter or via a setting in coaster-service.conf)
>> 
>> Also could you please tell me the exact format ? Please point to me a document if there is any.
>> 
>> 
>> Thank you
>> Emalayan
>> 
>> 
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> 
> 
> 


From svemalayan at yahoo.com  Mon Feb 27 22:59:20 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Mon, 27 Feb 2012 20:59:20 -0800 (PST)
Subject: [Swift-devel] [Swift-user] coaster-service.conf
In-Reply-To: <37C4069A-300B-4FB8-BCFD-C8E97143FDBA@mcs.anl.gov>
References: <1330042098.422.YahooMailNeo@web39507.mail.mud.yahoo.com>
	<1330392807.45327.YahooMailNeo@web39506.mail.mud.yahoo.com>
	<53278513-44C1-48AF-9026-92F9A34487F1@mcs.anl.gov>
	<1330398393.5937.YahooMailNeo@web39503.mail.mud.yahoo.com>
	<37C4069A-300B-4FB8-BCFD-C8E97143FDBA@mcs.anl.gov>
Message-ID: <1330405160.59260.YahooMailNeo@web39508.mail.mud.yahoo.com>

So the start-coaster-service for cobalt uses cqsub and not qsub.  qsub
has the --kernel option while cqsub has the -k option.  Looking at the
man pages they seem to be the same, but not sure.  Justin can probably
provide more information on the difference, or point to a document that
explains when one should be used over the other.

>> Thank you very much for fixing it, Jon. I am not sure about the trade-offs, though. I can switch to cqsub if it is necessary.


Then Justin will have to do a svn up tomorrow for you.  We probably
should figure out a way for you to have your own stable copy to make
changes to, so that us updating that copy is not a blocker for you
progressing further.

>> Yes. We need to talk about this in Wednesday's meeting.


Justin, could you please do an update?

Thank you
Emalayan


________________________________
 From: Jonathan Monette <jonmon at mcs.anl.gov>
To: Emalayan Vairavanathan <svemalayan at yahoo.com> 
Cc: swift user <swift-user at ci.uchicago.edu>; "swift-devel at ci.uchicago.edu" <swift-devel at ci.uchicago.edu>; MosaStore <mosastore at googlegroups.com> 
Sent: Monday, 27 February 2012 7:39 PM
Subject: Re: [Swift-user] coaster-service.conf
 

On Feb 27, 2012, at 9:06 PM, Emalayan Vairavanathan wrote:

Hi Jon
>
>
>Thank you very much. Please find my ans below.
>
>
>I assume you mean the "-k" option in the cqsub command. 
>
>
>
>What I meant was "--kernel" option in qsub. 
>
So the start-coaster-service for cobalt uses cqsub and not qsub.  qsub has the --kernel option while cqsub has the -k option.  Looking at the man pages they seem to be the same, but not sure.  Justin can probably provide more information on the difference, or point to a document that explains when one should be used over the other.


>
>
>
>Are you using your own checkout of trunk or the one in Justin's home directory?
>
>
>Code from Justin's home directory
Then Justin will have to do a svn up tomorrow for you.  We probably should figure out a way for you to have your own stable copy to make changes to, so that us updating that copy is not a blocker for you progressing further.


>
>
>
>Thank you very much.
>
>Emalayan
>
>
>________________________________
> From: Jonathan Monette <jonmon at mcs.anl.gov>
>To: Emalayan Vairavanathan <svemalayan at yahoo.com> 
>Cc: swift user <swift-user at ci.uchicago.edu>; "swift-devel at ci.uchicago.edu" <swift-devel at ci.uchicago.edu>; MosaStore <mosastore at googlegroups.com> 
>Sent: Monday, 27 February 2012 6:13 PM
>Subject: Re: [Swift-user] coaster-service.conf
> 
>
>I assume you mean the "-k" option in the cqsub command.  So currently this is hard coded into the start-coaster-service script, it always uses zeptoos.  I have made a quick change that uses a KERNEL variable that needs to be set in the coaster-service.conf file.  Are you using your own checkout of trunk or the one in Justin's home directory?
>
>
>On Feb 27, 2012, at 7:33 PM, Emalayan Vairavanathan wrote:
>
>Hi All,
>>
>>
>>When I launch coaster-service, it allocates job with default kernel profile "zeptoos". I want to change the profile to "zepto-vn-eval/mosatest".
>>How can I do this ? Is there any configuration parameter available to change qsub command?
>>
>>
>>If so how I can specify / pass this parameter? (via start-coaster-service's command-line parameter or via a setting in coaster-service.conf)
>>
>>Also could you please tell me the exact format ? Please point to me a document if there is any.
>>
>>
>>
>>
>>Thank you
>>Emalayan
>>
>>
>>_______________________________________________
>>Swift-user mailing list
>>Swift-user at ci.uchicago.edu
>>https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>
>
>

From svemalayan at yahoo.com  Tue Feb 28 13:53:57 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Tue, 28 Feb 2012 11:53:57 -0800 (PST)
Subject: [Swift-devel] Running applications with Swift on Surveyor
Message-ID: <1330458837.43066.YahooMailNeo@web39506.mail.mud.yahoo.com>

Hi All,

I have a quick question. 


It seems the steps I was following to run applications on BG/P with Swift are different from the steps suggested by https://sites.google.com/site/exmproject/development/mosaswift. I was running applications+Swift from the head node by just submitting the command below.

swift -config cf -tc.file tc -sites.file sites.xml ftdock.swift -n=1 -list=pdb.list -grid=10

I didn't start the coaster-service, but my sites file was using coaster as the execution provider. Swift then allocated some nodes, executed the job, and placed the result in my home directory. (My assumption here was that the coaster-service and workers would be started automatically by Swift.)


But the above link suggests that I use persistent-coasters, make changes to the coaster config files, and also start the coaster-service on the head node.


Basically I have three questions:

1) What is the difference between Coasters and Persistent-Coasters?
2) How was I able to run the swift+application without starting the coaster-service, since the coaster-service is expected to be started manually (according to the above link)? Does Swift use some other mechanism to send a job if the coaster-service is not started explicitly?
3) How should I run my experiments in future with MosaStore and Swift? Should I use Coasters or Persistent-Coasters?


Thank you
Emalayan

From jonmon at mcs.anl.gov  Tue Feb 28 14:21:18 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Tue, 28 Feb 2012 14:21:18 -0600
Subject: [Swift-devel] Running applications with Swift on Surveyor
In-Reply-To: <1330458837.43066.YahooMailNeo@web39506.mail.mud.yahoo.com>
References: <1330458837.43066.YahooMailNeo@web39506.mail.mud.yahoo.com>
Message-ID: <68E4F9BB-E06E-4057-84B2-AC38FF14396E@mcs.anl.gov>

Hey Emalayan,
   My answers are below.

On Feb 28, 2012, at 1:53 PM, Emalayan Vairavanathan wrote:

> Hi All,
> 
> I have a quick question. 
> 
> It seems the step I was following to run the applications on BG/P with swift is different from the steps suggested by https://sites.google.com/site/exmproject/development/mosaswift. I was running applications+Swift from head node by just submitting a command below.
> 
> swift -config cf  -tc.file tc -sites.file sites.xml ftdock.swift -n=1 -list=pdb.list -grid=10
> 
> I didnt start the coaster-service but my site file was using coaster as execution-provider (in site files). Then Swift allocated some nodes and executed the job and placed the result in my home directory. (My assumption here was coaster-service and workers will be started automatically by swift). 
> 
> But the above link suggests me to use persistent-coasters, changes to coaster-config files and also to start coaster-service in the head node.
> 
> 
> Basically I have three questions:
> 
> 1) What is the different between Coasters and Persistent-Coasters?

The mechanism is named Coaster.  The "persistent" part of the name refers to the workers: they persist across Swift executions.  You can re-use the same workers you started with the 'start-coaster-service' script for many Swift executions.  When not running in persistent mode (i.e. the automatic mode), the coaster service and the workers are killed before Swift comes to completion.

> 2) How I was able to run the swift+application without starting the coaster-service, since coaster-service is expected to be started manually (according to the above link) ? Does swift use some other mechanisms to send a job if coaster-service is not started explicitly?

So if you just say the execution provider is coaster, then Swift will start the coaster service and the workers automatically.  Swift will then shut down the coaster service and the workers once the Swift script is done executing.  Persistent-coasters will wait until you stop the service explicitly with the 'stop-coaster-service' script.  This script will shut down the service and any workers connected to it.

> 3) How I need to run my experiments in future with MosaStore and Swift ? Should I use Coasters / Persistent-Coasters ?

I think you want to use the persistent-coaster mode where you start the workers manually with the 'start-coaster-service' script.
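
To make the persistent mode concrete, here is a dry-run sketch of what a session could look like. It only prints the commands instead of running them. The helper names run and persistent_session are made up for illustration, the two service scripts are shown without arguments because their options are not covered here, and the swift command line is simply the one from your mail.

```shell
#!/bin/bash
# Dry-run sketch: prints each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# Hypothetical outline of a persistent-coasters session.
persistent_session() {
    local i
    run start-coaster-service        # start the service and workers once
    for i in 1 2; do                 # several Swift runs reuse the same workers
        run swift -config cf -tc.file tc -sites.file sites.xml ftdock.swift -n=1 -list=pdb.list -grid=10
    done
    run stop-coaster-service         # explicit shutdown of service and workers
}

persistent_session
```

In the automatic mode there would be no start or stop step at all; each swift invocation would bring up and tear down its own service and workers.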
> 
> Thank you
> Emalayan
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel


From wilde at mcs.anl.gov  Tue Feb 28 14:23:50 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 28 Feb 2012 14:23:50 -0600 (CST)
Subject: [Swift-devel] Making Swift run on Eureka
In-Reply-To: <1959174301.41783.1330460542638.JavaMail.root@zimbra.anl.gov>
Message-ID: <596189083.41792.1330460630800.JavaMail.root@zimbra.anl.gov>

I asked Jon to make Swift run (in automatic coaster mode, then manual coaster mode) on Eureka. The first step is to find the status of the relevant ALCF Cobalt ticket against Cobalt on Eureka. That's listed in the Swift ticket for this problem:

  https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=245

If anyone has knowledge or advice, please reply.

Thanks,

- Mike


From svemalayan at yahoo.com  Tue Feb 28 14:33:06 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Tue, 28 Feb 2012 12:33:06 -0800 (PST)
Subject: [Swift-devel] Running applications with Swift on Surveyor
In-Reply-To: <68E4F9BB-E06E-4057-84B2-AC38FF14396E@mcs.anl.gov>
References: <1330458837.43066.YahooMailNeo@web39506.mail.mud.yahoo.com>
	<68E4F9BB-E06E-4057-84B2-AC38FF14396E@mcs.anl.gov>
Message-ID: <1330461186.83110.YahooMailNeo@web39504.mail.mud.yahoo.com>

Great. Thank you very much Jon. I have a better understanding about both approaches now.


________________________________
 From: Jonathan Monette <jonmon at mcs.anl.gov>
To: Emalayan Vairavanathan <svemalayan at yahoo.com> 
Cc: "swift-devel at ci.uchicago.edu" <swift-devel at ci.uchicago.edu>; MosaStore <mosastore at googlegroups.com> 
Sent: Tuesday, 28 February 2012 12:21 PM
Subject: Re: [Swift-devel] Running applications with Swift on Surveyor
 

Hey Emalayan,
   My answers are below.


On Feb 28, 2012, at 1:53 PM, Emalayan Vairavanathan wrote:

Hi All,
>
>
>I have a quick question. 
>
>
>
>It seems the step I was following to run the applications on BG/P with swift is different from the steps suggested by https://sites.google.com/site/exmproject/development/mosaswift. I was running applications+Swift from head node by just submitting a command below.
>
>
>swift -config cf -tc.file tc -sites.file sites.xml ftdock.swift -n=1 -list=pdb.list -grid=10
>
>
>I didnt start the coaster-service but my site file was using coaster as execution-provider (in site files). Then Swift allocated some nodes and executed the job and placed the result in my home directory. (My assumption here was coaster-service and workers will be started automatically by swift). 
>
>
>
>But the above link suggests me to use persistent-coasters, changes to coaster-config files and also to start coaster-service in the head node.
>
>
>
>
>
>Basically I have three questions:
>
>
>1) What is the different between Coasters and Persistent-Coasters?
The mechanism is named Coaster.  The "persistent" part of the name refers to the workers: they persist across Swift executions.  You can re-use the same workers you started with the 'start-coaster-service' script for many Swift executions.  When not running in persistent mode (i.e. the automatic mode), the coaster service and the workers are killed before Swift comes to completion.


2) How I was able to run the swift+application without starting the coaster-service, since coaster-service is expected to be started manually (according to the above link) ? Does swift use some other mechanisms to send a job if coaster-service is not started explicitly?
So if you just say the execution provider is coaster, then Swift will start the coaster service and the workers automatically.  Swift will then shut down the coaster service and the workers once the Swift script is done executing.  Persistent-coasters will wait until you stop the service explicitly with the 'stop-coaster-service' script.  This script will shut down the service and any workers connected to it.


3) How I need to run my experiments in future with MosaStore and Swift ? Should I use  Coasters / Persistent-Coasters ?
>
I think you want to use the persistent-coaster mode where you start the workers manually with the 'start-coaster-service' script.


>
>Thank you
>Emalayan
>_______________________________________________
>Swift-devel mailing list
>Swift-devel at ci.uchicago.edu
>https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>

From wilde at mcs.anl.gov  Tue Feb 28 23:07:35 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 28 Feb 2012 23:07:35 -0600 (CST)
Subject: [Swift-devel] Problems running Swift on BG/P
In-Reply-To: <969265748.43469.1330491577144.JavaMail.root@zimbra.anl.gov>
Message-ID: <1905074140.43476.1330492055348.JavaMail.root@zimbra.anl.gov>

Emalayan and I spent a considerable amount of time debugging Swift on surveyor tonight.

As far as I can tell, after fixing a few config problems, it seems like the workers are unable to connect to the coaster service. They seem to be trying to connect on the correct address. The workers start, and produce logs, but don't seem to make connections.

I noticed the following email thread:
  http://lists.ci.uchicago.edu/pipermail/swift-devel/2010-December/007099.html

which talks about the sites attribute "alcfbgpnat" and states:
---
This code snippet may be of relevance:
if (settings.getAlcfbgpnat()) {
	spec.addEnvironmentVariable("ZOID_ENABLE_NAT", "true");
}

So you should set that env variable for the job if you want NAT.
---

Is this being done in the current start-coaster-service job? (Presumably needs to be done in the cobalt job?)

We also noticed that Emalayan was unable to follow the standard recipe for logging into the compute nodes of a running job. He could get to the IOP, but from there, got something like "no route to host" when he tried to telnet (or ping?) to the compute nodes.

I'll check on the ZOID_ENABLE_NAT setting, but any thoughts?

Thanks,

- Mike

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From jonmon at mcs.anl.gov  Tue Feb 28 23:09:28 2012
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Tue, 28 Feb 2012 23:09:28 -0600
Subject: [Swift-devel] Problems running Swift on BG/P
In-Reply-To: <1905074140.43476.1330492055348.JavaMail.root@zimbra.anl.gov>
References: <1905074140.43476.1330492055348.JavaMail.root@zimbra.anl.gov>
Message-ID: <74728418-B3CE-484A-A81D-2BBEE8199922@mcs.anl.gov>

Is the internalHostname variable being set in the sites file? It should be set to the 172.*.* address returned from ifconfig
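
If it helps, something along these lines could pull the address out of the ifconfig output. This is only a sketch: first_172_addr is a made-up helper name, the pattern is deliberately loose, and the sample text below is invented to mimic ifconfig output.

```shell
#!/bin/bash
# Hypothetical helper: print the first 172.x.x.x address found on stdin.
# The pattern is loose and only meant for ifconfig-style text.
first_172_addr() {
    grep -oE '172(\.[0-9]{1,3}){3}' | head -n 1
}

# Sample text in the style of ifconfig output (made up for illustration).
printf 'inet addr:127.0.0.1  Mask:255.0.0.0\ninet addr:172.17.3.12  Bcast:172.17.255.255\n' | first_172_addr
```

On the login node one would pipe the real output instead, e.g. `ifconfig | first_172_addr`, and check by eye that the result is the interface the workers can reach.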

On Feb 28, 2012, at 11:07 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:

> Emalayan and I spent a considerable amount of time debugging Swift on surveyor tonight.
> 
> As far as I can tell, after fixing a few config problems, it seems like the workers are unable to connect the coaster service. They seem to be trying to connect on the correct address. The workers start, and produce logs, but dont seem to make connections.
> 
> I noticed the following email thread:
>  http://lists.ci.uchicago.edu/pipermail/swift-devel/2010-December/007099.html
> 
> which talk about the sites attribute "alcfbgpnat" and state:
> ---
> This code snippet may be of relevance:
> if (settings.getAlcfbgpnat()) {
>    spec.addEnvironmentVariable("ZOID_ENABLE_NAT", "true");
> }
> 
> So you should set that env variable for the job if you want NAT.
> ---
> 
> Is this being done in the current start-coaster-service job? (Presumably needs to be done in the cobalt job?)
> 
> We also noticed that Emalayan was unable to follow the standard recipe for logging into the compute nodes of a running job. He could get to the IOP, but from there, got something like "no route to host" when he tried to telnet (or ping?) to the compute nodes.
> 
> I'll check on the ZOID_ENABLE_NAT setting, but any thoughts?
> 
> Thanks,
> 
> - Mike
> 
> -- 
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel


From wilde at mcs.anl.gov  Tue Feb 28 23:18:56 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 28 Feb 2012 23:18:56 -0600 (CST)
Subject: [Swift-devel] Problems running Swift on BG/P
In-Reply-To: <74728418-B3CE-484A-A81D-2BBEE8199922@mcs.anl.gov>
Message-ID: <597307474.43481.1330492736016.JavaMail.root@zimbra.anl.gov>

I asked Emalayan to set GLOBUS_HOSTNAME to that value.

It's not being set in the sites file.  But somehow that is getting through (I think) because the workers are trying to connect to that address.

The sites file was:

<config>
  <pool handle="persistent-coasters">
    <execution provider="coaster-persistent"
               url="http://172.17.3.12:22356"
               jobmanager="local:local"/>
    <profile namespace="globus" key="workerManager">passive</profile>
    <profile namespace="globus" key="jobsPerNode">4</profile>
    <profile key="jobThrottle" namespace="karajan">1000</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <filesystem provider="local" url="none" />
    <workdirectory>/home/emalayan/work</workdirectory>
  </pool>
</config>

I also see that start-coaster-service is trying to set ZOID_ENABLE_NAT:

  ENV="WORKER_LOGGING_LEVEL=DEBUG:ZOID_ENABLE_NAT=true"
  if [ -n $WORKER_ENVIRONMENT ]; then
     ENV+=:$WORKER_ENVIRONMENT
  fi
  set -x
  cqsub -q ${QUEUE}   \
        -k zeptoos    \
        -t ${MAXTIME} \
        -n ${NODES}   \
        -C ${PWD}/${LOG_DIR} \
        -E cobalt.${$}.stderr \
        -o cobalt.${$}.stdout \
        -e $ENV \
        $SWIFT_BIN/$WORKER $EXECUTION_URL $ID $PWD/$LOG_DIR

I'm thinking that one possibility is that without NAT enabled, the workers can't connect back to the login host's 172. network, which is a different subnet than the 172. net of the login host.

Jon, did this mechanism work for you?

Also, is it possible that somehow the ":"-separated envvars are not getting from cqsub to the job's environment? Could something have changed in cobalt in yesterday's maintenance window?
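
One detail in the script fragment above that may be worth checking: when $WORKER_ENVIRONMENT is empty and unquoted, the test `[ -n $WORKER_ENVIRONMENT ]` collapses to `[ -n ]`, which bash evaluates as true, so a trailing ":" would be appended to ENV. Whether that affects how cqsub passes the variables to the job is an open question. A minimal sketch of the quoted form, using a hypothetical function name build_worker_env:

```shell
#!/bin/bash
# Hypothetical helper; not part of start-coaster-service.
# Builds the ":"-separated environment string, appending the extra
# settings only when they are actually non-empty.
build_worker_env() {
    local worker_env="WORKER_LOGGING_LEVEL=DEBUG:ZOID_ENABLE_NAT=true"
    if [ -n "$1" ]; then             # quoted, so an empty value tests false
        worker_env+=":$1"
    fi
    printf '%s\n' "$worker_env"
}

build_worker_env ""          # no trailing ":" for an empty setting
build_worker_env "FOO=bar"   # FOO=bar is an invented example value
```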

- Mike

----- Original Message -----
> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Justin M Wozniak" <wozniak at mcs.anl.gov>, "Jonathan Monette" <jon.monette at gmail.com>, emalayan at ece.ubc.ca, "Matei
> Ripeanu" <matei at ece.ubc.ca>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Tuesday, February 28, 2012 11:09:28 PM
> Subject: Re: [Swift-devel] Problems running Swift on BG/P
> Is the internalHostname variable being set in the sites file? It
> should be set to the 172.*.* address returned from ifconfig
> 
> On Feb 28, 2012, at 11:07 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> 
> > Emalayan and I spent a considerable amount of time debugging Swift
> > on surveyor tonight.
> >
> > As far as I can tell, after fixing a few config problems, it seems
> > like the workers are unable to connect the coaster service. They
> > seem to be trying to connect on the correct address. The workers
> > start, and produce logs, but dont seem to make connections.
> >
> > I noticed the following email thread:
> >  http://lists.ci.uchicago.edu/pipermail/swift-devel/2010-December/007099.html
> >
> > which talk about the sites attribute "alcfbgpnat" and state:
> > ---
> > This code snippet may be of relevance:
> > if (settings.getAlcfbgpnat()) {
> >    spec.addEnvironmentVariable("ZOID_ENABLE_NAT", "true");
> > }
> >
> > So you should set that env variable for the job if you want NAT.
> > ---
> >
> > Is this being done in the current start-coaster-service job?
> > (Presumably needs to be done in the cobalt job?)
> >
> > We also noticed that Emalayan was unable to follow the standard
> > recipe for logging into the compute nodes of a running job. He could
> > get to the IOP, but from there, got something like "no route to
> > host" when he tried to telnet (or ping?) to the compute nodes.
> >
> > I'll check on the ZOID_ENABLE_NAT setting, but any thoughts?
> >
> > Thanks,
> >
> > - Mike
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From zhaozhang at uchicago.edu  Tue Feb 28 23:21:14 2012
From: zhaozhang at uchicago.edu (ZHAO ZHANG)
Date: Tue, 28 Feb 2012 23:21:14 -0600
Subject: [Swift-devel] Problems running Swift on BG/P
In-Reply-To: <1905074140.43476.1330492055348.JavaMail.root@zimbra.anl.gov>
References: <1905074140.43476.1330492055348.JavaMail.root@zimbra.anl.gov>
Message-ID: <4F4DB5CA.8070301@uchicago.edu>

Hi, Mike, All,

Please refer to
http://wiki.mcs.anl.gov/zeptoos/index.php/FAQ#How_to_open_a_socket_from_a_CN_to_the_outside_world
for the NAT feature of ZeptoOS.
It can be enabled on the cqsub command line. Keep in mind that, if we
use this feature, we have to start a server on the login node and let
the compute nodes connect to the server socket. Once the server socket
has accepted the connection, it can send messages back.

To access CNs from the IO node, we need to use the tree network, whose
addresses range from 192.168.1.1 to 192.168.1.64. There is an overlay
mapping between the tree network and the torus network, but I never
figured it out. We can work around the problem by logging in to one of
the compute nodes and then telnetting to the torus network address.

A simple example: we could log in to 192.168.1.64. PS: at any scale,
192.168.1.68 in the first pset is always the one with Rank 0. From
there, we could log in to 12.0.0.2, etc.

best
zhao

On 2/28/2012 11:07 PM, Michael Wilde wrote:
> Emalayan and I spent a considerable amount of time debugging Swift on surveyor tonight.
>
> As far as I can tell, after fixing a few config problems, it seems like the workers are unable to connect the coaster service. They seem to be trying to connect on the correct address. The workers start, and produce logs, but dont seem to make connections.
>
> I noticed the following email thread:
>    http://lists.ci.uchicago.edu/pipermail/swift-devel/2010-December/007099.html
>
> which talk about the sites attribute "alcfbgpnat" and state:
> ---
> This code snippet may be of relevance:
> if (settings.getAlcfbgpnat()) {
> 	spec.addEnvironmentVariable("ZOID_ENABLE_NAT", "true");
> }
>
> So you should set that env variable for the job if you want NAT.
> ---
>
> Is this being done in the current start-coaster-service job? (Presumably needs to be done in the cobalt job?)
>
> We also noticed that Emalayan was unable to follow the standard recipe for logging into the compute nodes of a running job. He could get to the IOP, but from there, got something like "no route to host" when he tried to telnet (or ping?) to the compute nodes.
>
> I'll check on the ZOID_ENABLE_NAT setting, but any thoughts?
>
> Thanks,
>
> - Mike
>


From davidk at ci.uchicago.edu  Tue Feb 28 23:34:14 2012
From: davidk at ci.uchicago.edu (David Kelly)
Date: Tue, 28 Feb 2012 23:34:14 -0600 (CST)
Subject: [Swift-devel] Making Swift run on Eureka
In-Reply-To: <596189083.41792.1330460630800.JavaMail.root@zimbra.anl.gov>
Message-ID: <1552374827.140785.1330493654834.JavaMail.root@zimbra-mb2.anl.gov>

I don't know the details of this bug, but I remember seeing this email a few months ago if it helps..

---- Original Message -----
> From: "Paul Rich" <pmrich at gmail.com>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: support at alcf.anl.gov, "Robert Jacob" <jacob at mcs.anl.gov>, "swift-devel" <swift-devel at ci.uchicago.edu>, "Andrew
> Cherry" <acherry at alcf.anl.gov>
> Sent: Wednesday, June 22, 2011 2:39:02 PM
> Subject: Re: [Swift-devel] [alcf-support #60887] Can Cobalt command-line bug on Eureka be fixed?
> Michael,
> 
> I wanted to let you know that a recent patch to Cobalt on Eureka
> should allow you to pass command-line arguments into the program
> supplied to the Cobalt job. Let us know if you encounter any further
> difficulties, and I am sorry that this took so long to deploy.
> 
> Thank you for your patience,
> 
> --
> Paul Rich
> ALCF Operations -- AIG
> richp at alcf.anl.gov
> 
> 
> ----- Original Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "Paul M. Rich" <richp at alcf.anl.gov>, "Andrew Cherry"
> <acherry at alcf.anl.gov>
> Cc: "swift-devel" <swift-devel at ci.uchicago.edu>, "Robert Jacob"
> <jacob at mcs.anl.gov>, support at alcf.anl.gov
> Sent: Tuesday, January 11, 2011 7:30:30 PM
> Subject: Re: [alcf-support #60887] Can Cobalt command-line bug on
> Eureka be fixed?
> 
> Paul, Andrew,
> 
> What I think we're going to do on this from the Swift side is
> temporarily try to use Eureka in a mode where we manually start Swift
> workers on the cluster using a batch job.
> 
> We'll wait on testing the Swift Cobalt interface (which is different
> than the above) until we hear from you that the bug is fixed and ready
> for testing.
> 
> So even though it may be many weeks or more away, we'd like to put in
> our vote for fixing this issue (realizing that you have many other
> priorities :)
> 
> Thanks,
> 
> Mike


From svemalayan at yahoo.com  Tue Feb 28 23:44:08 2012
From: svemalayan at yahoo.com (Emalayan Vairavanathan)
Date: Tue, 28 Feb 2012 21:44:08 -0800 (PST)
Subject: [Swift-devel] WORKER_INIT_CMD - with log file
In-Reply-To: <1330043315.75850.YahooMailNeo@web39506.mail.mud.yahoo.com>
References: <EDB0A089-6A80-4C1A-B84A-510B6826C10E@mcs.anl.gov>
	<1329174515.43886.YahooMailNeo@web39505.mail.mud.yahoo.com>
	<1329771862.41926.YahooMailNeo@web39501.mail.mud.yahoo.com>
	<1329772390.14447.YahooMailNeo@web39504.mail.mud.yahoo.com>
	<8742CFD8-8C54-4D67-A0D0-945BB6AF9948@mcs.anl.gov>
	<1329781225.86597.YahooMailNeo@web39505.mail.mud.yahoo.com>
	<AF473C97-D9E0-466E-8AC5-A1C7955C37CA@mcs.anl.gov>
	<alpine.DEB.2.02.1202211439030.4947@wozniak-laptop-u>
	<1330041782.64906.YahooMailNeo@web39506.mail.mud.yahoo.com>
	<alpine.DEB.2.02.1202231806390.4947@wozniak-laptop-u>
	<1330043315.75850.YahooMailNeo@web39506.mail.mud.yahoo.com>
Message-ID: <1330494248.2990.YahooMailNeo@web39501.mail.mud.yahoo.com>

Hi All,

Today I tried to run the 001-catsn-surveyor.swift script with Mike's help. But I am still facing some issues, and it would be great if you could shed some light on this.

Brief overview about what I am doing:
I am trying to run a simple swift script (001-catsn-surveyor.swift) with persistent-coasters. The goal is to try out more complex applications such as Montage and ModFTDock, and ultimately to integrate MosaStore+Swift+Applications.


Current Problem: Workers could not connect to coaster-service with 001-catsn-surveyor.swift.

Steps taken:

I was doing the steps below (on Surveyor, with the Swift available in ~/wozniak/Public/swift/bin/swift).


1) Set the port numbers, IP address and number of nodes in coaster-service.conf

   (LOCAL_PORT=22346, SERVICE_PORT=22356, IPADDR=172.17.3.12 and NODES=3)

2) Set environment variable GLOBUS_HOSTNAME=172.17.3.12

3) Launched coaster-service from 172.17.3.12 (inside ~/emalayan/app/swift-test folder)


4) Launched 001-catsn-surveyor.swift using run.sh from 172.17.3.12 (inside ~/emalayan/app/swift-test folder)
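
A quick sanity check for these settings could be sketched as below. It assumes that the url attribute in the sites file is expected to carry IPADDR and SERVICE_PORT, which the sites file quoted elsewhere in this thread suggests but which I have not confirmed. The helper names expected_url and sites_url are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helpers for a sanity check; names are illustrative only.

# Build the URL expected in the sites file from IPADDR and SERVICE_PORT
# (assumption: the url attribute carries exactly these two values).
expected_url() {
    printf 'http://%s:%s\n' "$1" "$2"
}

# Extract the first url="http://..." attribute from sites-file text on stdin.
sites_url() {
    sed -n 's/.*url="\(http:[^"]*\)".*/\1/p' | head -n 1
}

want=$(expected_url 172.17.3.12 22356)
got=$(printf '%s\n' '<execution provider="coaster-persistent" url="http://172.17.3.12:22356" jobmanager="local:local"/>' | sites_url)
if [ "$want" = "$got" ]; then
    echo "sites url matches"
else
    echo "mismatch: $got vs $want"
fi
```

Against a real sites file the input would come from the file itself, e.g. `sites_url < sites.xml`.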

But 001-catsn-surveyor.swift did not make any progress. Using cqstat, I observed that the nodes were allocated and running. When I checked the worker log files, I found that the workers were unable to connect to the coaster-service.

Then I tried to connect to the compute nodes to see whether the workers were actually running there. But I could not connect to the compute nodes from the IO node.

I repeated the same steps with NODES=64 and NODES=1024 to see whether this problem (inability to connect to the coaster-service) is coupled to the number of nodes set in coaster-service.conf (which was initially 3). But I observed the same behavior.


To find out whether this is caused by a network configuration issue on Surveyor, I tried to run ModFTDock+Swift (available in ~/emalayan/app/forEmalayan_ccGrdid) with coasters. It ran successfully, and I was also able to connect to the compute nodes without any issues during the application run.


You can find the  001-catsn-surveyor.swift script, config files and log files inside ~/emalayan/app/swift-test and ~/emalayan/app/swift-test/log folders.

I highly appreciate your input. Please let me know if you have questions.

Thank you
Emalayan


From: Emalayan Vairavanathan <svemalayan at yahoo.com>
To: Justin M Wozniak <wozniak at mcs.anl.gov> 
Cc: Jonathan Monette <jonmon at mcs.anl.gov>; swift user <swift-user at ci.uchicago.edu>; matei <matei at ece.ubc.ca> 
Sent: Thursday, 23 February 2012 4:28 PM
Subject: Re: WORKER_INIT_CMD - with log file
 

Thank you. I was not aware about that.

Now I am getting the error below. Am I missing some configurations ?

Thank you
Emalayan


Swift trunk swift-r5662 (swift modified locally) cog-r3361 (cog modified locally)

RunID: 20120224-0021-jtfvfc90
Progress:  time: Fri, 24 Feb 2012 00:21:32 +0000
Find: http://172.17.3.12:12346
Find:  keepalive(120), reconnect - http://172.17.3.12:12346
Failed to transfer wrapper log for job cat-leiykink
Failed to transfer wrapper log for job cat-qeiykink
Failed to transfer wrapper log for job cat-jeiykink
Failed to transfer wrapper log for job cat-meiykink
Failed to transfer wrapper log for job cat-oeiykink
Failed to transfer wrapper log for job cat-peiykink
Failed to transfer wrapper log for job cat-keiykink
Failed to transfer wrapper log for job cat-seiykink
Failed to transfer wrapper log for job cat-neiykink
Progress:  time: Fri, 24 Feb 2012 00:21:33 +0000  Stage in:3  Submitting:5 Failed but can retry:2
Failed to transfer wrapper log for job cat-reiykink
Failed to transfer wrapper log for job cat-3fiykink
Failed to transfer wrapper log for job cat-7fiykink
Failed to transfer wrapper log for job cat-8fiykink
Failed to transfer wrapper log for job cat-bfiykink
Failed to transfer wrapper log for job cat-zeiykink
Failed to transfer wrapper log for job cat-0fiykink
Failed to transfer wrapper log for job cat-afiykink
Failed to transfer wrapper log for job cat-yeiykink
Failed to transfer wrapper log for job cat-1fiykink
Failed to transfer wrapper log for job cat-dfiykink
Failed to transfer wrapper log for job cat-ffiykink
EXCEPTION Exception in cat:
Arguments: [data.txt]
Host: persistent-coasters
Directory: 001-catsn-surveyor-20120224-0021-jtfvfc90/jobs/f/cat-ffiykink
stderr.txt: 

stdout.txt: 

----

Caused by: Task failed: null
org.globus.cog.karajan.workflow.service.channels.ChannelException: Channel died and no contact available
	at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:234)
	at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:256)
	at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226)
	at org.globus.cog.abstraction.coaster.service.job.manager.Node.getChannel(Node.java:132)
	at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.submit(Cpu.java:258)
	at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launchSequential(Cpu.java:213)
	at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launch(Cpu.java:199)
	at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.pull(Cpu.java:169)
	at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:114)


Execution failed:
	Failed to transfer wrapper log for job cat-mfiykink
EXCEPTION Exception in cat:
Arguments: [data.txt]
Host: persistent-coasters
Directory: 001-catsn-surveyor-20120224-0021-jtfvfc90/jobs/m/cat-mfiykink
stderr.txt: 

stdout.txt: 

----


________________________________
 From: Justin M Wozniak <wozniak at mcs.anl.gov>
To: Emalayan Vairavanathan <svemalayan at yahoo.com> 
Cc: Jonathan Monette <jonmon at mcs.anl.gov>; swift user <swift-user at ci.uchicago.edu>; matei <matei at ece.ubc.ca> 
Sent: Thursday, 23 February 2012 4:08 PM
Subject: Re: WORKER_INIT_CMD - with log file
 

Do the pool names agree?  You may want to check that you are using the
right tc file.  You might need tc.persistent.data .
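
A rough way to check that the pool names agree is to list which sites in the tc file actually carry an entry for the application, and compare them with the site name Swift is submitting to (persistent-coasters in the run above). The sketch below is Python and assumes the whitespace-separated tc.data layout with the site in the first column and the application name in the second; the function name and the sample catalog text are made up for illustration.

```python
def sites_for_app(tc_text, app):
    """Return the site (pool) names that have a tc.data entry for `app`.

    Assumes one entry per line, whitespace-separated, with the site name
    in column 1 and the application name in column 2.
    """
    sites = []
    for line in tc_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[1] == app:
            sites.append(fields[0])
    return sites


# Hypothetical catalog contents, for illustration only.
example_tc = """
# site      app   path      status     platform        profile
localhost   cat   /bin/cat  INSTALLED  INTEL32::LINUX  null
"""

print(sites_for_app(example_tc, "cat"))  # prints ['localhost']
```

If the site being used does not appear in the returned list, the catalog has no entry for that application on that site, and a different tc file (or an added entry) is probably needed.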

On Thu, 23 Feb 2012, Emalayan Vairavanathan wrote:

> Hi Justin,
>
> I copied swift-test from ~wozniak/Public/swift-test and tried to run it. I followed the steps below.
>
> - Modified the coaster-service-conf
>
> - Started the coaster service.
> - Started swift
>
>
> The tc.data has entries for cat but I am getting the error below. Do you have any ideas ?
>
>
> Thank you
> Emalayan
>
> emalayan at login2.surveyor:~/swift-test> run.sh 
>
> Swift trunk swift-r5662 (swift modified locally) cog-r3361 (cog modified locally)
>
> RunID: 20120224-0000-rec5hd4a
> Progress:  time: Fri, 24 Feb 2012 00:00:45 +0000
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> Execution failed:
> 	EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> The application "cat" is not available in the given site/pool in your tc.data catalog
>
>
>
>
>
> ________________________________
> From: Justin M Wozniak <wozniak at mcs.anl.gov>
> To: Jonathan Monette <jonmon at mcs.anl.gov> 
> Cc: Emalayan Vairavanathan <svemalayan at yahoo.com>; matei <matei at ece.ubc.ca> 
> Sent: Tuesday, 21 February 2012 12:40 PM
> Subject: Re: WORKER_INIT_CMD - with log file
> 
> Hi guys
>
> Yes, that version is based on trunk and is up-to-date with
> WORKER_INIT_CMD.  The recent bug fix for the BG/P is in there, I just
> tested it.
>
> I moved the location to ~wozniak/Public/swift .  The test case I used is
> in ~wozniak/Public/swift-test .  Both should be readable.
>
>     Justin
>
> On Mon, 20 Feb 2012, Jonathan Monette wrote:
>
>> It might... that is a question that Justin can answer.  If it doesn't I am sure the feature can be quickly added.
>>
>> On Feb 20, 2012, at 5:40 PM, Emalayan Vairavanathan <svemalayan at yahoo.com> wrote:
>>
>>> Hi Jon,
>>>
>>> I didn't try with the swift version available in Justin's home directory. I can try it now and let you know.
>>>
>>> But just a quick question: Does this version have WORKER_INIT_CMD?
>>>
>>> Thank you
>>> Emalayan
>>>
>>> From: Jonathan Monette <jonmon at mcs.anl.gov>
>>> To: Emalayan Vairavanathan <svemalayan at yahoo.com>
>>> Cc: Justin Wozniak <wozniak at mcs.anl.gov>; matei <matei at ece.ubc.ca>
>>> Sent: Monday, 20 February 2012 3:35 PM
>>> Subject: Re: WORKER_INIT_CMD - with log file
>>>
>>> I suggest using the swift version in Justin's directory. This is the stable version for the BG/P. If you are already using it, then let me debug further.
>>>
>>> What login host were you running on, login1 or login2?
>>>
>>> On Feb 20, 2012, at 3:13 PM, Emalayan Vairavanathan <svemalayan at yahoo.com> wrote:
>>>
>>>>
>>>> Hi Jon and Justin,
>>>>
>>>> I checked out the swift code from trunk and tried to see whether ModFTDock+Swift works on Surveyor (without MosaStore). But the job did not complete for a long time.
>>>>
>>>> Could you please have a look ?
>>>>
>>>> (As far as I can remember, last time there was a bug when I tried to launch swift on Surveyor. Justin fixed the bug and asked me to use the swift executables from his home directory. Maybe this fix is not available in trunk?)
>>>>
>>>> Thank you
>>>> Emalayan
>>>>
>>>> From: Emalayan Vairavanathan <svemalayan at yahoo.com>
>>>> To: Jonathan Monette <jonmon at mcs.anl.gov>; "emalayan at ece.ubc.ca" <emalayan at ece.ubc.ca>
>>>> Cc: Justin Wozniak <wozniak at mcs.anl.gov>; matei <matei at ece.ubc.ca>; MosaStore <mosastore at googlegroups.com>
>>>> Sent: Monday, 13 February 2012 3:08 PM
>>>> Subject: Re: WORKER_INIT_CMD
>>>>
>>>> Thank you very much Jon. I will ask you if I have questions.
>>>>
>>>> Regards
>>>> Emalayan
>>>>
>>>> From: Jonathan Monette <jonmon at mcs.anl.gov>
>>>> To: emalayan at ece.ubc.ca
>>>> Cc: Justin Wozniak <wozniak at mcs.anl.gov>
>>>> Sent: Sunday, 12 February 2012 4:50 PM
>>>> Subject: WORKER_INIT_CMD
>>>>
>>>> Emalayan,
>>>>    We have now added an environment variable to the worker script.  The variable is called WORKER_INIT_CMD and works like so:
>>>>
>>>> export WORKER_INIT_CMD=<path/to/script.to.run>
>>>>
>>>> The worker will then run this script before entering its main loop, which waits for Swift apps to run.  You have to use manual coasters to use this variable, which I believe you are already doing.
>>>>
>>>> Let me know if you have any questions about this env variable.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> <modftdock-log.tar.gz>
>>>
>>>
>>
>
> -- 
> Justin M Wozniak

-- 
Justin M Wozniak
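
For reference, the WORKER_INIT_CMD behavior described at the start of this thread (run a user-supplied script, then enter the main loop) can be pictured roughly as follows. This is a Python sketch of the idea only, not the actual worker script; the function names and the use of a shell to run the command are assumptions.

```python
import os
import subprocess


def run_worker_init(environ=os.environ):
    """Run the command named by WORKER_INIT_CMD, if set.

    Returns the command's exit status, or None when the variable is
    unset or empty.
    """
    cmd = environ.get("WORKER_INIT_CMD")
    if not cmd:
        return None
    # Assumption: the value is handed to a shell, so it may carry arguments.
    return subprocess.call(cmd, shell=True)


def worker_main(wait_for_jobs, environ=os.environ):
    """Run the init command once, then enter the (stand-in) main loop."""
    status = run_worker_init(environ)
    if status not in (None, 0):
        print("init command exited with status", status)
    # Stand-in for the real loop that waits for Swift apps to run.
    wait_for_jobs()
```

The point of the ordering is that whatever the init script sets up on the node is in place before the first app arrives.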

From wilde at mcs.anl.gov  Tue Feb 28 23:53:21 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 28 Feb 2012 23:53:21 -0600 (CST)
Subject: [Swift-devel] Making Swift run on Eureka
In-Reply-To: <1552374827.140785.1330493654834.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1663252148.43503.1330494801381.JavaMail.root@zimbra.anl.gov>

Thanks, David - that is indeed the mail we were looking for.

I think Jon confirmed today that 0.93 now works on Eureka with no changes.

Eureka!

:) Mike

----- Original Message -----
> From: "David Kelly" <davidk at ci.uchicago.edu>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Tuesday, February 28, 2012 11:34:14 PM
> Subject: Re: [Swift-devel] Making Swift run on Eureka
> I don't know the details of this bug, but I remember seeing this email
> a few months ago if it helps..
> 
> ---- Original Message -----
> > From: "Paul Rich" <pmrich at gmail.com>
> > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > Cc: support at alcf.anl.gov, "Robert Jacob" <jacob at mcs.anl.gov>,
> > "swift-devel" <swift-devel at ci.uchicago.edu>, "Andrew
> > Cherry" <acherry at alcf.anl.gov>
> > Sent: Wednesday, June 22, 2011 2:39:02 PM
> > Subject: Re: [Swift-devel] [alcf-support #60887] Can Cobalt
> > command-line bug on Eureka be fixed?
> > Michael,
> >
> > I wanted to let you know that a recent patch to Cobalt on Eureka
> > should allow you to pass command-line arguments into the program
> > supplied to the Cobalt job. Let us know if you encounter any further
> > difficulties, and I am sorry that this took so long to deploy.
> >
> > Thank you for your patience,
> >
> > --
> > Paul Rich
> > ALCF Operations -- AIG
> > richp at alcf.anl.gov
> >
> >
> > ----- Original Message -----
> > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > To: "Paul M. Rich" <richp at alcf.anl.gov>, "Andrew Cherry"
> > <acherry at alcf.anl.gov>
> > Cc: "swift-devel" <swift-devel at ci.uchicago.edu>, "Robert Jacob"
> > <jacob at mcs.anl.gov>, support at alcf.anl.gov
> > Sent: Tuesday, January 11, 2011 7:30:30 PM
> > Subject: Re: [alcf-support #60887] Can Cobalt command-line bug on
> > Eureka be fixed?
> >
> > Paul, Andrew,
> >
> > What I think we're going to do on this from the Swift side is
> > temporarily try to use Eureka in a mode where we manually start
> > Swift workers on the cluster using a batch job.
> >
> > We'll wait on testing the Swift Cobalt interface (which is different
> > from the above) until we hear from you that the bug is fixed and
> > ready for testing.
> >
> > So even though it may be many weeks or more away, we'd like to put
> > in our vote for fixing this issue (realizing that you have many other
> > priorities :)
> >
> > Thanks,
> >
> > Mike

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From wilde at mcs.anl.gov  Wed Feb 29 00:43:04 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 29 Feb 2012 00:43:04 -0600 (CST)
Subject: [Swift-devel] Problems running Swift on BG/P
In-Reply-To: <4F4DB5CA.8070301@uchicago.edu>
Message-ID: <1291594982.43559.1330497784865.JavaMail.root@zimbra.anl.gov>

Thanks, Zhao.  In this case we are using start-coaster-service, which does start a service on the login nodes.  It's a procedure that has been tested and has worked for Justin.  But it's failing for Emalayan, and I think Jon just verified that it is failing for him as well. This script does set ZOID_ENABLE_NAT via the cqsub -e option.

I've just verified, in at least a simple cqsub modeled on what start-coaster-service uses, that with ZOID_ENABLE_NAT=true I am able to ping the login host, and with that variable not set, I cannot.  I also tested with that variable set between two other variable settings, separated by colons as it is in start-coaster-service, and NAT still works:

/usr/bin/cqsub.py -q default -p MTCScienceApps -k zeptoos -t 60 -n 1 -C /home/wilde -E cobalt.17074.stderr -o cobalt.17074.stdout -e WORKER_LOGGING_LEVEL=debug:ZOID_ENABLE_NAT=true:WORKER_INIT_CMD=/home/wilde/bin/worker-init.pl /bin/ping -c 5 172.17.3.12
Command: '/bgsys/drivers/ppcfloor/bin/mpirun' '-host' '172.17.3.1' '-np' '1' '-partition' 'ANL-R00-M1-N02-64' '-mode' 'smp' '-cwd' '/home/wilde' '-exe' '/bin/ping' '-args' '-c 5 172.17.3.12' '-env' 'COBALT_JOBID=273236 WORKER_LOGGING_LEVEL=debug WORKER_INIT_CMD=/home/wilde/bin/worker-init.pl ZOID_ENABLE_NAT=true'

So the behavior we are seeing suggests that somehow in Emalayan's tests, the ZOID_ENABLE_NAT setting is not getting through.
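
One quick way to check a captured cqsub line for this is to split its -e value and look for the variable. The Python sketch below assumes the NAME=value:NAME=value form shown in the command above; the function names are hypothetical, and a value that itself contains a colon would be split incorrectly.

```python
def parse_cqsub_env(option):
    """Split a cqsub -e value of the form NAME=value:NAME=value into a dict."""
    env = {}
    for item in option.split(":"):
        if not item:
            continue
        name, sep, value = item.partition("=")
        # Ignore fragments that are not NAME=value pairs.
        if sep:
            env[name] = value
    return env


def nat_enabled(option):
    """True if the -e value sets ZOID_ENABLE_NAT to 'true'."""
    return parse_cqsub_env(option).get("ZOID_ENABLE_NAT") == "true"


# The -e value from the test command above.
opt = ("WORKER_LOGGING_LEVEL=debug:ZOID_ENABLE_NAT=true:"
       "WORKER_INIT_CMD=/home/wilde/bin/worker-init.pl")
print(nat_enabled(opt))  # prints True
```

A misspelled or missing variable name makes nat_enabled return False, which is the case we want to rule out first.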

Next I think we need to re-create the problem using the exact scripts, environment, conf, etc. that Emalayan is using, and then debug it from there, ideally capturing the cqsub command it uses and testing with just that to start with.

Jon said he will do this in the morning, and I think we can nail the problem then.

- Mike
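
Ping only shows that the login host is reachable. A TCP connect to the coaster service port is closer to what the workers actually need, so a small probe like the one below could be run from a compute node in place of /bin/ping. This is a Python sketch under the assumption that Python is available there; the function name is hypothetical.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For the failing run, the call would be can_connect("172.17.3.12", 12346), using the address and port from the "Find:" lines in Emalayan's log.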


----- Original Message -----
> From: "ZHAO ZHANG" <zhaozhang at uchicago.edu>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Justin M Wozniak" <wozniak at mcs.anl.gov>, "Jonathan Monette" <jon.monette at gmail.com>, emalayan at ece.ubc.ca, "Matei
> Ripeanu" <matei at ece.ubc.ca>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Tuesday, February 28, 2012 11:21:14 PM
> Subject: Re: [Swift-devel] Problems running Swift on BG/P
> Hi, Mike, All,
> 
> Please refer to
> http://wiki.mcs.anl.gov/zeptoos/index.php/FAQ#How_to_open_a_socket_from_a_CN_to_the_outside_world
> for the NAT feature of ZeptoOS.
> It can be enabled on the cqsub command line. Keep in mind that, if
> we use this feature, we have to start a server on the login node and
> let the compute nodes connect to the server socket. Once the server
> socket has the connection, it can send messages back.
> 
> To access CNs from the IO node, we need to use the tree network, which
> ranges from 192.168.1.1 to 192.168.1.64. There is an overlay mapping
> between the tree network and the torus network, but I never figured it
> out. We can work around the problem by logging in to one of the
> compute nodes and then telnetting to the torus network address.
> 
> A simple example: we could log in to 192.168.1.64. PS: at any scale,
> 192.168.1.68 in the first pset is always the one with Rank 0. From
> there, we could log in to 12.0.0.2, etc.
> 
> best
> zhao
> 
> On 2/28/2012 11:07 PM, Michael Wilde wrote:
> > Emalayan and I spent a considerable amount of time debugging Swift
> > on surveyor tonight.
> >
> > As far as I can tell, after fixing a few config problems, it seems
> > like the workers are unable to connect to the coaster service. They
> > seem to be trying to connect on the correct address. The workers
> > start, and produce logs, but don't seem to make connections.
> >
> > I noticed the following email thread:
> >    http://lists.ci.uchicago.edu/pipermail/swift-devel/2010-December/007099.html
> >
> > which talks about the sites attribute "alcfbgpnat" and states:
> > ---
> > This code snippet may be of relevance:
> > if (settings.getAlcfbgpnat()) {
> > 	spec.addEnvironmentVariable("ZOID_ENABLE_NAT", "true");
> > }
> >
> > So you should set that env variable for the job if you want NAT.
> > ---
> >
> > Is this being done in the current start-coaster-service job?
> > (Presumably needs to be done in the cobalt job?)
> >
> > We also noticed that Emalayan was unable to follow the standard
> > recipe for logging into the compute nodes of a running job. He could
> > get to the IOP, but from there, got something like "no route to
> > host" when he tried to telnet (or ping?) to the compute nodes.
> >
> > I'll check on the ZOID_ENABLE_NAT setting, but any thoughts?
> >
> > Thanks,
> >
> > - Mike
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From ketancmaheshwari at gmail.com  Wed Feb 29 22:19:51 2012
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Wed, 29 Feb 2012 22:19:51 -0600
Subject: [Swift-devel] visualize your code as it executes
Message-ID: <CAMUuvirpw-uKTQbGv6YAEmGRUp9OYMqanvBPRhuJ8L0vi9JXeA@mail.gmail.com>

This is a nice page that visualizes code as it runs:

http://people.csail.mit.edu/pgbovine/python/tutor.html#mode=edit

Relevant to the try Swift online venture.

(from google+ python stream)


-- 
Ketan

From wilde at mcs.anl.gov  Wed Feb 29 22:54:45 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 29 Feb 2012 22:54:45 -0600 (CST)
Subject: [Swift-devel] visualize your code as it executes
In-Reply-To: <CAMUuvirpw-uKTQbGv6YAEmGRUp9OYMqanvBPRhuJ8L0vi9JXeA@mail.gmail.com>
Message-ID: <611892001.47880.1330577685077.JavaMail.root@zimbra.anl.gov>

Very nice!  I think it's also relevant to Swift documentation and to understanding ExM Swift/Turbine semantics.

- Mike

----- Original Message -----
> From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> To: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Wednesday, February 29, 2012 10:19:51 PM
> Subject: [Swift-devel] visualize your code as it executes
> This is a nice page that visualizes code as it runs:
> 
> 
> http://people.csail.mit.edu/pgbovine/python/tutor.html#mode=edit
> 
> 
> Relevant to the try Swift online venture.
> 
> 
> (from google+ python stream)
> 
> 
> 
> --
> Ketan
> 
> 
> 

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory