From jon.monette at gmail.com  Tue Aug  3 21:12:52 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Tue, 03 Aug 2010 21:12:52 -0500
Subject: [Swift-user] Montage wrapper error
Message-ID: <4C58CCA4.7030805@gmail.com>

Hello,
     Has anyone ever ran into this error:

     Failed to transfer wrapper log from 
m101_montage-20100803-2101-4ihqvdv9/info/s on teraport
Execution failed:
     Exception in mProjectPP:
Arguments: [-X, raw_dir/2mass-atlas-990524n-j0320044.fits, 
proj_dir/proj_2mass-atlas-990524n-j0320044.fits, template.hdr]
Host: teraport
Directory: m101_montage-20100803-2101-4ihqvdv9/jobs/s/mProjectPP-sz57orvj
stderr.txt:

stdout.txt:

----

Caused by:
     
org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: 
Cannot determine the existence of the file
Caused by:
     The connection has been closed [Unnamed Channel]
Cleaning up...
Shutting down service at https://128.135.125.117:52276
Got channel MetaChannel: 1867624887[1293086287: {}] -> 
GSSSChannel-11234326669(1)[1293086287: {}]
+ Done

I am testing my wrappers to Montage on a larger scale and I keep getting 
this error.  There is about 640 images but it only projects about 142 
images before this error pops up.  If my run will help my run exists at 
"/home/jonmon/Workspace/Swift/Montage/m101_j_4x4/runs/m101_montage_Aug-03-2010_21-01-09" 
on the ci machines.  Any help is much appreciated.

-- 
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein


From hategan at mcs.anl.gov  Tue Aug  3 22:40:17 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Aug 2010 22:40:17 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <4C58CCA4.7030805@gmail.com>
References: <4C58CCA4.7030805@gmail.com>
Message-ID: <1280893217.28553.7.camel@blabla2.none>

On the remote site there should be something called
~/.globus/coasters/coasters.log. It tends to contain useful information.
As usual, the swift log also tends to contain useful information.

However, Mike has mentioned some problems when using the coaster
filesystem provider. In the effort to implement provider staging for
coasters, that may have broke. Is that what you are using? (i.e. post
sites.xml).

Mihael

On Tue, 2010-08-03 at 21:12 -0500, Jonathan Monette wrote:
> Hello,
>      Has anyone ever ran into this error:
> 
>      Failed to transfer wrapper log from 
> m101_montage-20100803-2101-4ihqvdv9/info/s on teraport
> Execution failed:
>      Exception in mProjectPP:
> Arguments: [-X, raw_dir/2mass-atlas-990524n-j0320044.fits, 
> proj_dir/proj_2mass-atlas-990524n-j0320044.fits, template.hdr]
> Host: teraport
> Directory: m101_montage-20100803-2101-4ihqvdv9/jobs/s/mProjectPP-sz57orvj
> stderr.txt:
> 
> stdout.txt:
> 
> ----
> 
> Caused by:
>      
> org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: 
> Cannot determine the existence of the file
> Caused by:
>      The connection has been closed [Unnamed Channel]
> Cleaning up...
> Shutting down service at https://128.135.125.117:52276
> Got channel MetaChannel: 1867624887[1293086287: {}] -> 
> GSSSChannel-11234326669(1)[1293086287: {}]
> + Done
> 
> I am testing my wrappers to Montage on a larger scale and I keep getting 
> this error.  There is about 640 images but it only projects about 142 
> images before this error pops up.  If my run will help my run exists at 
> "/home/jonmon/Workspace/Swift/Montage/m101_j_4x4/runs/m101_montage_Aug-03-2010_21-01-09" 
> on the ci machines.  Any help is much appreciated.
> 


From jon.monette at gmail.com  Tue Aug  3 22:49:56 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Tue, 03 Aug 2010 22:49:56 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <1280893492.28553.13.camel@blabla2.none>
References: <4C58CCA4.7030805@gmail.com>	
	<1280893217.28553.7.camel@blabla2.none>
	<4C58E15D.3040705@gmail.com>
	<1280893492.28553.13.camel@blabla2.none>
Message-ID: <4C58E364.7040207@gmail.com>

to use the coaster filesystem i should use local:pbs in the jobmanager?

On 8/3/10 10:44 PM, Mihael Hategan wrote:
> Ok. Maybe not. Looks like an SSH issue. Can you try the coaster fs
> provider instead?
>
> On Tue, 2010-08-03 at 22:41 -0500, Jonathan Monette wrote:
>    
>> <pool handle="localhost">
>> <filesystem provider="local" />
>> <execution provider="local" />
>> <workdirectory>/home/jonmon/Library/Swift/work/localhost</workdirectory>
>> <profile namespace="karajan" key="jobThrottle">.05</profile>
>> </pool>
>>
>> <pool handle="teraport">
>> <execution provider="coaster" url="tp-login2.ci.uchicago.edu"
>> jobmanager="ssh:pbs" />
>> <profile namespace="globus" key="maxtime">3000</profile>
>> <profile namespace="globus" key="workersPerNode">8</profile>
>> <profile namespace="globus" key="slots">1</profile>
>> <profile namespace="globus" key="nodeGranularity">1</profile>
>> <profile namespace="globus" key="maxNodes">10</profile>
>> <profile namespace="globus" key="queue">short</profile>
>> <profile namespace="karajan" key="jobThrottle">0.7</profile>
>> <profile namespace="karajan" key="initialScore">10000</profile>
>> <filesystem provider="ssh" url="tp-login2.ci.uchicago.edu"/>
>> <workdirectory>/home/jonmon/Library/swift/work/teraport</workdirectory>
>> </pool>
>>
>> This is my sites file
>>
>> On 8/3/10 10:40 PM, Mihael Hategan wrote:
>>      
>>> On the remote site there should be something called
>>> ~/.globus/coasters/coasters.log. It tends to contain useful information.
>>> As usual, the swift log also tends to contain useful information.
>>>
>>> However, Mike has mentioned some problems when using the coaster
>>> filesystem provider. In the effort to implement provider staging for
>>> coasters, that may have broke. Is that what you are using? (i.e. post
>>> sites.xml).
>>>
>>> Mihael
>>>
>>> On Tue, 2010-08-03 at 21:12 -0500, Jonathan Monette wrote:
>>>
>>>        
>>>> Hello,
>>>>        Has anyone ever ran into this error:
>>>>
>>>>        Failed to transfer wrapper log from
>>>> m101_montage-20100803-2101-4ihqvdv9/info/s on teraport
>>>> Execution failed:
>>>>        Exception in mProjectPP:
>>>> Arguments: [-X, raw_dir/2mass-atlas-990524n-j0320044.fits,
>>>> proj_dir/proj_2mass-atlas-990524n-j0320044.fits, template.hdr]
>>>> Host: teraport
>>>> Directory: m101_montage-20100803-2101-4ihqvdv9/jobs/s/mProjectPP-sz57orvj
>>>> stderr.txt:
>>>>
>>>> stdout.txt:
>>>>
>>>> ----
>>>>
>>>> Caused by:
>>>>
>>>> org.globus.cog.abstraction.impl.file.IrrecoverableResourceException:
>>>> Cannot determine the existence of the file
>>>> Caused by:
>>>>        The connection has been closed [Unnamed Channel]
>>>> Cleaning up...
>>>> Shutting down service at https://128.135.125.117:52276
>>>> Got channel MetaChannel: 1867624887[1293086287: {}] ->
>>>> GSSSChannel-11234326669(1)[1293086287: {}]
>>>> + Done
>>>>
>>>> I am testing my wrappers to Montage on a larger scale and I keep getting
>>>> this error.  There is about 640 images but it only projects about 142
>>>> images before this error pops up.  If my run will help my run exists at
>>>> "/home/jonmon/Workspace/Swift/Montage/m101_j_4x4/runs/m101_montage_Aug-03-2010_21-01-09"
>>>> on the ci machines.  Any help is much appreciated.
>>>>
>>>>
>>>>          
>>>
>>>        
>>      
>
>    

-- 
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein


From hategan at mcs.anl.gov  Tue Aug  3 22:55:01 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Aug 2010 22:55:01 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <4C58E364.7040207@gmail.com>
References: <4C58CCA4.7030805@gmail.com>
	<1280893217.28553.7.camel@blabla2.none>  <4C58E15D.3040705@gmail.com>
	<1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com>
Message-ID: <1280894101.28882.1.camel@blabla2.none>

<filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu" />

On Tue, 2010-08-03 at 22:49 -0500, Jonathan Monette wrote:
> to use the coaster filesystem i should use local:pbs in the jobmanager?
> 
> On 8/3/10 10:44 PM, Mihael Hategan wrote:
> > Ok. Maybe not. Looks like an SSH issue. Can you try the coaster fs
> > provider instead?
> >
> > On Tue, 2010-08-03 at 22:41 -0500, Jonathan Monette wrote:
> >    
> >> <pool handle="localhost">
> >> <filesystem provider="local" />
> >> <execution provider="local" />
> >> <workdirectory>/home/jonmon/Library/Swift/work/localhost</workdirectory>
> >> <profile namespace="karajan" key="jobThrottle">.05</profile>
> >> </pool>
> >>
> >> <pool handle="teraport">
> >> <execution provider="coaster" url="tp-login2.ci.uchicago.edu"
> >> jobmanager="ssh:pbs" />
> >> <profile namespace="globus" key="maxtime">3000</profile>
> >> <profile namespace="globus" key="workersPerNode">8</profile>
> >> <profile namespace="globus" key="slots">1</profile>
> >> <profile namespace="globus" key="nodeGranularity">1</profile>
> >> <profile namespace="globus" key="maxNodes">10</profile>
> >> <profile namespace="globus" key="queue">short</profile>
> >> <profile namespace="karajan" key="jobThrottle">0.7</profile>
> >> <profile namespace="karajan" key="initialScore">10000</profile>
> >> <filesystem provider="ssh" url="tp-login2.ci.uchicago.edu"/>
> >> <workdirectory>/home/jonmon/Library/swift/work/teraport</workdirectory>
> >> </pool>
> >>
> >> This is my sites file
> >>
> >> On 8/3/10 10:40 PM, Mihael Hategan wrote:
> >>      
> >>> On the remote site there should be something called
> >>> ~/.globus/coasters/coasters.log. It tends to contain useful information.
> >>> As usual, the swift log also tends to contain useful information.
> >>>
> >>> However, Mike has mentioned some problems when using the coaster
> >>> filesystem provider. In the effort to implement provider staging for
> >>> coasters, that may have broke. Is that what you are using? (i.e. post
> >>> sites.xml).
> >>>
> >>> Mihael
> >>>
> >>> On Tue, 2010-08-03 at 21:12 -0500, Jonathan Monette wrote:
> >>>
> >>>        
> >>>> Hello,
> >>>>        Has anyone ever ran into this error:
> >>>>
> >>>>        Failed to transfer wrapper log from
> >>>> m101_montage-20100803-2101-4ihqvdv9/info/s on teraport
> >>>> Execution failed:
> >>>>        Exception in mProjectPP:
> >>>> Arguments: [-X, raw_dir/2mass-atlas-990524n-j0320044.fits,
> >>>> proj_dir/proj_2mass-atlas-990524n-j0320044.fits, template.hdr]
> >>>> Host: teraport
> >>>> Directory: m101_montage-20100803-2101-4ihqvdv9/jobs/s/mProjectPP-sz57orvj
> >>>> stderr.txt:
> >>>>
> >>>> stdout.txt:
> >>>>
> >>>> ----
> >>>>
> >>>> Caused by:
> >>>>
> >>>> org.globus.cog.abstraction.impl.file.IrrecoverableResourceException:
> >>>> Cannot determine the existence of the file
> >>>> Caused by:
> >>>>        The connection has been closed [Unnamed Channel]
> >>>> Cleaning up...
> >>>> Shutting down service at https://128.135.125.117:52276
> >>>> Got channel MetaChannel: 1867624887[1293086287: {}] ->
> >>>> GSSSChannel-11234326669(1)[1293086287: {}]
> >>>> + Done
> >>>>
> >>>> I am testing my wrappers to Montage on a larger scale and I keep getting
> >>>> this error.  There is about 640 images but it only projects about 142
> >>>> images before this error pops up.  If my run will help my run exists at
> >>>> "/home/jonmon/Workspace/Swift/Montage/m101_j_4x4/runs/m101_montage_Aug-03-2010_21-01-09"
> >>>> on the ci machines.  Any help is much appreciated.
> >>>>
> >>>>
> >>>>          
> >>>
> >>>        
> >>      
> >
> >    
> 


From jon.monette at gmail.com  Tue Aug  3 22:59:53 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Tue, 03 Aug 2010 22:59:53 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <1280894101.28882.1.camel@blabla2.none>
References: <4C58CCA4.7030805@gmail.com>	
	<1280893217.28553.7.camel@blabla2.none>
	<4C58E15D.3040705@gmail.com>	
	<1280893492.28553.13.camel@blabla2.none>
	<4C58E364.7040207@gmail.com>
	<1280894101.28882.1.camel@blabla2.none>
Message-ID: <4C58E5B9.80607@gmail.com>

<pool handle="teraport">
<execution provider="coaster" url="tp-login2.ci.uchicago.edu" 
jobmanager="ssh:pbs" />
<filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu" />
<profile namespace="globus" key="maxtime">3000</profile>
<profile namespace="globus" key="workersPerNode">8</profile>
<profile namespace="globus" key="slots">1</profile>
<profile namespace="globus" key="nodeGranularity">1</profile>
<profile namespace="globus" key="maxNodes">10</profile>
<profile namespace="globus" key="queue">short</profile>
<profile namespace="karajan" key="jobThrottle">0.7</profile>
<profile namespace="karajan" key="initialScore">10000</profile>
<filesystem provider="ssh" url="tp-login2.ci.uchicago.edu"/>
<workdirectory>/home/jonmon/Library/swift/work/teraport</workdirectory>
</pool>

here is the new sites entry.  I tried running my code and got this error.

class 
org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws 
exception in doStuff. Fix it!
java.lang.NullPointerException
     at 
org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79)
     at 
org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77)
     at 
org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36)
     at 
org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122)


On 8/3/10 10:55 PM, Mihael Hategan wrote:
> <filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu"  />
>    

-- 
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein


From hategan at mcs.anl.gov  Tue Aug  3 23:04:19 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Aug 2010 23:04:19 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <4C58E5B9.80607@gmail.com>
References: <4C58CCA4.7030805@gmail.com>
	<1280893217.28553.7.camel@blabla2.none>  <4C58E15D.3040705@gmail.com>
	<1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com>
	<1280894101.28882.1.camel@blabla2.none>  <4C58E5B9.80607@gmail.com>
Message-ID: <1280894659.29101.2.camel@blabla2.none>

Eh. I'll see if I can fix those.

Though if you are using coasters, and given that I do think I fixed the
existing issues, there is one more thing you could try:
in swift.properties, at the end, say use.provider.staging=true

Mihael

On Tue, 2010-08-03 at 22:59 -0500, Jonathan Monette wrote:
> <pool handle="teraport">
> <execution provider="coaster" url="tp-login2.ci.uchicago.edu" 
> jobmanager="ssh:pbs" />
> <filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu" />
> <profile namespace="globus" key="maxtime">3000</profile>
> <profile namespace="globus" key="workersPerNode">8</profile>
> <profile namespace="globus" key="slots">1</profile>
> <profile namespace="globus" key="nodeGranularity">1</profile>
> <profile namespace="globus" key="maxNodes">10</profile>
> <profile namespace="globus" key="queue">short</profile>
> <profile namespace="karajan" key="jobThrottle">0.7</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <filesystem provider="ssh" url="tp-login2.ci.uchicago.edu"/>
> <workdirectory>/home/jonmon/Library/swift/work/teraport</workdirectory>
> </pool>
> 
> here is the new sites entry.  I tried running my code and got this error.
> 
> class 
> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws 
> exception in doStuff. Fix it!
> java.lang.NullPointerException
>      at 
> org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79)
>      at 
> org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77)
>      at 
> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36)
>      at 
> org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122)
> 
> 
> On 8/3/10 10:55 PM, Mihael Hategan wrote:
> > <filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu"  />
> >    
> 


From jon.monette at gmail.com  Tue Aug  3 23:15:52 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Tue, 03 Aug 2010 23:15:52 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <1280894659.29101.2.camel@blabla2.none>
References: <4C58CCA4.7030805@gmail.com>	
	<1280893217.28553.7.camel@blabla2.none>
	<4C58E15D.3040705@gmail.com>	
	<1280893492.28553.13.camel@blabla2.none>
	<4C58E364.7040207@gmail.com>	
	<1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com>
	<1280894659.29101.2.camel@blabla2.none>
Message-ID: <4C58E978.5080208@gmail.com>

with use.provider.staging=true i get
Execution failed:
     Exception in mProjectPP:
Arguments: [-X, raw_dir/2mass-atlas-990214n-j1210091.fits, 
proj_dir/proj_2mass-atlas-990214n-j1210091.fits, template.hdr]
Host: teraport
Directory: 
m101_montage-20100803-2312-ex8oc5ff/jobs/y/mProjectPP-yt0gtrvjTODO: outs
----

Caused by:
     Job failed with an exit code of 254

and with that line commented out i get a script printed to the screen 
saying it doesn't know what #!/BIN?BASH

Execution failed:
     Could not initialize shared directory on teraport
Caused by:
     org.globus.cog.abstraction.impl.file.FileResourceException: 
org.globus.cog.karajan.workflow.service.ProtocolException: Unknown 
command: #!/BIN/BASH


On 8/3/10 11:04 PM, Mihael Hategan wrote:
> Eh. I'll see if I can fix those.
>
> Though if you are using coasters, and given that I do think I fixed the
> existing issues, there is one more thing you could try:
> in swift.properties, at the end, say use.provider.staging=true
>
> Mihael
>
> On Tue, 2010-08-03 at 22:59 -0500, Jonathan Monette wrote:
>    
>> <pool handle="teraport">
>> <execution provider="coaster" url="tp-login2.ci.uchicago.edu"
>> jobmanager="ssh:pbs" />
>> <filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu" />
>> <profile namespace="globus" key="maxtime">3000</profile>
>> <profile namespace="globus" key="workersPerNode">8</profile>
>> <profile namespace="globus" key="slots">1</profile>
>> <profile namespace="globus" key="nodeGranularity">1</profile>
>> <profile namespace="globus" key="maxNodes">10</profile>
>> <profile namespace="globus" key="queue">short</profile>
>> <profile namespace="karajan" key="jobThrottle">0.7</profile>
>> <profile namespace="karajan" key="initialScore">10000</profile>
>> <filesystem provider="ssh" url="tp-login2.ci.uchicago.edu"/>
>> <workdirectory>/home/jonmon/Library/swift/work/teraport</workdirectory>
>> </pool>
>>
>> here is the new sites entry.  I tried running my code and got this error.
>>
>> class
>> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws
>> exception in doStuff. Fix it!
>> java.lang.NullPointerException
>>       at
>> org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79)
>>       at
>> org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77)
>>       at
>> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36)
>>       at
>> org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122)
>>
>>
>> On 8/3/10 10:55 PM, Mihael Hategan wrote:
>>      
>>> <filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu"  />
>>>
>>>        
>>      
>
>    

-- 
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein


From hategan at mcs.anl.gov  Tue Aug  3 23:20:31 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Aug 2010 23:20:31 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <4C58E978.5080208@gmail.com>
References: <4C58CCA4.7030805@gmail.com>
	<1280893217.28553.7.camel@blabla2.none>  <4C58E15D.3040705@gmail.com>
	<1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com>
	<1280894101.28882.1.camel@blabla2.none>  <4C58E5B9.80607@gmail.com>
	<1280894659.29101.2.camel@blabla2.none>  <4C58E978.5080208@gmail.com>
Message-ID: <1280895631.29340.0.camel@blabla2.none>

Ok. I'll need to take a look at these.

On Tue, 2010-08-03 at 23:15 -0500, Jonathan Monette wrote:
> with use.provider.staging=true i get
> Execution failed:
>      Exception in mProjectPP:
> Arguments: [-X, raw_dir/2mass-atlas-990214n-j1210091.fits, 
> proj_dir/proj_2mass-atlas-990214n-j1210091.fits, template.hdr]
> Host: teraport
> Directory: 
> m101_montage-20100803-2312-ex8oc5ff/jobs/y/mProjectPP-yt0gtrvjTODO: outs
> ----
> 
> Caused by:
>      Job failed with an exit code of 254
> 
> and with that line commented out i get a script printed to the screen 
> saying it doesn't know what #!/BIN?BASH
> 
> Execution failed:
>      Could not initialize shared directory on teraport
> Caused by:
>      org.globus.cog.abstraction.impl.file.FileResourceException: 
> org.globus.cog.karajan.workflow.service.ProtocolException: Unknown 
> command: #!/BIN/BASH
> 
> 
> On 8/3/10 11:04 PM, Mihael Hategan wrote:
> > Eh. I'll see if I can fix those.
> >
> > Though if you are using coasters, and given that I do think I fixed the
> > existing issues, there is one more thing you could try:
> > in swift.properties, at the end, say use.provider.staging=true
> >
> > Mihael
> >
> > On Tue, 2010-08-03 at 22:59 -0500, Jonathan Monette wrote:
> >    
> >> <pool handle="teraport">
> >> <execution provider="coaster" url="tp-login2.ci.uchicago.edu"
> >> jobmanager="ssh:pbs" />
> >> <filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu" />
> >> <profile namespace="globus" key="maxtime">3000</profile>
> >> <profile namespace="globus" key="workersPerNode">8</profile>
> >> <profile namespace="globus" key="slots">1</profile>
> >> <profile namespace="globus" key="nodeGranularity">1</profile>
> >> <profile namespace="globus" key="maxNodes">10</profile>
> >> <profile namespace="globus" key="queue">short</profile>
> >> <profile namespace="karajan" key="jobThrottle">0.7</profile>
> >> <profile namespace="karajan" key="initialScore">10000</profile>
> >> <filesystem provider="ssh" url="tp-login2.ci.uchicago.edu"/>
> >> <workdirectory>/home/jonmon/Library/swift/work/teraport</workdirectory>
> >> </pool>
> >>
> >> here is the new sites entry.  I tried running my code and got this error.
> >>
> >> class
> >> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws
> >> exception in doStuff. Fix it!
> >> java.lang.NullPointerException
> >>       at
> >> org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79)
> >>       at
> >> org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77)
> >>       at
> >> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36)
> >>       at
> >> org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122)
> >>
> >>
> >> On 8/3/10 10:55 PM, Mihael Hategan wrote:
> >>      
> >>> <filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu"  />
> >>>
> >>>        
> >>      
> >
> >    
> 


From jon.monette at gmail.com  Tue Aug  3 23:29:22 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Tue, 03 Aug 2010 23:29:22 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <1280895631.29340.0.camel@blabla2.none>
References: <4C58CCA4.7030805@gmail.com>	
	<1280893217.28553.7.camel@blabla2.none>
	<4C58E15D.3040705@gmail.com>	
	<1280893492.28553.13.camel@blabla2.none>
	<4C58E364.7040207@gmail.com>	
	<1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com>	
	<1280894659.29101.2.camel@blabla2.none>
	<4C58E978.5080208@gmail.com>
	<1280895631.29340.0.camel@blabla2.none>
Message-ID: <4C58ECA2.1010305@gmail.com>

Ok.  Let me know if you need more of my input files or configurations.

On 8/3/10 11:20 PM, Mihael Hategan wrote:
> Ok. I'll need to take a look at these.
>
> On Tue, 2010-08-03 at 23:15 -0500, Jonathan Monette wrote:
>    
>> with use.provider.staging=true i get
>> Execution failed:
>>       Exception in mProjectPP:
>> Arguments: [-X, raw_dir/2mass-atlas-990214n-j1210091.fits,
>> proj_dir/proj_2mass-atlas-990214n-j1210091.fits, template.hdr]
>> Host: teraport
>> Directory:
>> m101_montage-20100803-2312-ex8oc5ff/jobs/y/mProjectPP-yt0gtrvjTODO: outs
>> ----
>>
>> Caused by:
>>       Job failed with an exit code of 254
>>
>> and with that line commented out i get a script printed to the screen
>> saying it doesn't know what #!/BIN?BASH
>>
>> Execution failed:
>>       Could not initialize shared directory on teraport
>> Caused by:
>>       org.globus.cog.abstraction.impl.file.FileResourceException:
>> org.globus.cog.karajan.workflow.service.ProtocolException: Unknown
>> command: #!/BIN/BASH
>>
>>
>> On 8/3/10 11:04 PM, Mihael Hategan wrote:
>>      
>>> Eh. I'll see if I can fix those.
>>>
>>> Though if you are using coasters, and given that I do think I fixed the
>>> existing issues, there is one more thing you could try:
>>> in swift.properties, at the end, say use.provider.staging=true
>>>
>>> Mihael
>>>
>>> On Tue, 2010-08-03 at 22:59 -0500, Jonathan Monette wrote:
>>>
>>>        
>>>> <pool handle="teraport">
>>>> <execution provider="coaster" url="tp-login2.ci.uchicago.edu"
>>>> jobmanager="ssh:pbs" />
>>>> <filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu" />
>>>> <profile namespace="globus" key="maxtime">3000</profile>
>>>> <profile namespace="globus" key="workersPerNode">8</profile>
>>>> <profile namespace="globus" key="slots">1</profile>
>>>> <profile namespace="globus" key="nodeGranularity">1</profile>
>>>> <profile namespace="globus" key="maxNodes">10</profile>
>>>> <profile namespace="globus" key="queue">short</profile>
>>>> <profile namespace="karajan" key="jobThrottle">0.7</profile>
>>>> <profile namespace="karajan" key="initialScore">10000</profile>
>>>> <filesystem provider="ssh" url="tp-login2.ci.uchicago.edu"/>
>>>> <workdirectory>/home/jonmon/Library/swift/work/teraport</workdirectory>
>>>> </pool>
>>>>
>>>> here is the new sites entry.  I tried running my code and got this error.
>>>>
>>>> class
>>>> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws
>>>> exception in doStuff. Fix it!
>>>> java.lang.NullPointerException
>>>>        at
>>>> org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79)
>>>>        at
>>>> org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77)
>>>>        at
>>>> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36)
>>>>        at
>>>> org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122)
>>>>
>>>>
>>>> On 8/3/10 10:55 PM, Mihael Hategan wrote:
>>>>
>>>>          
>>>>> <filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu"  />
>>>>>
>>>>>
>>>>>            
>>>>
>>>>          
>>>
>>>        
>>      
>
>    

-- 
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein


From hategan at mcs.anl.gov  Tue Aug  3 23:33:55 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Aug 2010 23:33:55 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <4C58ECA2.1010305@gmail.com>
References: <4C58CCA4.7030805@gmail.com>
	<1280893217.28553.7.camel@blabla2.none>  <4C58E15D.3040705@gmail.com>
	<1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com>
	<1280894101.28882.1.camel@blabla2.none>  <4C58E5B9.80607@gmail.com>
	<1280894659.29101.2.camel@blabla2.none>  <4C58E978.5080208@gmail.com>
	<1280895631.29340.0.camel@blabla2.none>  <4C58ECA2.1010305@gmail.com>
Message-ID: <1280896435.29512.1.camel@blabla2.none>

It may be useful for quickly reproducing it. You already know what I
need. Config files, input files, table files, and scripts if they
changed.

Mihael

On Tue, 2010-08-03 at 23:29 -0500, Jonathan Monette wrote:
> Ok.  Let me know if you need more of my input files or configurations.
> 
> On 8/3/10 11:20 PM, Mihael Hategan wrote:
> > Ok. I'll need to take a look at these.
> >
> > On Tue, 2010-08-03 at 23:15 -0500, Jonathan Monette wrote:
> >    
> >> with use.provider.staging=true i get
> >> Execution failed:
> >>       Exception in mProjectPP:
> >> Arguments: [-X, raw_dir/2mass-atlas-990214n-j1210091.fits,
> >> proj_dir/proj_2mass-atlas-990214n-j1210091.fits, template.hdr]
> >> Host: teraport
> >> Directory:
> >> m101_montage-20100803-2312-ex8oc5ff/jobs/y/mProjectPP-yt0gtrvjTODO: outs
> >> ----
> >>
> >> Caused by:
> >>       Job failed with an exit code of 254
> >>
> >> and with that line commented out i get a script printed to the screen
> >> saying it doesn't know what #!/BIN?BASH
> >>
> >> Execution failed:
> >>       Could not initialize shared directory on teraport
> >> Caused by:
> >>       org.globus.cog.abstraction.impl.file.FileResourceException:
> >> org.globus.cog.karajan.workflow.service.ProtocolException: Unknown
> >> command: #!/BIN/BASH
> >>
> >>
> >> On 8/3/10 11:04 PM, Mihael Hategan wrote:
> >>      
> >>> Eh. I'll see if I can fix those.
> >>>
> >>> Though if you are using coasters, and given that I do think I fixed the
> >>> existing issues, there is one more thing you could try:
> >>> in swift.properties, at the end, say use.provider.staging=true
> >>>
> >>> Mihael
> >>>
> >>> On Tue, 2010-08-03 at 22:59 -0500, Jonathan Monette wrote:
> >>>
> >>>        
> >>>> <pool handle="teraport">
> >>>> <execution provider="coaster" url="tp-login2.ci.uchicago.edu"
> >>>> jobmanager="ssh:pbs" />
> >>>> <filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu" />
> >>>> <profile namespace="globus" key="maxtime">3000</profile>
> >>>> <profile namespace="globus" key="workersPerNode">8</profile>
> >>>> <profile namespace="globus" key="slots">1</profile>
> >>>> <profile namespace="globus" key="nodeGranularity">1</profile>
> >>>> <profile namespace="globus" key="maxNodes">10</profile>
> >>>> <profile namespace="globus" key="queue">short</profile>
> >>>> <profile namespace="karajan" key="jobThrottle">0.7</profile>
> >>>> <profile namespace="karajan" key="initialScore">10000</profile>
> >>>> <filesystem provider="ssh" url="tp-login2.ci.uchicago.edu"/>
> >>>> <workdirectory>/home/jonmon/Library/swift/work/teraport</workdirectory>
> >>>> </pool>
> >>>>
> >>>> here is the new sites entry.  I tried running my code and got this error.
> >>>>
> >>>> class
> >>>> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws
> >>>> exception in doStuff. Fix it!
> >>>> java.lang.NullPointerException
> >>>>        at
> >>>> org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79)
> >>>>        at
> >>>> org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77)
> >>>>        at
> >>>> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36)
> >>>>        at
> >>>> org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122)
> >>>>
> >>>>
> >>>> On 8/3/10 10:55 PM, Mihael Hategan wrote:
> >>>>
> >>>>          
> >>>>> <filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu"  />
> >>>>>
> >>>>>
> >>>>>            
> >>>>
> >>>>          
> >>>
> >>>        
> >>      
> >
> >    
> 


From jon.monette at gmail.com  Tue Aug  3 23:40:37 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Tue, 03 Aug 2010 23:40:37 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <1280896435.29512.1.camel@blabla2.none>
References: <4C58CCA4.7030805@gmail.com>	
	<1280893217.28553.7.camel@blabla2.none>
	<4C58E15D.3040705@gmail.com>	
	<1280893492.28553.13.camel@blabla2.none>
	<4C58E364.7040207@gmail.com>	
	<1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com>	
	<1280894659.29101.2.camel@blabla2.none>
	<4C58E978.5080208@gmail.com>	
	<1280895631.29340.0.camel@blabla2.none>
	<4C58ECA2.1010305@gmail.com>
	<1280896435.29512.1.camel@blabla2.none>
Message-ID: <4C58EF45.3010506@gmail.com>

all files used for this run and created (log files and such) are in 
$HOME/Workspace/Swift/Montage/m101_j_4x4/runs on the ci machines.  If 
you would like I can tar up one of the runs.  Not sure which you would 
prefer.

On 8/3/10 11:33 PM, Mihael Hategan wrote:
> It may be useful for quickly reproducing it. You already know what I
> need. Config files, input files, table files, and scripts if they
> changed.
>
> Mihael
>
> On Tue, 2010-08-03 at 23:29 -0500, Jonathan Monette wrote:
>    
>> Ok.  Let me know if you need more of my input files or configurations.
>>
>> On 8/3/10 11:20 PM, Mihael Hategan wrote:
>>      
>>> Ok. I'll need to take a look at these.
>>>
>>> On Tue, 2010-08-03 at 23:15 -0500, Jonathan Monette wrote:
>>>
>>>        
>>>> with use.provider.staging=true i get
>>>> Execution failed:
>>>>        Exception in mProjectPP:
>>>> Arguments: [-X, raw_dir/2mass-atlas-990214n-j1210091.fits,
>>>> proj_dir/proj_2mass-atlas-990214n-j1210091.fits, template.hdr]
>>>> Host: teraport
>>>> Directory:
>>>> m101_montage-20100803-2312-ex8oc5ff/jobs/y/mProjectPP-yt0gtrvjTODO: outs
>>>> ----
>>>>
>>>> Caused by:
>>>>        Job failed with an exit code of 254
>>>>
>>>> and with that line commented out i get a script printed to the screen
>>>> saying it doesn't know what #!/BIN?BASH
>>>>
>>>> Execution failed:
>>>>        Could not initialize shared directory on teraport
>>>> Caused by:
>>>>        org.globus.cog.abstraction.impl.file.FileResourceException:
>>>> org.globus.cog.karajan.workflow.service.ProtocolException: Unknown
>>>> command: #!/BIN/BASH
>>>>
>>>>
>>>> On 8/3/10 11:04 PM, Mihael Hategan wrote:
>>>>
>>>>          
>>>>> Eh. I'll see if I can fix those.
>>>>>
>>>>> Though if you are using coasters, and given that I do think I fixed the
>>>>> existing issues, there is one more thing you could try:
>>>>> in swift.properties, at the end, say use.provider.staging=true
>>>>>
>>>>> Mihael
>>>>>
>>>>> On Tue, 2010-08-03 at 22:59 -0500, Jonathan Monette wrote:
>>>>>
>>>>>
>>>>>            
>>>>>> <pool handle="teraport">
>>>>>> <execution provider="coaster" url="tp-login2.ci.uchicago.edu"
>>>>>> jobmanager="ssh:pbs" />
>>>>>> <filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu" />
>>>>>> <profile namespace="globus" key="maxtime">3000</profile>
>>>>>> <profile namespace="globus" key="workersPerNode">8</profile>
>>>>>> <profile namespace="globus" key="slots">1</profile>
>>>>>> <profile namespace="globus" key="nodeGranularity">1</profile>
>>>>>> <profile namespace="globus" key="maxNodes">10</profile>
>>>>>> <profile namespace="globus" key="queue">short</profile>
>>>>>> <profile namespace="karajan" key="jobThrottle">0.7</profile>
>>>>>> <profile namespace="karajan" key="initialScore">10000</profile>
>>>>>> <filesystem provider="ssh" url="tp-login2.ci.uchicago.edu"/>
>>>>>> <workdirectory>/home/jonmon/Library/swift/work/teraport</workdirectory>
>>>>>> </pool>
>>>>>>
>>>>>> here is the new sites entry.  I tried running my code and got this error.
>>>>>>
>>>>>> class
>>>>>> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws
>>>>>> exception in doStuff. Fix it!
>>>>>> java.lang.NullPointerException
>>>>>>         at
>>>>>> org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79)
>>>>>>         at
>>>>>> org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77)
>>>>>>         at
>>>>>> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36)
>>>>>>         at
>>>>>> org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122)
>>>>>>
>>>>>>
>>>>>> On 8/3/10 10:55 PM, Mihael Hategan wrote:
>>>>>>
>>>>>>
>>>>>>              
>>>>>>> <filesystem provider="coaster" url="ssh://tp-login2.ci.uchicago.edu"  />
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>                
>>>>>>
>>>>>>              
>>>>>
>>>>>            
>>>>
>>>>          
>>>
>>>        
>>      
>
>    

-- 
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein


From iraicu at cs.uchicago.edu  Wed Aug  4 03:15:03 2010
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 04 Aug 2010 03:15:03 -0500
Subject: [Swift-user] CFP: The 3rd IEEE Workshop on Many-Task Computing on
 Grids and
 Supercomputers (MTAGS) 2010, co-located with Supercomputing 2010 -- November
 15th, 2010
Message-ID: <4C592187.6050702@cs.uchicago.edu>

Call for Papers

------------------------------------------------------------------------------------------------
The 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2010
http://dsl.cs.uchicago.edu/MTAGS10/  
------------------------------------------------------------------------------------------------
November 15th, 2010
New Orleans, Louisiana, USA

Co-located with with IEEE/ACM International Conference for 
High Performance Computing, Networking, Storage and Analysis (SC10) 

================================================================================================
The 3rd workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) will provide 
the scientific community a dedicated forum for presenting new research, development, and 
deployment efforts of large-scale many-task computing (MTC) applications on large scale 
clusters, Grids, Supercomputers, and Cloud Computing infrastructure. MTC, the theme of 
the workshop encompasses loosely coupled applications, which are generally composed of 
many tasks (both independent and dependent tasks) to achieve some larger application 
goal.  This workshop will cover challenges that can hamper efficiency and utilization in 
running applications on large-scale systems, such as local resource manager scalability 
and granularity, efficient utilization of raw hardware, parallel file system contention 
and scalability, data management, I/O management, reliability at scale, and application 
scalability. We welcome paper submissions on all topics related to MTC on large scale 
systems.  Papers will be peer-reviewed, and accepted papers will be published in the 
workshop proceedings as part of the IEEE digital library (pending approval). The workshop 
will be co-located with the IEEE/ACM Supercomputing 2010 Conference in New Orleans 
Louisiana on November 15th, 2010. For more information, please see 
http://dsl.cs.uchicago.edu/MTAGS010/.


Topics
------------------------------------------------------------------------------------------------
We invite the submission of original work that is related to the topics below. The papers can be 
either short (5 pages) position papers, or long (10 pages) research papers. Topics of interest 
include (in the context of Many-Task Computing):
* Compute Resource Management 
  * Scheduling
  * Job execution frameworks
  * Local resource manager extensions
  * Performance evaluation of resource managers in use on large scale systems
  * Dynamic resource provisioning
  * Techniques to manage many-core resources and/or GPUs
  * Challenges and opportunities in running many-task workloads on HPC systems
  * Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure
* Storage architectures and implementations
  * Distributed file systems
  * Parallel file systems
  * Distributed meta-data management 
  * Content distribution systems for large data
  * Data caching frameworks and techniques
  * Data management within and across data centers
  * Data-aware scheduling
  * Data-intensive computing applications
  * Eventual-consistency storage usage and management
* Programming models and tools
  * Map-reduce and its generalizations
  * Many-task computing middleware and applications
  * Parallel programming frameworks
  * Ensemble MPI techniques and frameworks
  * Service-oriented science applications
* Large-Scale Workflow Systems
  * Workflow system performance and scalability analysis
  * Scalability of workflow systems
  * Workflow infrastructure and e-Science middleware
  * Programming Paradigms and Models
* Large-Scale Many-Task Applications
  * High-throughput computing (HTC) applications
  * Data-intensive applications
  * Quasi-supercomputing applications, deployments, and experiences
  * Performance Evaluation
* Performance evaluation
  * Real systems
  * Simulations
  * Reliability of large systems


Paper Submission and Publication
------------------------------------------------------------------------------------------------
Authors are invited to submit papers with unpublished, original work of not more than 10 pages 
of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per IEEE 
8.5 x 11 manuscript guidelines; document templates can be found at 
ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct8.5x11.pdf and 
ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct8.5x11.doc. We are also seeking 
position papers of no more than 5 pages in length. A 250 word abstract (PDF format) must be 
submitted online at https://cmt.research.microsoft.com/MTAGS2010/ before the deadline of August 
25th, 2010 at 11:59PM PST; the final 5/10 page papers in PDF format will be due on September 1st, 
2010 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will be published in the 
workshop proceedings as part of the IEEE digital library (pending approval). Notifications of the 
paper decisions will be sent out by October 1st, 2010. Selected excellent work may be eligible 
for additional post-conference publication as journal articles or book chapters; see this year's 
ongoing special issue in the IEEE Transactions on Parallel and Distributed Systems (TPDS) at 
http://dsl.cs.uchicago.edu/TPDS_MTC/.  Submission implies the willingness of at least one of the 
authors to register and present the paper. For more information, please visit 
http://dsl.cs.uchicago.edu/MTAGS10/. 


Important Dates
------------------------------------------------------------------------------------------------
*	Abstract Due:			August 25th, 2010
*	Papers Due:			September 1st, 2010
*	Notification of Acceptance:	October 1st, 2010
*	Camera Ready Papers Due:	November 1st, 2010
*	Workshop Date:			November 15th, 2010


Committee Members
------------------------------------------------------------------------------------------------
Workshop Chairs
*	Ioan Raicu, Illinois Institute of Technology 
*	Ian Foster, University of Chicago & Argonne National Laboratory
*	Yong Zhao, Microsoft

Steering Committee
*	David Abramson, Monash University, Australia
*	Alok Choudhary, Northwestern University, USA 
*	Jack Dongara, University of Tennessee, USA 
*	Geoffrey Fox, Indiana University, USA 
*	Robert Grossman, University of Illinois at Chicago, USA 
*	Arthur Maccabe, Oak Ridge National Labs, USA 
*	Dan Reed, Microsoft Research, USA
*	Marc Snir, University of Illinois at Urbana Champaign, USA 
*	Xian-He Sun, Illinois Institute of Technology, USA 
*	Manish Parashar, Rutgers University, USA 

Technical Committee
*	Roger Barga, Microsoft Research, USA
*	Mihai Budiu, Microsoft Research, USA
*	Rajkumar Buyya, University of Melbourne, Australia
*	Henri Casanova, University of Hawaii at Manoa, USA
*	Jeff Chase, Duke University, USA
*	Peter Dinda, Northwestern University, USA 
*	Catalin Dumitrescu, Fermi National Labs, USA
*	Evangelinos Constantinos, Massachusetts Institute of Technology, USA
*	Indranil Gupta, University of Illinois at Urbana Champaign, USA
*	Alexandru Iosup, Delft University of Technology, Netherlands 
*	Florin Isaila, Universidad Carlos III de Madrid, Spain
*	Michael Isard, Microsoft Research, USA
*	Kamil Iskra, Argonne National Laboratory, USA
*	Daniel Katz, University of Chicago, USA
*	Tevfik Kosar, Louisiana State University, USA
*	Zhiling Lan, Illinois Institute of Technology, USA
*	Ignacio Llorente, Universidad Complutense de Madrid, Spain 
*	Reagan Moore, University of North Carolina, Chappel Hill, USA
*	Jose Moreira, IBM Research, USA
*	Marlon Pierce, Indiana University, USA
*	Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory, USA
*	Matei Ripeanu, University of British Columbia, Canada
*	Alain Roy, University of Wisconsin Madison, USA 
*	Edward Walker, Texas Advanced Computing Center, USA
*	Mike Wilde, University of Chicago & Argonne National Laboratory, USA 
*	Matthew Woitaszek, The University Coorporation for Atmospheric Research, USA
*	Justin Wozniak, Argonne National Laboratory, USA
*	Ken Yocum, University of California San Diego, USA


-- 
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor
=================================================================
Computer Science Department
Illinois Institute of Technology
10 W. 31st Street
Stuart Building, Room 237D
Chicago, IL 60616
=================================================================
Cel:   1-847-722-0876
Email: iraicu at cs.iit.edu
Web:   http://www.cs.iit.edu/~iraicu/
=================================================================
=================================================================


From jon.monette at gmail.com  Thu Aug  5 13:36:52 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Thu, 05 Aug 2010 13:36:52 -0500
Subject: [Swift-user] Pads submit problem
Message-ID: <4C5B04C4.9030508@gmail.com>

Has anyone every seen this error and know what causes it?

Caused by:
     Exitcode file not found 5 queue polls after the job was reported done

-- 
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein


From jon.monette at gmail.com  Thu Aug  5 14:11:23 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Thu, 05 Aug 2010 14:11:23 -0500
Subject: [Swift-user] Coaster error
Message-ID: <4C5B0CDB.1010404@gmail.com>

Hello again,
     Also has anyone seen this error and know what it means?

Worker task failed:
org.globus.cog.abstraction.impl.common.execution.JobException: Job 
failed with an exit code of 29
     at 
org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.processCompleted(AbstractJobSubmissionTaskHandler.java:95)
     at 
org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.processCompleted(AbstractExecutor.java:240)
     at 
org.globus.cog.abstraction.impl.scheduler.common.Job.processExitCode(Job.java:104)
     at 
org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:81)
     at 
org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177)
     at 
org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126)
     at 
org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169)
     at 
org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82)
     at java.lang.Thread.run(Thread.java:619)

-- 
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein


From hategan at mcs.anl.gov  Thu Aug  5 15:33:45 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 05 Aug 2010 15:33:45 -0500
Subject: [Swift-user] Coaster error
In-Reply-To: <4C5B0CDB.1010404@gmail.com>
References: <4C5B0CDB.1010404@gmail.com>
Message-ID: <1281040425.8893.3.camel@blabla2.none>

It means that qstat failed for some reason.

Can you reproduce it?

On Thu, 2010-08-05 at 14:11 -0500, Jonathan Monette wrote:
> Hello again,
>      Also has anyone seen this error and know what it means?
> 
> Worker task failed:
> org.globus.cog.abstraction.impl.common.execution.JobException: Job 
> failed with an exit code of 29
>      at 
> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.processCompleted(AbstractJobSubmissionTaskHandler.java:95)
>      at 
> org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.processCompleted(AbstractExecutor.java:240)
>      at 
> org.globus.cog.abstraction.impl.scheduler.common.Job.processExitCode(Job.java:104)
>      at 
> org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:81)
>      at 
> org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177)
>      at 
> org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126)
>      at 
> org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169)
>      at 
> org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82)
>      at java.lang.Thread.run(Thread.java:619)
> 


From hategan at mcs.anl.gov  Thu Aug  5 15:34:58 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 05 Aug 2010 15:34:58 -0500
Subject: [Swift-user] Coaster error
In-Reply-To: <1281040425.8893.3.camel@blabla2.none>
References: <4C5B0CDB.1010404@gmail.com> <1281040425.8893.3.camel@blabla2.none>
Message-ID: <1281040498.8893.4.camel@blabla2.none>

Ignore the message I just sent.

On Thu, 2010-08-05 at 15:33 -0500, Mihael Hategan wrote:
> It means that qstat failed for some reason.
> 
> Can you reproduce it?
> 
> On Thu, 2010-08-05 at 14:11 -0500, Jonathan Monette wrote:
> > Hello again,
> >      Also has anyone seen this error and know what it means?
> > 
> > Worker task failed:
> > org.globus.cog.abstraction.impl.common.execution.JobException: Job 
> > failed with an exit code of 29
> >      at 
> > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.processCompleted(AbstractJobSubmissionTaskHandler.java:95)
> >      at 
> > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.processCompleted(AbstractExecutor.java:240)
> >      at 
> > org.globus.cog.abstraction.impl.scheduler.common.Job.processExitCode(Job.java:104)
> >      at 
> > org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:81)
> >      at 
> > org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177)
> >      at 
> > org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126)
> >      at 
> > org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169)
> >      at 
> > org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82)
> >      at java.lang.Thread.run(Thread.java:619)
> > 
> 


From iraicu at cs.uchicago.edu  Sat Aug  7 14:28:18 2010
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 07 Aug 2010 14:28:18 -0500
Subject: [Swift-user] CFP: The 5th Workshop on Workflows in Support of
 Large-Scale Science 2010
Message-ID: <4C5DB3D2.2040703@cs.uchicago.edu>

Call for Papers

The 5th Workshop on Workflows in Support of Large-Scale Science
in conjunction with SC?10

New Orleans, LA
November 14, 2010
http://www.isi.edu/works10

Scientific workflows are a key technology that enables large-scale computations 
and service management on distributed resources. Workflows enable scientists to 
design complex analysis that are composed of individual application components 
or services and often such components and services are designed, developed, and 
tested collaboratively.

The size of the data and the complexity of the analysis often lead to large 
amounts of shared resources, such as clusters and storage systems, being used 
to store the data sets and execute the workflows. The process of workflow design 
and execution in a distributed environment can be very complex and can involve 
multiple stages including their textual or graphical specification, the mapping 
of the high-level workflow descriptions onto the available resources, as well as 
monitoring and debugging of the subsequent execution.  Further, since 
computations and data access operations are performed on shared resources, there 
is an increased interest in managing the fair allocation and management of those 
resources at the workflow level.

Large-scale scientific applications pose several requirements on the workflow 
systems.  Besides the magnitude of data processed by the workflow components, 
the  intermediate and resulting data needs to be annotated with provenance and 
other information to evaluate the quality of the data and support the 
repeatability of the analysis. Further, adequate workflow descriptions are 
needed to support the complex workflow management process which includes workflow 
creation, workflow reuse, and modifications made to the workflow over time?for 
example modifications to the individual workflow components. Additional workflow 
annotations may provide guidelines and requirements for resource mapping and 
execution.

The Fifth Workshop on Workflows in Support of Large-Scale Science focuses on the 
entire workflow lifecycle including the workflow composition, mapping, robust 
execution and the recording of provenance information.  The workshop also welcomes 
contributions in the applications area, where the requirements on the workflow 
management systems can be derived. Special attention will be paid to Bio-Computing 
applications which are the theme for SC10. The topics of the workshop include but 
are not limited to:

    * Workflow applications and their requirements with special emphasis on 
	Bio-Computing applications.
    * Workflow composition, tools and languages.
    * Workflow user environments, including portals.
    * Workflow refinement tools that can manage the workflow mapping process.
    * Workflow execution in distributed environments.
    * Workflow fault-tolerance and recovery techniques.
    * Data-driven workflow processing.
    * Adaptive workflows.
    * Workflow monitoring.
    * Workflow optimizations.
    * Performance analysis of workflows
    * Workflow debugging.
    * Workflow provenance.
    * Interactive workflows.
    * Workflow interoperability
    * Mashups and workflows
    * Workflows on the cloud.

Important Dates:
Papers due September 3, 2010
Notifications of acceptance September 30, 2010
Final papers due October 8, 2010

We will accept both short (6pages) and long (10 page) papers. The papers should 
be in IEEE format. To submit the papers, please submit to EasyChair at 
http://www.easychair.org/conferences/?conf=works10

If you have questions, please email works10 at isi.edu


-- 
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor
=================================================================
Computer Science Department
Illinois Institute of Technology
10 W. 31st Street
Stuart Building, Room 237D
Chicago, IL 60616
=================================================================
Cel:   1-847-722-0876
Email: iraicu at cs.iit.edu
Web:   http://www.cs.iit.edu/~iraicu/
=================================================================
=================================================================


From iraicu at cs.uchicago.edu  Tue Aug 10 20:03:02 2010
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 10 Aug 2010 20:03:02 -0500
Subject: [Swift-user] CFP: The 20th International ACM Symposium on
 High-Performance Parallel and Distributed Computing (HPDC) 2011
Message-ID: <4C61F6C6.3020203@cs.uchicago.edu>

Call For Papers

The 20th International ACM Symposium on 
High-Performance Parallel and Distributed Computing
http://www.hpdc.org/2011/

San Jose, California, June 8-11, 2011

The ACM International Symposium on High-Performance Parallel and Distributed 
Computing is the premier conference for presenting the latest research on the 
design, implementation, evaluation, and use of parallel and distributed systems 
for high end computing. The 20th installment of HPDC will take place in 
San Jose, California, in the heart of Silicon Valley. This year, HPDC is 
affiliated with the ACM Federated Computing Research Conference, consisting of 
fifteen leading ACM conferences all in one week. HPDC will be held on June 9-11 
(Thursday through Saturday) with affiliated workshops taking place on June 8th 
(Wednesday).

Submissions are welcomed on all forms of high performance parallel and 
distributed computing, including but not limited to clusters, clouds, grids, 
utility computing, data-intensive computing, multicore and parallel computing. 
All papers will be reviewed by a distinguished program committee, with a strong 
preference for rigorous results obtained in operational parallel and distributed 
systems. All papers will be evaluated for correctness, originality, potential 
impact, quality of presentation, and interest and relevance to the conference.

In addition to traditional technical papers, we also invite experience papers. 
Such papers should present operational details of a production high end system 
or application, and draw out conclusions gained from operating the system or 
application. The evaluation of experience papers will place a greater weight on 
the real-world impact of the system and the value of conclusions to future 
system designs.


Topics of interest include, but are not limited to:
-------------------------------------------------------------------------------
# Applications of parallel and distributed computing.
# Systems, networks, and architectures for high end computing.
# Parallel and multicore issues and opportunities.
# Virtualization of machines, networks, and storage.
# Programming languages and environments.
# I/O, file systems, and data management.
# Data intensive computing.
# Resource management, scheduling, and load-balancing.
# Performance modeling, simulation, and prediction.
# Fault tolerance, reliability and availability.
# Security, configuration, policy, and management issues.
# Models and use cases for utility, grid, and cloud computing.

Authors are invited to submit technical papers of at most 12 pages in PDF 
format, including all figures and references. Papers should be formatted in the 
ACM Proceedings Style and submitted via the conference web site. Accepted 
papers will appear in the conference proceedings, and will be incorporated into 
the ACM Digital Library.

Papers must be self-contained and provide the technical substance required for 
the program committee to evaluate the paper's contribution. Papers should 
thoughtfully address all related work, particularly work presented at previous 
HPDC events. Submitted papers must be original work that has not appeared in 
and is not under consideration for another conference or a journal. See the ACM 
Prior Publication Policy for more details.

Workshops
-------------------------------------------------------------------------------
We invite proposals for workshops affiliated with HPDC to be held on Wednesday, 
June 8th. For more information, see the Call for Workshops at 
http://www.hpdc.org/2011/cfw.php.

Important Dates
-------------------------------------------------------------------------------
Workshop Proposals Due          1 October 2010
Technical Papers Due:           17 January 2011
PAPER DEADLINE EXTENDED:        24 January 2011 (No further extensions!)
Author Notifications:           28 February 2011
Final Papers Due:               24 March 2011
Conference Dates:               8-11 June 2011

Organization
-------------------------------------------------------------------------------
General Chair
Barney Maccabe, Oak Ridge National Laboratory

Program Chair
Douglas Thain, University of Notre Dame

Workshops Chair
Mike Lewis, Binghamton University

Local Arrangements Chair
Nick Wright, Lawrence Berkeley National Laboratory

Publicity Chairs
Alexandru Iosup, Delft University
John Lange, University of Pittsburgh
Ioan Raicu, Illinois Institute of Technology
Yong Zhao, Microsoft

Program Committee
Kento Aida, National Institute of Informatics
Henri Bal, Vrije Universiteit
Roger Barga, Microsoft
Jim Basney, NCSA
John Bent, Los Alamos National Laboratory
Ron Brightwell, Sandia National Laboratories
Shawn Brown, Pittsburgh Supercomputer Center
Claris Castillo, IBM
Andrew A. Chien, UC San Diego and SDSC
Ewa Deelman, USC Information Sciences Institute
Peter Dinda, Northwestern University
Scott Emrich, University of Notre Dame
Dick Epema, TU-Delft
Gilles Fedak, INRIA
Renato Figuierdo, University of Florida
Ian Foster, University of Chicago and Argonne National Laboratory
Gabriele Garzoglio, Fermi National Accelerator Laboratory
Rong Ge, Marquette University
Sebastien Goasguen, Clemson University
Kartik Gopalan, Binghamton University
Dean Hildebrand, IBM Almaden
Adriana Iamnitchi, University of South Florida
Alexandru Iosup, TU-Delft
Keith Jackson, Lawrence Berkeley
Shantenu Jha, Louisiana State University
Daniel S. Katz, University of Chicago and Argonne National Laboratory
Thilo Kielmann, Vrije Universiteit
Charles Killian, Purdue University
Tevfik Kosar, Louisiana State University
John Lange, University of Pittsburgh
Mike Lewis, Binghamton University
Barney Maccabe, Oak Ridge National Laboratory
Grzegorz Malewicz, Google
Satoshi Matsuoka, Tokyo Institute of Technology
Jarek Nabrzyski, University of Notre Dame
Manish Parashar, Rutgers University
Beth Plale, Indiana University
Ioan Raicu, Illinois Institute of Technology
Philip Rhodes, University of Mississippi
Philip Roth, Oak Ridge National Laboratory
Karsten Schwan, Georgia Tech
Martin Swany, University of Delaware
Jon Weissman, University of Minnesota
Dongyan Xu, Purdue University
Ken Yocum, UCSD
Yong Zhao, Microsoft

Steering Committee
Henri Bal, Vrije Universiteit
Andrew A. Chien, UC San Diego and SDSC
Peter Dinda, Northwestern University
Ian Foster, Argonne National Laboratory and University of Chicago
Dennis Gannon, Microsoft
Salim Hariri, University of Arizona
Dieter Kranzlmueller, Ludwig-Maximilians-Univ. Muenchen
Manish Parashar, Rutgers University
Karsten Schwan, Georgia Tech
Jon Weissman, University of Minnesota (Chair)


-- 
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor
=================================================================
Computer Science Department
Illinois Institute of Technology
10 W. 31st Street
Stuart Building, Room 237D
Chicago, IL 60616
=================================================================
Cel:   1-847-722-0876
Email: iraicu at cs.iit.edu
Web:   http://www.cs.iit.edu/~iraicu/
=================================================================
=================================================================


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20100810/bd95be15/attachment.html>

From iraicu at cs.uchicago.edu  Tue Aug 10 20:47:53 2010
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 10 Aug 2010 20:47:53 -0500
Subject: [Swift-user] Call for Workshops at ACM HPDC 2011
Message-ID: <4C620149.3060609@cs.uchicago.edu>

Call for Workshops

The 20th International ACM Symposium on 
High-Performance Parallel and Distributed Computing
http://www.hpdc.org/2011/

San Jose, California, June 8-11, 2011

-------------------------------------------------------------------------------
The ACM Symposium on High Performance Distributed Computing (HPDC) conference 
organizers invite proposals for Workshops to be held with HPDC in San Jose, 
California in June 2011. Workshops will run on June 8, preceding the main 
conference sessions June 9-11.

HPDC 2011 is the 20th anniversary of HPDC, a preeminent conference in high 
performance computing, including cloud and grid computing. This year's 
conference will be held in conjunction with the Federated Computing Research 
Conference (FCRC), which includes high profile conferences in complementary 
research areas, providing a unique opportunity for a broader technical audience 
and wider impact for successful workshops.

Workshops provide forums for discussion among researchers and practitioners on 
focused topics, emerging research areas, or both. Organizers may structure 
workshops as they see fit, possibly including invited talks, panel discussions, 
presentations of work in progress, fully peer-reviewed papers, or some 
combination. Workshops could be scheduled for a half day or a full day, 
depending on interest, space constraints, and organizer preference. Organizers 
should design workshops for approximately 20-40 participants, to balance impact 
and effective discussion.

A workshop proposal must be made in writing, sent to Mike Lewis at 
mlewis at cs.binghamton.edu, and should include:
# The name of the workshop
# Several paragraphs describing the theme of the workshop and how it relates to 
  the HPDC conference
# Data about previous offerings of the workshop (if any), including attendance, 
  number of papers, or presentations submitted and accepted
# Names and affiliations of the workshop organizers, and if applicable, a 
  significant portion of the program committee
# A plan for attracting submissions and attendees

Due to publication deadlines, workshops must operate within roughly the following 
timeline: papers due in early February (2-3 weeks after the HPDC deadline) and 
selected and sent to the publisher by late February.

IMPORTANT DATES:
# Workshop Proposals Deadline:                  October 1, 2010
# Notification:                                 October 25, 2010
# Workshop CFPs Online and Distributed:         November 8, 2010 


-- 
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor
=================================================================
Computer Science Department
Illinois Institute of Technology
10 W. 31st Street
Stuart Building, Room 237D
Chicago, IL 60616
=================================================================
Cel:   1-847-722-0876
Email: iraicu at cs.iit.edu
Web:   http://www.cs.iit.edu/~iraicu/
=================================================================
=================================================================


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20100810/e3f949b9/attachment.html>

From dk0966 at cs.ship.edu  Fri Aug 20 11:04:30 2010
From: dk0966 at cs.ship.edu (David Kelly)
Date: Fri, 20 Aug 2010 12:04:30 -0400
Subject: [Swift-user] Exitcode file not found
Message-ID: <AANLkTim+t4Uccy0VpVvqC9ZRsH+yOm1aQh08qQUrvqxZ@mail.gmail.com>

Hello,

While running Mike's MODIS demo on PADS with pbs and coasters, I receive the
following error:

Worker task failed:
org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Exitcode
file not found 5 queue polls after the job was reported done
    at
org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:66)
    at
org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177)
    at
org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126)
    at
org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169)
    at
org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82)
    at java.lang.Thread.run(Thread.java:619)

I also receive errors relating to qdel:

Canceling job
Failed to shut down block
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Failed
to cancel task. qdel returned with an exit code of 1
    at
org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:159)
    at
org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
    at
org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
    at
org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:101)
    at
org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:90)
    at
org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:44)
    at
org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:293)
    at
org.globus.cog.abstraction.coaster.service.job.manager.Block.shutdown(Block.java:274)
    at
org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.shutdownBlocks(BlockQueueProcessor.java:518)
    at
org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.shutdown(BlockQueueProcessor.java:510)
    at
org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.shutdown(JobQueue.java:108)
    at
org.globus.cog.abstraction.coaster.service.CoasterService.shutdown(CoasterService.java:249)
    at
org.globus.cog.abstraction.coaster.service.ServiceShutdownHandler.requestComplete(ServiceShutdownHandler.java:28)
    at
org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84)
    at
org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:387)
    at
org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel.actualSend(AbstractPipedChannel.java:86)
    at
org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:115)
Canceling job

Checking through the mailing list archives, I found an instance where this
was happening when the work directory was /var/tmp and not consistent across
all nodes. The work directory in my configuration is /home/davidk/swiftwork,
so I'm not sure what's causing it. Attached are the sites.xml, tc.data,
swift.properties and the script I'm using. The full log can be found in
/home/davidk/modis/run.0019.

Thanks,
David
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20100820/f6f0154b/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: modis.swift
Type: application/octet-stream
Size: 1270 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20100820/f6f0154b/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sites.xml
Type: text/xml
Size: 730 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20100820/f6f0154b/attachment.xml>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: swift.properties
Type: application/octet-stream
Size: 11600 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20100820/f6f0154b/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tc.data
Type: application/octet-stream
Size: 2026 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20100820/f6f0154b/attachment-0002.obj>

From dk0966 at cs.ship.edu  Fri Aug 20 12:25:37 2010
From: dk0966 at cs.ship.edu (David Kelly)
Date: Fri, 20 Aug 2010 13:25:37 -0400
Subject: [Swift-user] Exitcode file not found
In-Reply-To: <4C6EA95F.1000106@gmail.com>
References: <AANLkTim+t4Uccy0VpVvqC9ZRsH+yOm1aQh08qQUrvqxZ@mail.gmail.com>
	<4C6EA95F.1000106@gmail.com>
Message-ID: <1282325137.2492.11.camel@hbk>


I added the internalhostname profile like you suggested, but for some
reason I am still getting the exitcode and qdel errors. Is there
anything else I'm missing? The updated info is
in /home/davidk/modis/run.0020. Thanks.

<config>
<pool handle="pads-local-pbs-coasters">
   <execution jobmanager="local:pbs" provider="coaster" url="none" />
   <filesystem provider="local" url="none" />
   <profile key="maxtime" namespace="globus">3500</profile>
   <profile key="workersPerNode" namespace="globus">8</profile>
   <profile key="slots" namespace="globus">4</profile>
   <profile key="nodeGranularity" namespace="globus">4</profile>
   <profile key="maxNodes" namespace="globus">32</profile>
   <profile key="queue" namespace="globus">fast</profile>
   <profile key="jobThrottle" namespace="karajan">3</profile>
   <profile key="initialScore" namespace="karajan">10000</profile>
   <profile key="internalhostname"
namespace="globus">192.5.86.5</profile>
   <workdirectory>/home/davidk/swiftwork</workdirectory>
 </pool>
</config>


On Fri, 2010-08-20 at 11:12 -0500, Jonathan Monette wrote:
> You must set the interanlhostname parameter for coasters.
> 
> <profile key="internalhostname"
> namespace="globus">192.5.86.6</profile>.  This is when you are on the
> login2 node for PADS.
> <profile key="internalhostname"
> namespace="globus">192.5.86.5</profile>.  This is when you are on the
> login1 node for PADS.


From aespinosa at cs.uchicago.edu  Mon Aug 23 14:41:35 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 23 Aug 2010 14:41:35 -0500
Subject: [Swift-user] coaster maxtime
Message-ID: <AANLkTi=LmTkcaDhXYgXycgSS+U6oKr3T6T3Dcc5UtsFg@mail.gmail.com>

Hi,

Can someone remind me what is the units of the maxtime parameter for
coasters?  Table 13 of the user guide does not specify it.

Thanks,
-Allan

-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From wilde at mcs.anl.gov  Mon Aug 23 14:44:15 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 23 Aug 2010 13:44:15 -0600 (GMT-06:00)
Subject: [Swift-user] coaster maxtime
In-Reply-To: <AANLkTi=LmTkcaDhXYgXycgSS+U6oKr3T6T3Dcc5UtsFg@mail.gmail.com>
Message-ID: <409785339.1210161282592655201.JavaMail.root@zimbra.anl.gov>

seconds.

- Mike

----- "Allan Espinosa" <aespinosa at cs.uchicago.edu> wrote:

> Hi,
> 
> Can someone remind me what is the units of the maxtime parameter for
> coasters?  Table 13 of the user guide does not specify it.
> 
> Thanks,
> -Allan
> 
> -- 
> Allan M. Espinosa <http://amespinosa.wordpress.com>
> PhD student, Computer Science
> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Mon Aug 23 14:46:42 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 23 Aug 2010 14:46:42 -0500
Subject: [Swift-user] coaster maxtime
In-Reply-To: <AANLkTi=LmTkcaDhXYgXycgSS+U6oKr3T6T3Dcc5UtsFg@mail.gmail.com>
References: <AANLkTi=LmTkcaDhXYgXycgSS+U6oKr3T6T3Dcc5UtsFg@mail.gmail.com>
Message-ID: <1282592802.2046.0.camel@blabla2.none>

Seconds. This should be changed to be the same as a walltime spec.

On Mon, 2010-08-23 at 14:41 -0500, Allan Espinosa wrote:
> Hi,
> 
> Can someone remind me what is the units of the maxtime parameter for
> coasters?  Table 13 of the user guide does not specify it.
> 
> Thanks,
> -Allan
> 


From wilde at mcs.anl.gov  Mon Aug 23 14:49:55 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 23 Aug 2010 13:49:55 -0600 (GMT-06:00)
Subject: [Swift-user] coaster maxtime
In-Reply-To: <1282592802.2046.0.camel@blabla2.none>
Message-ID: <571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov>

I agree. How about for backwards compatibility, accept an entry without ":"s as seconds. That seems a reasonable convention for all time fields:
hh:mm:ss or nnnn seconds.

- Mike

----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:

> Seconds. This should be changed to be the same as a walltime spec.
> 
> On Mon, 2010-08-23 at 14:41 -0500, Allan Espinosa wrote:
> > Hi,
> > 
> > Can someone remind me what is the units of the maxtime parameter
> for
> > coasters?  Table 13 of the user guide does not specify it.
> > 
> > Thanks,
> > -Allan
> > 
> 
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From aespinosa at cs.uchicago.edu  Mon Aug 23 14:54:31 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 23 Aug 2010 14:54:31 -0500
Subject: [Swift-user] coaster maxtime
In-Reply-To: <571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov>
References: <1282592802.2046.0.camel@blabla2.none>
	<571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov>
Message-ID: <AANLkTik_W38h-8wcV=BSjgRrEfHNHyMS8FAcbZLtQKac@mail.gmail.com>

In other fields like "maxwalltime",  the default in nnnn format is minutes

2010/8/23 Michael Wilde <wilde at mcs.anl.gov>:
> I agree. How about for backwards compatibility, accept an entry without ":"s as seconds. That seems a reasonable convention for all time fields:
> hh:mm:ss or nnnn seconds.
>
> - Mike
>
> ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
>
>> Seconds. This should be changed to be the same as a walltime spec.
>>
>> On Mon, 2010-08-23 at 14:41 -0500, Allan Espinosa wrote:
>> > Hi,
>> >
>> > Can someone remind me what is the units of the maxtime parameter
>> for
>> > coasters? ?Table 13 of the user guide does not specify it.
>> >
>> > Thanks,
>> > -Allan


From aespinosa at cs.uchicago.edu  Mon Aug 23 14:55:05 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 23 Aug 2010 14:55:05 -0500
Subject: [Swift-user] coaster maxtime
In-Reply-To: <1282592802.2046.0.camel@blabla2.none>
References: <AANLkTi=LmTkcaDhXYgXycgSS+U6oKr3T6T3Dcc5UtsFg@mail.gmail.com>
	<1282592802.2046.0.camel@blabla2.none>
Message-ID: <AANLkTinnV3A6PaxPBBktZk_LV7DPdk93Cb4RD+f3ZbBJ@mail.gmail.com>

Thanks guys.  Looking at my old sites.xml files make sense now :)

-Allan

2010/8/23 Mihael Hategan <hategan at mcs.anl.gov>:
> Seconds. This should be changed to be the same as a walltime spec.
>
> On Mon, 2010-08-23 at 14:41 -0500, Allan Espinosa wrote:
>> Hi,
>>
>> Can someone remind me what is the units of the maxtime parameter for
>> coasters? ?Table 13 of the user guide does not specify it.
>>
>> Thanks,
>> -Allan


From hategan at mcs.anl.gov  Mon Aug 23 15:01:56 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 23 Aug 2010 15:01:56 -0500
Subject: [Swift-user] coaster maxtime
In-Reply-To: <571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov>
References: <571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov>
Message-ID: <1282593716.2228.2.camel@blabla2.none>

On Mon, 2010-08-23 at 13:49 -0600, Michael Wilde wrote:
> I agree. How about for backwards compatibility, accept an entry without ":"s as seconds. That seems a reasonable convention for all time fields:
> hh:mm:ss or nnnn seconds.

Except that for walltimes in general that would mean minutes. Which is
exactly the cause for some funny troubles. Justin an I stared at
(maxtime="3600" walltime="3600" ) for quite a while before figuring out
that one was minutes and the other seconds.

Mihael


From wilde at mcs.anl.gov  Thu Aug 26 23:11:55 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 26 Aug 2010 22:11:55 -0600 (GMT-06:00)
Subject: [Swift-user] Errors in 13-site OSG run: lazy error question
In-Reply-To: <AANLkTiktszT4zWjL4zPD8JGnrYietXfJRBZkcrrFQmHF@mail.gmail.com>
Message-ID: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>

Glen, I wonder if whats happening here is that Swift will retry and lazily run past *job* errors, but the error below (a mapping error) is maybe being treated as an error in Swift's interpretation of the script itself, and this causes an immediate halt to execution?

Can anyone confirm that this is whats happening, and if it is the expected behavior?

Also, Glen, 2 questions:

1) Isn't the error below the one that was fixed by Mihael in a recent revision - the same one I looked at earlier in the week?

2) Do you know what errors the "Failed but can retry:8" message is referring to?

Where is the log/run directory for this run?  How long did it take to get the 589 jobs finished?  It would be good to start plotting these large multi-site runs to get a sense of how the scheduler is doing.

- Mike


----- "Glen Hocky" <hockyg at uchicago.edu> wrote:

> here's the result of my 13 site run that ran while i was out this
> evening. It did pretty well!
> but seems to have that problem of not quite lazy errors
> ........
> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3
> Stage out:1 Finished successfully:586
> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4
> Stage out:2 Finished successfully:587
> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished
> successfully:587 Failed but can retry:6
> Progress: Submitting:3 Submitted:262 Active:140 Finished
> successfully:589 Failed but can retry:8
> Failed to transfer wrapper log from
> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on
> UCHC_CBG_vdgateway.vcell.uchc.edu
> Execution failed:
> org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile)
> for org.griphyn.vdl.mapping.DataNode identifier
> tag:benc at ci.uchicago.edu
> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut
> with no value at dataset=modelOut path=[3][1][11] (not closed)

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Thu Aug 26 23:15:44 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 26 Aug 2010 23:15:44 -0500
Subject: [Swift-user] Errors in 13-site OSG run: lazy error question
In-Reply-To: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
References: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
Message-ID: <1282882544.17690.0.camel@blabla2.none>

Wait, wait, wait. Is this a new "invalid path (..logfile)" error?

On Thu, 2010-08-26 at 22:11 -0600, Michael Wilde wrote:
> Glen, I wonder if whats happening here is that Swift will retry and lazily run past *job* errors, but the error below (a mapping error) is maybe being treated as an error in Swift's interpretation of the script itself, and this causes an immediate halt to execution?
> 
> Can anyone confirm that this is whats happening, and if it is the expected behavior?
> 
> Also, Glen, 2 questions:
> 
> 1) Isn't the error below the one that was fixed by Mihael in a recent revision - the same one I looked at earlier in the week?
> 
> 2) Do you know what errors the "Failed but can retry:8" message is referring to?
> 
> Where is the log/run directory for this run?  How long did it take to get the 589 jobs finished?  It would be good to start plotting these large multi-site runs to get a sense of how the scheduler is doing.
> 
> - Mike
> 
> 
> ----- "Glen Hocky" <hockyg at uchicago.edu> wrote:
> 
> > here's the result of my 13 site run that ran while i was out this
> > evening. It did pretty well!
> > but seems to have that problem of not quite lazy errors
> > ........
> > Progress: Submitting:3 Submitted:262 Active:147 Checking status:3
> > Stage out:1 Finished successfully:586
> > Progress: Submitting:3 Submitted:262 Active:144 Checking status:4
> > Stage out:2 Finished successfully:587
> > Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished
> > successfully:587 Failed but can retry:6
> > Progress: Submitting:3 Submitted:262 Active:140 Finished
> > successfully:589 Failed but can retry:8
> > Failed to transfer wrapper log from
> > glassRunCavities-20100826-1718-7gi0dzs1/info/5 on
> > UCHC_CBG_vdgateway.vcell.uchc.edu
> > Execution failed:
> > org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile)
> > for org.griphyn.vdl.mapping.DataNode identifier
> > tag:benc at ci.uchicago.edu
> > ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut
> > with no value at dataset=modelOut path=[3][1][11] (not closed)
> 


From hategan at mcs.anl.gov  Thu Aug 26 23:27:38 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 26 Aug 2010 23:27:38 -0500
Subject: [Swift-user] Errors in 13-site OSG run: lazy error question
In-Reply-To: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
References: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
Message-ID: <1282883258.17811.2.camel@blabla2.none>

On Thu, 2010-08-26 at 22:11 -0600, Michael Wilde wrote:
> Glen, I wonder if whats happening here is that Swift will retry and
> lazily run past *job* errors, but the error below (a mapping error) is
> maybe being treated as an error in Swift's interpretation of the
> script itself, and this causes an immediate halt to execution?
> 
> Can anyone confirm that this is whats happening, and if it is the expected behavior?

Right. Some errors are re-triable. Jobs get retried in the hope that
they will go away. Which means that they don't get reported until the
last round (and currently only the last error is reported).

Some errors, such as the ones considered to be internal inconsistencies,
will cause everything to fail immediately.


From hockyg at uchicago.edu  Thu Aug 26 23:54:22 2010
From: hockyg at uchicago.edu (Glen Hocky)
Date: Fri, 27 Aug 2010 00:54:22 -0400
Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question
In-Reply-To: <-3390454574925164218@unknownmsgid>
References: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
	<-3390454574925164218@unknownmsgid>
Message-ID: <AANLkTimV7-3XTWAqp9EAbeATJZXTEcG-FoDqxcFZB=WL@mail.gmail.com>

log is on engage-submit
/home/hockyg/swift_logs/glassRunCavities-20100826-1718-7gi0dzs1.log

On Fri, Aug 27, 2010 at 12:35 AM, Glen Hocky <hockyg at gmail.com> wrote:

> Yes nominally the same error but it's not at the beginning but in the
> middle now for some reason. I think it's a mid-stated error message.
> I'll attach the log soon
>
> On Aug 27, 2010, at 12:11 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:
>
> > Glen, I wonder if whats happening here is that Swift will retry and
> lazily run past *job* errors, but the error below (a mapping error) is maybe
> being treated as an error in Swift's interpretation of the script itself,
> and this causes an immediate halt to execution?
> >
> > Can anyone confirm that this is whats happening, and if it is the
> expected behavior?
> >
> > Also, Glen, 2 questions:
> >
> > 1) Isn't the error below the one that was fixed by Mihael in a recent
> revision - the same one I looked at earlier in the week?
> >
> > 2) Do you know what errors the "Failed but can retry:8" message is
> referring to?
> >
> > Where is the log/run directory for this run?  How long did it take to get
> the 589 jobs finished?  It would be good to start plotting these large
> multi-site runs to get a sense of how the scheduler is doing.
> >
> > - Mike
> >
> >
> > ----- "Glen Hocky" <hockyg at uchicago.edu> wrote:
> >
> >> here's the result of my 13 site run that ran while i was out this
> >> evening. It did pretty well!
> >> but seems to have that problem of not quite lazy errors
> >> ........
> >> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3
> >> Stage out:1 Finished successfully:586
> >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4
> >> Stage out:2 Finished successfully:587
> >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished
> >> successfully:587 Failed but can retry:6
> >> Progress: Submitting:3 Submitted:262 Active:140 Finished
> >> successfully:589 Failed but can retry:8
> >> Failed to transfer wrapper log from
> >> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on
> >> UCHC_CBG_vdgateway.vcell.uchc.edu
> >> Execution failed:
> >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile)
> >> for org.griphyn.vdl.mapping.DataNode identifier
> >> tag:benc at ci.uchicago.edu <tag%3Abenc at ci.uchicago.edu>
> >> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut
> >> with no value at dataset=modelOut path=[3][1][11] (not closed)
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20100827/ba9b1fa3/attachment.html>

From wilde at mcs.anl.gov  Fri Aug 27 10:06:03 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 27 Aug 2010 09:06:03 -0600 (GMT-06:00)
Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question
In-Reply-To: <-3390454574925164218@unknownmsgid>
Message-ID: <1099728118.1372671282921563325.JavaMail.root@zimbra.anl.gov>

Glen, as I recall, in the previous incident of this error we re-created with a simpler script, using only the "cat" app(), correct?

Is it possible to re-create this similar error in a similar test script?

Mihael, any thoughts on whether its likely that the prior fix did not address all cases?

Thanks,

- Mike


----- "Glen Hocky" <hockyg at gmail.com> wrote:

> Yes nominally the same error but it's not at the beginning but in the
> middle now for some reason. I think it's a mid-stated error message.
> I'll attach the log soon
> 
> On Aug 27, 2010, at 12:11 AM, Michael Wilde <wilde at mcs.anl.gov>
> wrote:
> 
> > Glen, I wonder if whats happening here is that Swift will retry and
> lazily run past *job* errors, but the error below (a mapping error) is
> maybe being treated as an error in Swift's interpretation of the
> script itself, and this causes an immediate halt to execution?
> >
> > Can anyone confirm that this is whats happening, and if it is the
> expected behavior?
> >
> > Also, Glen, 2 questions:
> >
> > 1) Isn't the error below the one that was fixed by Mihael in a
> recent revision - the same one I looked at earlier in the week?
> >
> > 2) Do you know what errors the "Failed but can retry:8" message is
> referring to?
> >
> > Where is the log/run directory for this run?  How long did it take
> to get the 589 jobs finished?  It would be good to start plotting
> these large multi-site runs to get a sense of how the scheduler is
> doing.
> >
> > - Mike
> >
> >
> > ----- "Glen Hocky" <hockyg at uchicago.edu> wrote:
> >
> >> here's the result of my 13 site run that ran while i was out this
> >> evening. It did pretty well!
> >> but seems to have that problem of not quite lazy errors
> >> ........
> >> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3
> >> Stage out:1 Finished successfully:586
> >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4
> >> Stage out:2 Finished successfully:587
> >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2
> Finished
> >> successfully:587 Failed but can retry:6
> >> Progress: Submitting:3 Submitted:262 Active:140 Finished
> >> successfully:589 Failed but can retry:8
> >> Failed to transfer wrapper log from
> >> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on
> >> UCHC_CBG_vdgateway.vcell.uchc.edu
> >> Execution failed:
> >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path
> (..logfile)
> >> for org.griphyn.vdl.mapping.DataNode identifier
> >> tag:benc at ci.uchicago.edu
> >> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type
> GlassOut
> >> with no value at dataset=modelOut path=[3][1][11] (not closed)
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory


From hategan at mcs.anl.gov  Fri Aug 27 11:34:05 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 27 Aug 2010 11:34:05 -0500
Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question
In-Reply-To: <1099728118.1372671282921563325.JavaMail.root@zimbra.anl.gov>
References: <1099728118.1372671282921563325.JavaMail.root@zimbra.anl.gov>
Message-ID: <1282926845.19454.6.camel@blabla2.none>

Or if you can find the stack trace of that specific error in the log,
that might be useful.

On Fri, 2010-08-27 at 09:06 -0600, Michael Wilde wrote:
> Glen, as I recall, in the previous incident of this error we re-created with a simpler script, using only the "cat" app(), correct?
> 
> Is it possible to re-create this similar error in a similar test script?
> 
> Mihael, any thoughts on whether its likely that the prior fix did not address all cases?
> 
> Thanks,
> 
> - Mike
> 
> 
> ----- "Glen Hocky" <hockyg at gmail.com> wrote:
> 
> > Yes nominally the same error but it's not at the beginning but in the
> > middle now for some reason. I think it's a mid-stated error message.
> > I'll attach the log soon
> > 
> > On Aug 27, 2010, at 12:11 AM, Michael Wilde <wilde at mcs.anl.gov>
> > wrote:
> > 
> > > Glen, I wonder if whats happening here is that Swift will retry and
> > lazily run past *job* errors, but the error below (a mapping error) is
> > maybe being treated as an error in Swift's interpretation of the
> > script itself, and this causes an immediate halt to execution?
> > >
> > > Can anyone confirm that this is whats happening, and if it is the
> > expected behavior?
> > >
> > > Also, Glen, 2 questions:
> > >
> > > 1) Isn't the error below the one that was fixed by Mihael in a
> > recent revision - the same one I looked at earlier in the week?
> > >
> > > 2) Do you know what errors the "Failed but can retry:8" message is
> > referring to?
> > >
> > > Where is the log/run directory for this run?  How long did it take
> > to get the 589 jobs finished?  It would be good to start plotting
> > these large multi-site runs to get a sense of how the scheduler is
> > doing.
> > >
> > > - Mike
> > >
> > >
> > > ----- "Glen Hocky" <hockyg at uchicago.edu> wrote:
> > >
> > >> here's the result of my 13 site run that ran while i was out this
> > >> evening. It did pretty well!
> > >> but seems to have that problem of not quite lazy errors
> > >> ........
> > >> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3
> > >> Stage out:1 Finished successfully:586
> > >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4
> > >> Stage out:2 Finished successfully:587
> > >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2
> > Finished
> > >> successfully:587 Failed but can retry:6
> > >> Progress: Submitting:3 Submitted:262 Active:140 Finished
> > >> successfully:589 Failed but can retry:8
> > >> Failed to transfer wrapper log from
> > >> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on
> > >> UCHC_CBG_vdgateway.vcell.uchc.edu
> > >> Execution failed:
> > >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path
> > (..logfile)
> > >> for org.griphyn.vdl.mapping.DataNode identifier
> > >> tag:benc at ci.uchicago.edu
> > >> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type
> > GlassOut
> > >> with no value at dataset=modelOut path=[3][1][11] (not closed)
> > >
> > > --
> > > Michael Wilde
> > > Computation Institute, University of Chicago
> > > Mathematics and Computer Science Division
> > > Argonne National Laboratory
> > >
> 


From hategan at mcs.anl.gov  Fri Aug 27 11:41:07 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 27 Aug 2010 11:41:07 -0500
Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question
In-Reply-To: <1282926845.19454.6.camel@blabla2.none>
References: <1099728118.1372671282921563325.JavaMail.root@zimbra.anl.gov>
	<1282926845.19454.6.camel@blabla2.none>
Message-ID: <1282927267.19454.7.camel@blabla2.none>

Or even the log itself, because I don't think I have access to
engage-submit.

On Fri, 2010-08-27 at 11:34 -0500, Mihael Hategan wrote:
> Or if you can find the stack trace of that specific error in the log,
> that might be useful.
> 
> On Fri, 2010-08-27 at 09:06 -0600, Michael Wilde wrote:
> > Glen, as I recall, in the previous incident of this error we re-created with a simpler script, using only the "cat" app(), correct?
> > 
> > Is it possible to re-create this similar error in a similar test script?
> > 
> > Mihael, any thoughts on whether its likely that the prior fix did not address all cases?
> > 
> > Thanks,
> > 
> > - Mike
> > 
> > 
> > ----- "Glen Hocky" <hockyg at gmail.com> wrote:
> > 
> > > Yes nominally the same error but it's not at the beginning but in the
> > > middle now for some reason. I think it's a mid-stated error message.
> > > I'll attach the log soon
> > > 
> > > On Aug 27, 2010, at 12:11 AM, Michael Wilde <wilde at mcs.anl.gov>
> > > wrote:
> > > 
> > > > Glen, I wonder if whats happening here is that Swift will retry and
> > > lazily run past *job* errors, but the error below (a mapping error) is
> > > maybe being treated as an error in Swift's interpretation of the
> > > script itself, and this causes an immediate halt to execution?
> > > >
> > > > Can anyone confirm that this is whats happening, and if it is the
> > > expected behavior?
> > > >
> > > > Also, Glen, 2 questions:
> > > >
> > > > 1) Isn't the error below the one that was fixed by Mihael in a
> > > recent revision - the same one I looked at earlier in the week?
> > > >
> > > > 2) Do you know what errors the "Failed but can retry:8" message is
> > > referring to?
> > > >
> > > > Where is the log/run directory for this run?  How long did it take
> > > to get the 589 jobs finished?  It would be good to start plotting
> > > these large multi-site runs to get a sense of how the scheduler is
> > > doing.
> > > >
> > > > - Mike
> > > >
> > > >
> > > > ----- "Glen Hocky" <hockyg at uchicago.edu> wrote:
> > > >
> > > >> here's the result of my 13 site run that ran while i was out this
> > > >> evening. It did pretty well!
> > > >> but seems to have that problem of not quite lazy errors
> > > >> ........
> > > >> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3
> > > >> Stage out:1 Finished successfully:586
> > > >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4
> > > >> Stage out:2 Finished successfully:587
> > > >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2
> > > Finished
> > > >> successfully:587 Failed but can retry:6
> > > >> Progress: Submitting:3 Submitted:262 Active:140 Finished
> > > >> successfully:589 Failed but can retry:8
> > > >> Failed to transfer wrapper log from
> > > >> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on
> > > >> UCHC_CBG_vdgateway.vcell.uchc.edu
> > > >> Execution failed:
> > > >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path
> > > (..logfile)
> > > >> for org.griphyn.vdl.mapping.DataNode identifier
> > > >> tag:benc at ci.uchicago.edu
> > > >> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type
> > > GlassOut
> > > >> with no value at dataset=modelOut path=[3][1][11] (not closed)
> > > >
> > > > --
> > > > Michael Wilde
> > > > Computation Institute, University of Chicago
> > > > Mathematics and Computer Science Division
> > > > Argonne National Laboratory
> > > >
> > 
> 
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From matthew.woitaszek at gmail.com  Fri Aug 27 14:10:54 2010
From: matthew.woitaszek at gmail.com (Matthew Woitaszek)
Date: Fri, 27 Aug 2010 13:10:54 -0600
Subject: [Swift-user] Deleting no longer necessary anonymous files in
	_concurrent
Message-ID: <AANLkTinZSyRjt=FGxmXpgOfOQe1CNQo0P3vvwEEGfpcf@mail.gmail.com>

Good afternoon,

I'm working with a script that creates arrays of intermediate files
using the anonymous concurrent mapper, such as:

  file wgt_file[];

As I expect, all of these files get generated in the remote swift
temporary directory and are then returned to the _concurrent directory
on the host executing Swift. However, in this particular application,
they're then immediately consumed by a subsequent procedure and never
needed again.

Is there a way to configure Swift or the file mapper declaration to
delete these files after the remaining script "consumes" them? (That
is, after all procedures relying on them as inputs have been
executed?) Or can (should?) that be done manually?

More speculatively, is there a way to keep files like these on the
execution host and not even bring them back to _concurrent? (With loss
of generality, I'm executing on a single site, and don't really ever
need the file locally, for restarts or staging to another site.)

Any advice about managing copies of large intermediate data files in
the Swift execution context would be appreciated!

Matthew


From wozniak at mcs.anl.gov  Mon Aug 30 16:54:29 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 30 Aug 2010 16:54:29 -0500 (CDT)
Subject: [Swift-user] Deleting no longer necessary anonymous files in
	_concurrent
In-Reply-To: <AANLkTinZSyRjt=FGxmXpgOfOQe1CNQo0P3vvwEEGfpcf@mail.gmail.com>
References: <AANLkTinZSyRjt=FGxmXpgOfOQe1CNQo0P3vvwEEGfpcf@mail.gmail.com>
Message-ID: <alpine.DEB.2.00.1008271502220.3319@wozniak-desktop.mcs.anl.gov>

Hi Matthew
 	Deleting files is out of the scope of the Swift language.  You can 
of course remove them yourself in your scripts, and as long as Swift does 
not try to stage them out you should be fine.
 	You may want to look at external variables as another way to 
approach this (manual 2.5).  Using external variables you can manage the 
files in your scripts while maintaining the Swift progress model.
 	Justin

On Fri, 27 Aug 2010, Matthew Woitaszek wrote:
> Good afternoon,
>
> I'm working with a script that creates arrays of intermediate files
> using the anonymous concurrent mapper, such as:
>
>  file wgt_file[];
>
> As I expect, all of these files get generated in the remote swift
> temporary directory and are then returned to the _concurrent directory
> on the host executing Swift. However, in this particular application,
> they're then immediately consumed by a subsequent procedure and never
> needed again.
>
> Is there a way to configure Swift or the file mapper declaration to
> delete these files after the remaining script "consumes" them? (That
> is, after all procedures relying on them as inputs have been
> executed?) Or can (should?) that be done manually?
>
> More speculatively, is there a way to keep files like these on the
> execution host and not even bring them back to _concurrent? (With loss
> of generality, I'm executing on a single site, and don't really ever
> need the file locally, for restarts or staging to another site.)
>
> Any advice about managing copies of large intermediate data files in
> the Swift execution context would be appreciated!
>
> Matthew
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>

-- 
Justin M Wozniak


From hockyg at gmail.com  Thu Aug 26 23:36:23 2010
From: hockyg at gmail.com (Glen Hocky)
Date: Fri, 27 Aug 2010 04:36:23 -0000
Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question
In-Reply-To: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
References: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
Message-ID: <-3390454574925164218@unknownmsgid>

Yes nominally the same error but it's not at the beginning but in the
middle now for some reason. I think it's a mid-stated error message.
I'll attach the log soon

On Aug 27, 2010, at 12:11 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:

> Glen, I wonder if whats happening here is that Swift will retry and lazily run past *job* errors, but the error below (a mapping error) is maybe being treated as an error in Swift's interpretation of the script itself, and this causes an immediate halt to execution?
>
> Can anyone confirm that this is whats happening, and if it is the expected behavior?
>
> Also, Glen, 2 questions:
>
> 1) Isn't the error below the one that was fixed by Mihael in a recent revision - the same one I looked at earlier in the week?
>
> 2) Do you know what errors the "Failed but can retry:8" message is referring to?
>
> Where is the log/run directory for this run?  How long did it take to get the 589 jobs finished?  It would be good to start plotting these large multi-site runs to get a sense of how the scheduler is doing.
>
> - Mike
>
>
> ----- "Glen Hocky" <hockyg at uchicago.edu> wrote:
>
>> here's the result of my 13 site run that ran while i was out this
>> evening. It did pretty well!
>> but seems to have that problem of not quite lazy errors
>> ........
>> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3
>> Stage out:1 Finished successfully:586
>> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4
>> Stage out:2 Finished successfully:587
>> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished
>> successfully:587 Failed but can retry:6
>> Progress: Submitting:3 Submitted:262 Active:140 Finished
>> successfully:589 Failed but can retry:8
>> Failed to transfer wrapper log from
>> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on
>> UCHC_CBG_vdgateway.vcell.uchc.edu
>> Execution failed:
>> org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile)
>> for org.griphyn.vdl.mapping.DataNode identifier
>> tag:benc at ci.uchicago.edu
>> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut
>> with no value at dataset=modelOut path=[3][1][11] (not closed)
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>