From jon.monette at gmail.com Tue Aug 3 21:12:52 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Tue, 03 Aug 2010 21:12:52 -0500
Subject: [Swift-user] Montage wrapper error
Message-ID: <4C58CCA4.7030805@gmail.com>
Hello,
Has anyone ever run into this error?
Failed to transfer wrapper log from
m101_montage-20100803-2101-4ihqvdv9/info/s on teraport
Execution failed:
Exception in mProjectPP:
Arguments: [-X, raw_dir/2mass-atlas-990524n-j0320044.fits,
proj_dir/proj_2mass-atlas-990524n-j0320044.fits, template.hdr]
Host: teraport
Directory: m101_montage-20100803-2101-4ihqvdv9/jobs/s/mProjectPP-sz57orvj
stderr.txt:
stdout.txt:
----
Caused by:
org.globus.cog.abstraction.impl.file.IrrecoverableResourceException:
Cannot determine the existence of the file
Caused by:
The connection has been closed [Unnamed Channel]
Cleaning up...
Shutting down service at https://128.135.125.117:52276
Got channel MetaChannel: 1867624887[1293086287: {}] ->
GSSSChannel-11234326669(1)[1293086287: {}]
+ Done
I am testing my wrappers to Montage on a larger scale and I keep getting
this error. There are about 640 images, but it only projects about 142 of
them before this error pops up. If it will help, my run exists at
"/home/jonmon/Workspace/Swift/Montage/m101_j_4x4/runs/m101_montage_Aug-03-2010_21-01-09"
on the CI machines. Any help is much appreciated.
--
Jon
Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein
From hategan at mcs.anl.gov Tue Aug 3 22:40:17 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Aug 2010 22:40:17 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <4C58CCA4.7030805@gmail.com>
References: <4C58CCA4.7030805@gmail.com>
Message-ID: <1280893217.28553.7.camel@blabla2.none>
On the remote site there should be something called
~/.globus/coasters/coasters.log. It tends to contain useful information.
As usual, the swift log also tends to contain useful information.
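[Editor's note: a quick way to skim both logs for failures is sketched below. The paths and the run id are assumptions based on the defaults described above (the run id is just this thread's example), not verified settings for this install.]

```shell
#!/bin/sh
# Sketch: skim the coaster service log (on the remote site) and the Swift
# client log (on the submit host) for common failure markers.
# Paths and run id are assumptions, not verified for this install.
COASTER_LOG="$HOME/.globus/coasters/coasters.log"
SWIFT_LOG="m101_montage-20100803-2101-4ihqvdv9.log"

for f in "$COASTER_LOG" "$SWIFT_LOG"; do
    if [ -f "$f" ]; then
        echo "== $f =="
        # case-insensitive match on common failure markers, last 20 hits
        grep -in -e 'exception' -e 'error' "$f" | tail -n 20
    fi
done
```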
However, Mike has mentioned some problems when using the coaster
filesystem provider; it may have broken during the effort to implement
provider staging for coasters. Is that what you are using? (If so, please
post your sites.xml.)
Mihael
From jon.monette at gmail.com Tue Aug 3 22:49:56 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Tue, 03 Aug 2010 22:49:56 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <1280893492.28553.13.camel@blabla2.none>
References: <4C58CCA4.7030805@gmail.com>
<1280893217.28553.7.camel@blabla2.none>
<4C58E15D.3040705@gmail.com>
<1280893492.28553.13.camel@blabla2.none>
Message-ID: <4C58E364.7040207@gmail.com>
To use the coaster filesystem, should I use local:pbs in the jobmanager?
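[Editor's note: for reference, a sites.xml entry using the coaster filesystem provider generally follows the sketch below. The hostname, profile keys, and values are illustrative placeholders, not the actual teraport settings.]

```xml
<pool handle="teraport">
  <!-- file operations go over the coaster channel; "local:pbs" means the
       coaster service runs on the remote head node and submits to PBS
       locally, while jobs are still started via ssh:pbs -->
  <filesystem provider="coaster" url="remote.host.example" jobmanager="local:pbs"/>
  <execution provider="coaster" url="remote.host.example" jobmanager="ssh:pbs"/>
  <profile namespace="globus" key="workersPerNode">8</profile>
  <profile namespace="globus" key="queue">short</profile>
  <profile namespace="karajan" key="jobThrottle">0.7</profile>
  <workdirectory>/path/to/work/teraport</workdirectory>
</pool>
```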
On 8/3/10 10:44 PM, Mihael Hategan wrote:
> Ok. Maybe not. Looks like an SSH issue. Can you try the coaster fs
> provider instead?
>
> On Tue, 2010-08-03 at 22:41 -0500, Jonathan Monette wrote:
>
>>
>>
>>
>> /home/jonmon/Library/Swift/work/localhost
>> .05
>>
>>
>>
>> > jobmanager="ssh:pbs" />
>> 3000
>> 8
>> 1
>> 1
>> 10
>> short
>> 0.7
>> 10000
>>
>> /home/jonmon/Library/swift/work/teraport
>>
>>
>> This is my sites file
>>
--
Jon
From hategan at mcs.anl.gov Tue Aug 3 22:55:01 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Aug 2010 22:55:01 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <4C58E364.7040207@gmail.com>
References: <4C58CCA4.7030805@gmail.com>
<1280893217.28553.7.camel@blabla2.none> <4C58E15D.3040705@gmail.com>
<1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com>
Message-ID: <1280894101.28882.1.camel@blabla2.none>
On Tue, 2010-08-03 at 22:49 -0500, Jonathan Monette wrote:
> to use the coaster filesystem i should use local:pbs in the jobmanager?
From jon.monette at gmail.com Tue Aug 3 22:59:53 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Tue, 03 Aug 2010 22:59:53 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <1280894101.28882.1.camel@blabla2.none>
References: <4C58CCA4.7030805@gmail.com>
<1280893217.28553.7.camel@blabla2.none>
<4C58E15D.3040705@gmail.com>
<1280893492.28553.13.camel@blabla2.none>
<4C58E364.7040207@gmail.com>
<1280894101.28882.1.camel@blabla2.none>
Message-ID: <4C58E5B9.80607@gmail.com>
jobmanager="ssh:pbs" />
3000
8
1
1
10
short
0.7
10000
/home/jonmon/Library/swift/work/teraport
Here is the new sites entry. I tried running my code and got this error:
class
org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws
exception in doStuff. Fix it!
java.lang.NullPointerException
at
org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79)
at
org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77)
at
org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36)
at
org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122)
--
Jon
From hategan at mcs.anl.gov Tue Aug 3 23:04:19 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Aug 2010 23:04:19 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <4C58E5B9.80607@gmail.com>
References: <4C58CCA4.7030805@gmail.com>
<1280893217.28553.7.camel@blabla2.none> <4C58E15D.3040705@gmail.com>
<1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com>
<1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com>
Message-ID: <1280894659.29101.2.camel@blabla2.none>
Eh. I'll see if I can fix those.
Though if you are using coasters, and given that I do think I fixed the
existing issues, there is one more thing you could try:
in swift.properties, at the end, say use.provider.staging=true
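[Editor's note: concretely, the tail of swift.properties would gain a single line; this is only a sketch of the change being suggested.]

```properties
# enable coaster provider staging, as suggested above
use.provider.staging=true
```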
Mihael
From jon.monette at gmail.com Tue Aug 3 23:15:52 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Tue, 03 Aug 2010 23:15:52 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <1280894659.29101.2.camel@blabla2.none>
References: <4C58CCA4.7030805@gmail.com>
<1280893217.28553.7.camel@blabla2.none>
<4C58E15D.3040705@gmail.com>
<1280893492.28553.13.camel@blabla2.none>
<4C58E364.7040207@gmail.com>
<1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com>
<1280894659.29101.2.camel@blabla2.none>
Message-ID: <4C58E978.5080208@gmail.com>
With use.provider.staging=true I get:
Execution failed:
Exception in mProjectPP:
Arguments: [-X, raw_dir/2mass-atlas-990214n-j1210091.fits,
proj_dir/proj_2mass-atlas-990214n-j1210091.fits, template.hdr]
Host: teraport
Directory:
m101_montage-20100803-2312-ex8oc5ff/jobs/y/mProjectPP-yt0gtrvjTODO: outs
----
Caused by:
Job failed with an exit code of 254
and with that line commented out, I get a script printed to the screen
and an error saying it doesn't know what #!/BIN/BASH is:
Execution failed:
Could not initialize shared directory on teraport
Caused by:
org.globus.cog.abstraction.impl.file.FileResourceException:
org.globus.cog.karajan.workflow.service.ProtocolException: Unknown
command: #!/BIN/BASH
--
Jon
From hategan at mcs.anl.gov Tue Aug 3 23:20:31 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Aug 2010 23:20:31 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <4C58E978.5080208@gmail.com>
References: <4C58CCA4.7030805@gmail.com>
<1280893217.28553.7.camel@blabla2.none> <4C58E15D.3040705@gmail.com>
<1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com>
<1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com>
<1280894659.29101.2.camel@blabla2.none> <4C58E978.5080208@gmail.com>
Message-ID: <1280895631.29340.0.camel@blabla2.none>
Ok. I'll need to take a look at these.
From jon.monette at gmail.com Tue Aug 3 23:29:22 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Tue, 03 Aug 2010 23:29:22 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <1280895631.29340.0.camel@blabla2.none>
References: <4C58CCA4.7030805@gmail.com>
<1280893217.28553.7.camel@blabla2.none>
<4C58E15D.3040705@gmail.com>
<1280893492.28553.13.camel@blabla2.none>
<4C58E364.7040207@gmail.com>
<1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com>
<1280894659.29101.2.camel@blabla2.none>
<4C58E978.5080208@gmail.com>
<1280895631.29340.0.camel@blabla2.none>
Message-ID: <4C58ECA2.1010305@gmail.com>
Ok. Let me know if you need more of my input files or configurations.
--
Jon
From hategan at mcs.anl.gov Tue Aug 3 23:33:55 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Aug 2010 23:33:55 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <4C58ECA2.1010305@gmail.com>
References: <4C58CCA4.7030805@gmail.com>
<1280893217.28553.7.camel@blabla2.none> <4C58E15D.3040705@gmail.com>
<1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com>
<1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com>
<1280894659.29101.2.camel@blabla2.none> <4C58E978.5080208@gmail.com>
<1280895631.29340.0.camel@blabla2.none> <4C58ECA2.1010305@gmail.com>
Message-ID: <1280896435.29512.1.camel@blabla2.none>
It may be useful for reproducing it quickly. You already know what I
need: config files, input files, table files, and scripts if they
changed.
Mihael
From jon.monette at gmail.com Tue Aug 3 23:40:37 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Tue, 03 Aug 2010 23:40:37 -0500
Subject: [Swift-user] Montage wrapper error
In-Reply-To: <1280896435.29512.1.camel@blabla2.none>
References: <4C58CCA4.7030805@gmail.com>
<1280893217.28553.7.camel@blabla2.none>
<4C58E15D.3040705@gmail.com>
<1280893492.28553.13.camel@blabla2.none>
<4C58E364.7040207@gmail.com>
<1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com>
<1280894659.29101.2.camel@blabla2.none>
<4C58E978.5080208@gmail.com>
<1280895631.29340.0.camel@blabla2.none>
<4C58ECA2.1010305@gmail.com>
<1280896435.29512.1.camel@blabla2.none>
Message-ID: <4C58EF45.3010506@gmail.com>
All files used by and created during this run (log files and such) are in
$HOME/Workspace/Swift/Montage/m101_j_4x4/runs on the CI machines. If you
would like, I can tar up one of the runs; I'm not sure which you would
prefer.
--
Jon
From iraicu at cs.uchicago.edu Wed Aug 4 03:15:03 2010
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 04 Aug 2010 03:15:03 -0500
Subject: [Swift-user] CFP: The 3rd IEEE Workshop on Many-Task Computing on
Grids and
Supercomputers (MTAGS) 2010, co-located with Supercomputing 2010 -- November
15th, 2010
Message-ID: <4C592187.6050702@cs.uchicago.edu>
Call for Papers
------------------------------------------------------------------------------------------------
The 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2010
http://dsl.cs.uchicago.edu/MTAGS10/
------------------------------------------------------------------------------------------------
November 15th, 2010
New Orleans, Louisiana, USA
Co-located with the IEEE/ACM International Conference for
High Performance Computing, Networking, Storage and Analysis (SC10)
================================================================================================
The 3rd workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) will provide
the scientific community a dedicated forum for presenting new research, development, and
deployment efforts of large-scale many-task computing (MTC) applications on large scale
clusters, Grids, Supercomputers, and Cloud Computing infrastructure. MTC, the theme of
the workshop, encompasses loosely coupled applications, which are generally composed of
many tasks (both independent and dependent) that together achieve some larger application
goal. This workshop will cover challenges that can hamper efficiency and utilization in
running applications on large-scale systems, such as local resource manager scalability
and granularity, efficient utilization of raw hardware, parallel file system contention
and scalability, data management, I/O management, reliability at scale, and application
scalability. We welcome paper submissions on all topics related to MTC on large scale
systems. Papers will be peer-reviewed, and accepted papers will be published in the
workshop proceedings as part of the IEEE digital library (pending approval). The workshop
will be co-located with the IEEE/ACM Supercomputing 2010 Conference in New Orleans,
Louisiana, on November 15th, 2010. For more information, please see
http://dsl.cs.uchicago.edu/MTAGS10/.
Topics
------------------------------------------------------------------------------------------------
We invite the submission of original work that is related to the topics below. The papers can be
either short (5 pages) position papers, or long (10 pages) research papers. Topics of interest
include (in the context of Many-Task Computing):
* Compute Resource Management
* Scheduling
* Job execution frameworks
* Local resource manager extensions
* Performance evaluation of resource managers in use on large scale systems
* Dynamic resource provisioning
* Techniques to manage many-core resources and/or GPUs
* Challenges and opportunities in running many-task workloads on HPC systems
* Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure
* Storage architectures and implementations
* Distributed file systems
* Parallel file systems
* Distributed meta-data management
* Content distribution systems for large data
* Data caching frameworks and techniques
* Data management within and across data centers
* Data-aware scheduling
* Data-intensive computing applications
* Eventual-consistency storage usage and management
* Programming models and tools
* Map-reduce and its generalizations
* Many-task computing middleware and applications
* Parallel programming frameworks
* Ensemble MPI techniques and frameworks
* Service-oriented science applications
* Large-Scale Workflow Systems
* Workflow system performance and scalability analysis
* Scalability of workflow systems
* Workflow infrastructure and e-Science middleware
* Programming Paradigms and Models
* Large-Scale Many-Task Applications
* High-throughput computing (HTC) applications
* Data-intensive applications
* Quasi-supercomputing applications, deployments, and experiences
* Performance Evaluation
* Performance evaluation
* Real systems
* Simulations
* Reliability of large systems
Paper Submission and Publication
------------------------------------------------------------------------------------------------
Authors are invited to submit papers with unpublished, original work of not more than 10 pages
of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per IEEE
8.5 x 11 manuscript guidelines; document templates can be found at
ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct8.5x11.pdf and
ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct8.5x11.doc. We are also seeking
position papers of no more than 5 pages in length. A 250 word abstract (PDF format) must be
submitted online at https://cmt.research.microsoft.com/MTAGS2010/ before the deadline of August
25th, 2010 at 11:59PM PST; the final 5/10 page papers in PDF format will be due on September 1st,
2010 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will be published in the
workshop proceedings as part of the IEEE digital library (pending approval). Notifications of the
paper decisions will be sent out by October 1st, 2010. Selected excellent work may be eligible
for additional post-conference publication as journal articles or book chapters; see this year's
ongoing special issue in the IEEE Transactions on Parallel and Distributed Systems (TPDS) at
http://dsl.cs.uchicago.edu/TPDS_MTC/. Submission implies the willingness of at least one of the
authors to register and present the paper. For more information, please visit
http://dsl.cs.uchicago.edu/MTAGS10/.
Important Dates
------------------------------------------------------------------------------------------------
* Abstract Due: August 25th, 2010
* Papers Due: September 1st, 2010
* Notification of Acceptance: October 1st, 2010
* Camera Ready Papers Due: November 1st, 2010
* Workshop Date: November 15th, 2010
Committee Members
------------------------------------------------------------------------------------------------
Workshop Chairs
* Ioan Raicu, Illinois Institute of Technology
* Ian Foster, University of Chicago & Argonne National Laboratory
* Yong Zhao, Microsoft
Steering Committee
* David Abramson, Monash University, Australia
* Alok Choudhary, Northwestern University, USA
* Jack Dongarra, University of Tennessee, USA
* Geoffrey Fox, Indiana University, USA
* Robert Grossman, University of Illinois at Chicago, USA
* Arthur Maccabe, Oak Ridge National Labs, USA
* Dan Reed, Microsoft Research, USA
* Marc Snir, University of Illinois at Urbana Champaign, USA
* Xian-He Sun, Illinois Institute of Technology, USA
* Manish Parashar, Rutgers University, USA
Technical Committee
* Roger Barga, Microsoft Research, USA
* Mihai Budiu, Microsoft Research, USA
* Rajkumar Buyya, University of Melbourne, Australia
* Henri Casanova, University of Hawaii at Manoa, USA
* Jeff Chase, Duke University, USA
* Peter Dinda, Northwestern University, USA
* Catalin Dumitrescu, Fermi National Labs, USA
* Constantinos Evangelinos, Massachusetts Institute of Technology, USA
* Indranil Gupta, University of Illinois at Urbana Champaign, USA
* Alexandru Iosup, Delft University of Technology, Netherlands
* Florin Isaila, Universidad Carlos III de Madrid, Spain
* Michael Isard, Microsoft Research, USA
* Kamil Iskra, Argonne National Laboratory, USA
* Daniel Katz, University of Chicago, USA
* Tevfik Kosar, Louisiana State University, USA
* Zhiling Lan, Illinois Institute of Technology, USA
* Ignacio Llorente, Universidad Complutense de Madrid, Spain
* Reagan Moore, University of North Carolina, Chapel Hill, USA
* Jose Moreira, IBM Research, USA
* Marlon Pierce, Indiana University, USA
* Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory, USA
* Matei Ripeanu, University of British Columbia, Canada
* Alain Roy, University of Wisconsin Madison, USA
* Edward Walker, Texas Advanced Computing Center, USA
* Mike Wilde, University of Chicago & Argonne National Laboratory, USA
* Matthew Woitaszek, The University Corporation for Atmospheric Research, USA
* Justin Wozniak, Argonne National Laboratory, USA
* Ken Yocum, University of California San Diego, USA
--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor
=================================================================
Computer Science Department
Illinois Institute of Technology
10 W. 31st Street
Stuart Building, Room 237D
Chicago, IL 60616
=================================================================
Cell: 1-847-722-0876
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
=================================================================
=================================================================
From jon.monette at gmail.com Thu Aug 5 13:36:52 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Thu, 05 Aug 2010 13:36:52 -0500
Subject: [Swift-user] Pads submit problem
Message-ID: <4C5B04C4.9030508@gmail.com>
Has anyone ever seen this error and know what causes it?
Caused by:
Exitcode file not found 5 queue polls after the job was reported done
--
Jon
Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein
From jon.monette at gmail.com Thu Aug 5 14:11:23 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Thu, 05 Aug 2010 14:11:23 -0500
Subject: [Swift-user] Coaster error
Message-ID: <4C5B0CDB.1010404@gmail.com>
Hello again,
Also has anyone seen this error and know what it means?
Worker task failed:
org.globus.cog.abstraction.impl.common.execution.JobException: Job
failed with an exit code of 29
at
org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.processCompleted(AbstractJobSubmissionTaskHandler.java:95)
at
org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.processCompleted(AbstractExecutor.java:240)
at
org.globus.cog.abstraction.impl.scheduler.common.Job.processExitCode(Job.java:104)
at
org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:81)
at
org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177)
at
org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126)
at
org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169)
at
org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82)
at java.lang.Thread.run(Thread.java:619)
--
Jon
Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein
From hategan at mcs.anl.gov Thu Aug 5 15:33:45 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 05 Aug 2010 15:33:45 -0500
Subject: [Swift-user] Coaster error
In-Reply-To: <4C5B0CDB.1010404@gmail.com>
References: <4C5B0CDB.1010404@gmail.com>
Message-ID: <1281040425.8893.3.camel@blabla2.none>
It means that qstat failed for some reason.
Can you reproduce it?
On Thu, 2010-08-05 at 14:11 -0500, Jonathan Monette wrote:
> Hello again,
> Also has anyone seen this error and know what it means?
>
> Worker task failed:
> org.globus.cog.abstraction.impl.common.execution.JobException: Job
> failed with an exit code of 29
> at
> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.processCompleted(AbstractJobSubmissionTaskHandler.java:95)
> at
> org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.processCompleted(AbstractExecutor.java:240)
> at
> org.globus.cog.abstraction.impl.scheduler.common.Job.processExitCode(Job.java:104)
> at
> org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:81)
> at
> org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177)
> at
> org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126)
> at
> org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169)
> at
> org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82)
> at java.lang.Thread.run(Thread.java:619)
>
From hategan at mcs.anl.gov Thu Aug 5 15:34:58 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 05 Aug 2010 15:34:58 -0500
Subject: [Swift-user] Coaster error
In-Reply-To: <1281040425.8893.3.camel@blabla2.none>
References: <4C5B0CDB.1010404@gmail.com> <1281040425.8893.3.camel@blabla2.none>
Message-ID: <1281040498.8893.4.camel@blabla2.none>
Ignore the message I just sent.
On Thu, 2010-08-05 at 15:33 -0500, Mihael Hategan wrote:
> It means that qstat failed for some reason.
>
> Can you reproduce it?
>
> On Thu, 2010-08-05 at 14:11 -0500, Jonathan Monette wrote:
> > Hello again,
> > Also has anyone seen this error and know what it means?
> >
> > Worker task failed:
> > org.globus.cog.abstraction.impl.common.execution.JobException: Job
> > failed with an exit code of 29
> > at
> > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.processCompleted(AbstractJobSubmissionTaskHandler.java:95)
> > at
> > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.processCompleted(AbstractExecutor.java:240)
> > at
> > org.globus.cog.abstraction.impl.scheduler.common.Job.processExitCode(Job.java:104)
> > at
> > org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:81)
> > at
> > org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177)
> > at
> > org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126)
> > at
> > org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169)
> > at
> > org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82)
> > at java.lang.Thread.run(Thread.java:619)
> >
>
From iraicu at cs.uchicago.edu Sat Aug 7 14:28:18 2010
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sat, 07 Aug 2010 14:28:18 -0500
Subject: [Swift-user] CFP: The 5th Workshop on Workflows in Support of
Large-Scale Science 2010
Message-ID: <4C5DB3D2.2040703@cs.uchicago.edu>
Call for Papers
The 5th Workshop on Workflows in Support of Large-Scale Science
in conjunction with SC'10
New Orleans, LA
November 14, 2010
http://www.isi.edu/works10
Scientific workflows are a key technology that enables large-scale computations
and service management on distributed resources. Workflows enable scientists to
design complex analyses composed of individual application components or
services; often such components and services are designed, developed, and
tested collaboratively.
The size of the data and the complexity of the analysis often lead to heavy use of
shared resources, such as clusters and storage systems, to store the data sets and
execute the workflows. The process of workflow design
and execution in a distributed environment can be very complex and can involve
multiple stages including their textual or graphical specification, the mapping
of the high-level workflow descriptions onto the available resources, as well as
monitoring and debugging of the subsequent execution. Further, since
computations and data access operations are performed on shared resources, there
is an increased interest in managing the fair allocation and management of those
resources at the workflow level.
Large-scale scientific applications pose several requirements on the workflow
systems. Besides the magnitude of data processed by the workflow components,
the intermediate and resulting data needs to be annotated with provenance and
other information to evaluate the quality of the data and support the
repeatability of the analysis. Further, adequate workflow descriptions are
needed to support the complex workflow management process which includes workflow
creation, workflow reuse, and modifications made to the workflow over time, for
example modifications to the individual workflow components. Additional workflow
annotations may provide guidelines and requirements for resource mapping and
execution.
The Fifth Workshop on Workflows in Support of Large-Scale Science focuses on the
entire workflow lifecycle including the workflow composition, mapping, robust
execution and the recording of provenance information. The workshop also welcomes
contributions in the applications area, where the requirements on the workflow
management systems can be derived. Special attention will be paid to Bio-Computing
applications, which are the theme for SC10. The topics of the workshop include but
are not limited to:
* Workflow applications and their requirements with special emphasis on
Bio-Computing applications.
* Workflow composition, tools and languages.
* Workflow user environments, including portals.
* Workflow refinement tools that can manage the workflow mapping process.
* Workflow execution in distributed environments.
* Workflow fault-tolerance and recovery techniques.
* Data-driven workflow processing.
* Adaptive workflows.
* Workflow monitoring.
* Workflow optimizations.
* Performance analysis of workflows
* Workflow debugging.
* Workflow provenance.
* Interactive workflows.
* Workflow interoperability
* Mashups and workflows
* Workflows on the cloud.
Important Dates:
Papers due September 3, 2010
Notifications of acceptance September 30, 2010
Final papers due October 8, 2010
We will accept both short (6-page) and long (10-page) papers. The papers should
be in IEEE format. To submit the papers, please submit to EasyChair at
http://www.easychair.org/conferences/?conf=works10
If you have questions, please email works10 at isi.edu
--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor
=================================================================
Computer Science Department
Illinois Institute of Technology
10 W. 31st Street
Stuart Building, Room 237D
Chicago, IL 60616
=================================================================
Cell: 1-847-722-0876
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
=================================================================
=================================================================
From iraicu at cs.uchicago.edu Tue Aug 10 20:03:02 2010
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 10 Aug 2010 20:03:02 -0500
Subject: [Swift-user] CFP: The 20th International ACM Symposium on
High-Performance Parallel and Distributed Computing (HPDC) 2011
Message-ID: <4C61F6C6.3020203@cs.uchicago.edu>
Call For Papers
The 20th International ACM Symposium on
High-Performance Parallel and Distributed Computing
http://www.hpdc.org/2011/
San Jose, California, June 8-11, 2011
The ACM International Symposium on High-Performance Parallel and Distributed
Computing is the premier conference for presenting the latest research on the
design, implementation, evaluation, and use of parallel and distributed systems
for high end computing. The 20th installment of HPDC will take place in
San Jose, California, in the heart of Silicon Valley. This year, HPDC is
affiliated with the ACM Federated Computing Research Conference, consisting of
fifteen leading ACM conferences all in one week. HPDC will be held on June 9-11
(Thursday through Saturday) with affiliated workshops taking place on June 8th
(Wednesday).
Submissions are welcomed on all forms of high performance parallel and
distributed computing, including but not limited to clusters, clouds, grids,
utility computing, data-intensive computing, multicore and parallel computing.
All papers will be reviewed by a distinguished program committee, with a strong
preference for rigorous results obtained in operational parallel and distributed
systems. All papers will be evaluated for correctness, originality, potential
impact, quality of presentation, and interest and relevance to the conference.
In addition to traditional technical papers, we also invite experience papers.
Such papers should present operational details of a production high end system
or application, and draw out conclusions gained from operating the system or
application. The evaluation of experience papers will place a greater weight on
the real-world impact of the system and the value of conclusions to future
system designs.
Topics of interest include, but are not limited to:
-------------------------------------------------------------------------------
# Applications of parallel and distributed computing.
# Systems, networks, and architectures for high end computing.
# Parallel and multicore issues and opportunities.
# Virtualization of machines, networks, and storage.
# Programming languages and environments.
# I/O, file systems, and data management.
# Data intensive computing.
# Resource management, scheduling, and load-balancing.
# Performance modeling, simulation, and prediction.
# Fault tolerance, reliability and availability.
# Security, configuration, policy, and management issues.
# Models and use cases for utility, grid, and cloud computing.
Authors are invited to submit technical papers of at most 12 pages in PDF
format, including all figures and references. Papers should be formatted in the
ACM Proceedings Style and submitted via the conference web site. Accepted
papers will appear in the conference proceedings, and will be incorporated into
the ACM Digital Library.
Papers must be self-contained and provide the technical substance required for
the program committee to evaluate the paper's contribution. Papers should
thoughtfully address all related work, particularly work presented at previous
HPDC events. Submitted papers must be original work that has not appeared in
and is not under consideration for another conference or a journal. See the ACM
Prior Publication Policy for more details.
Workshops
-------------------------------------------------------------------------------
We invite proposals for workshops affiliated with HPDC to be held on Wednesday,
June 8th. For more information, see the Call for Workshops at
http://www.hpdc.org/2011/cfw.php.
Important Dates
-------------------------------------------------------------------------------
Workshop Proposals Due 1 October 2010
Technical Papers Due: 17 January 2011
PAPER DEADLINE EXTENDED: 24 January 2011 (No further extensions!)
Author Notifications: 28 February 2011
Final Papers Due: 24 March 2011
Conference Dates: 8-11 June 2011
Organization
-------------------------------------------------------------------------------
General Chair
Barney Maccabe, Oak Ridge National Laboratory
Program Chair
Douglas Thain, University of Notre Dame
Workshops Chair
Mike Lewis, Binghamton University
Local Arrangements Chair
Nick Wright, Lawrence Berkeley National Laboratory
Publicity Chairs
Alexandru Iosup, Delft University
John Lange, University of Pittsburgh
Ioan Raicu, Illinois Institute of Technology
Yong Zhao, Microsoft
Program Committee
Kento Aida, National Institute of Informatics
Henri Bal, Vrije Universiteit
Roger Barga, Microsoft
Jim Basney, NCSA
John Bent, Los Alamos National Laboratory
Ron Brightwell, Sandia National Laboratories
Shawn Brown, Pittsburgh Supercomputer Center
Claris Castillo, IBM
Andrew A. Chien, UC San Diego and SDSC
Ewa Deelman, USC Information Sciences Institute
Peter Dinda, Northwestern University
Scott Emrich, University of Notre Dame
Dick Epema, TU-Delft
Gilles Fedak, INRIA
Renato Figueiredo, University of Florida
Ian Foster, University of Chicago and Argonne National Laboratory
Gabriele Garzoglio, Fermi National Accelerator Laboratory
Rong Ge, Marquette University
Sebastien Goasguen, Clemson University
Kartik Gopalan, Binghamton University
Dean Hildebrand, IBM Almaden
Adriana Iamnitchi, University of South Florida
Alexandru Iosup, TU-Delft
Keith Jackson, Lawrence Berkeley
Shantenu Jha, Louisiana State University
Daniel S. Katz, University of Chicago and Argonne National Laboratory
Thilo Kielmann, Vrije Universiteit
Charles Killian, Purdue University
Tevfik Kosar, Louisiana State University
John Lange, University of Pittsburgh
Mike Lewis, Binghamton University
Barney Maccabe, Oak Ridge National Laboratory
Grzegorz Malewicz, Google
Satoshi Matsuoka, Tokyo Institute of Technology
Jarek Nabrzyski, University of Notre Dame
Manish Parashar, Rutgers University
Beth Plale, Indiana University
Ioan Raicu, Illinois Institute of Technology
Philip Rhodes, University of Mississippi
Philip Roth, Oak Ridge National Laboratory
Karsten Schwan, Georgia Tech
Martin Swany, University of Delaware
Jon Weissman, University of Minnesota
Dongyan Xu, Purdue University
Ken Yocum, UCSD
Yong Zhao, Microsoft
Steering Committee
Henri Bal, Vrije Universiteit
Andrew A. Chien, UC San Diego and SDSC
Peter Dinda, Northwestern University
Ian Foster, Argonne National Laboratory and University of Chicago
Dennis Gannon, Microsoft
Salim Hariri, University of Arizona
Dieter Kranzlmueller, Ludwig-Maximilians-Univ. Muenchen
Manish Parashar, Rutgers University
Karsten Schwan, Georgia Tech
Jon Weissman, University of Minnesota (Chair)
--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor
=================================================================
Computer Science Department
Illinois Institute of Technology
10 W. 31st Street
Stuart Building, Room 237D
Chicago, IL 60616
=================================================================
Cell: 1-847-722-0876
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
=================================================================
=================================================================
From iraicu at cs.uchicago.edu Tue Aug 10 20:47:53 2010
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 10 Aug 2010 20:47:53 -0500
Subject: [Swift-user] Call for Workshops at ACM HPDC 2011
Message-ID: <4C620149.3060609@cs.uchicago.edu>
Call for Workshops
The 20th International ACM Symposium on
High-Performance Parallel and Distributed Computing
http://www.hpdc.org/2011/
San Jose, California, June 8-11, 2011
-------------------------------------------------------------------------------
The ACM Symposium on High Performance Distributed Computing (HPDC) conference
organizers invite proposals for Workshops to be held with HPDC in San Jose,
California in June 2011. Workshops will run on June 8, preceding the main
conference sessions June 9-11.
HPDC 2011 is the 20th anniversary of HPDC, a preeminent conference in high
performance computing, including cloud and grid computing. This year's
conference will be held in conjunction with the Federated Computing Research
Conference (FCRC), which includes high profile conferences in complementary
research areas, providing a unique opportunity for a broader technical audience
and wider impact for successful workshops.
Workshops provide forums for discussion among researchers and practitioners on
focused topics, emerging research areas, or both. Organizers may structure
workshops as they see fit, possibly including invited talks, panel discussions,
presentations of work in progress, fully peer-reviewed papers, or some
combination. Workshops could be scheduled for a half day or a full day,
depending on interest, space constraints, and organizer preference. Organizers
should design workshops for approximately 20-40 participants, to balance impact
and effective discussion.
A workshop proposal must be made in writing, sent to Mike Lewis at
mlewis at cs.binghamton.edu, and should include:
# The name of the workshop
# Several paragraphs describing the theme of the workshop and how it relates to
the HPDC conference
# Data about previous offerings of the workshop (if any), including attendance,
number of papers, or presentations submitted and accepted
# Names and affiliations of the workshop organizers, and if applicable, a
significant portion of the program committee
# A plan for attracting submissions and attendees
Due to publication deadlines, workshops must operate within roughly the following
timeline: papers due in early February (2-3 weeks after the HPDC deadline) and
selected and sent to the publisher by late February.
IMPORTANT DATES:
# Workshop Proposals Deadline: October 1, 2010
# Notification: October 25, 2010
# Workshop CFPs Online and Distributed: November 8, 2010
--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor
=================================================================
Computer Science Department
Illinois Institute of Technology
10 W. 31st Street
Stuart Building, Room 237D
Chicago, IL 60616
=================================================================
Cell: 1-847-722-0876
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
=================================================================
=================================================================
From dk0966 at cs.ship.edu Fri Aug 20 11:04:30 2010
From: dk0966 at cs.ship.edu (David Kelly)
Date: Fri, 20 Aug 2010 12:04:30 -0400
Subject: [Swift-user] Exitcode file not found
Message-ID:
Hello,
While running Mike's MODIS demo on PADS with pbs and coasters, I receive the
following error:
Worker task failed:
org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Exitcode
file not found 5 queue polls after the job was reported done
at
org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:66)
at
org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177)
at
org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126)
at
org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169)
at
org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82)
at java.lang.Thread.run(Thread.java:619)
I also receive errors relating to qdel:
Canceling job
Failed to shut down block
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Failed
to cancel task. qdel returned with an exit code of 1
at
org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:159)
at
org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
at
org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70)
at
org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:101)
at
org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:90)
at
org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:44)
at
org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:293)
at
org.globus.cog.abstraction.coaster.service.job.manager.Block.shutdown(Block.java:274)
at
org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.shutdownBlocks(BlockQueueProcessor.java:518)
at
org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.shutdown(BlockQueueProcessor.java:510)
at
org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.shutdown(JobQueue.java:108)
at
org.globus.cog.abstraction.coaster.service.CoasterService.shutdown(CoasterService.java:249)
at
org.globus.cog.abstraction.coaster.service.ServiceShutdownHandler.requestComplete(ServiceShutdownHandler.java:28)
at
org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84)
at
org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:387)
at
org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel.actualSend(AbstractPipedChannel.java:86)
at
org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:115)
Canceling job
Checking through the mailing list archives, I found an instance where this
was happening when the work directory was /var/tmp and not consistent across
all nodes. The work directory in my configuration is /home/davidk/swiftwork,
so I'm not sure what's causing it. Attached are the sites.xml, tc.data,
swift.properties and the script I'm using. The full log can be found in
/home/davidk/modis/run.0019.
Thanks,
David
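For context on what the poller is looking for: batch providers like this one typically have the submitted job record its exit status in a small file inside the work directory, which the submit side then polls for after qstat reports the job done. The sketch below is illustrative only (it is not Swift's actual wrapper code, and the file name is made up); it shows why that file has to live on a filesystem visible to both the compute node and the submit host, and why a node-local work directory such as /var/tmp breaks the pattern.

```shell
#!/bin/sh
# Illustrative sketch of the exit-code-file pattern; not Swift's real wrapper.
# WORKDIR stands in for the configured work directory.
WORKDIR=${WORKDIR:-$(mktemp -d)}

run_job() {
    "$@"                                 # run the actual application
    echo $? > "$WORKDIR/job.exitcode"    # record its exit status for the poller
}

# If WORKDIR were node-local (like /var/tmp), this file would exist only on
# the compute node and the submit host's poller would never find it.
run_job true
cat "$WORKDIR/job.exitcode"
```

If the shared /home/davidk/swiftwork really is visible from the compute nodes, the missing-exitcode symptom would have to come from something else, e.g. the job's output files landing late or the job never actually running.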
-------------- next part --------------
A non-text attachment was scrubbed...
Name: modis.swift
Type: application/octet-stream
Size: 1270 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sites.xml
Type: text/xml
Size: 730 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: swift.properties
Type: application/octet-stream
Size: 11600 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tc.data
Type: application/octet-stream
Size: 2026 bytes
Desc: not available
URL:
From dk0966 at cs.ship.edu Fri Aug 20 12:25:37 2010
From: dk0966 at cs.ship.edu (David Kelly)
Date: Fri, 20 Aug 2010 13:25:37 -0400
Subject: [Swift-user] Exitcode file not found
In-Reply-To: <4C6EA95F.1000106@gmail.com>
References:
<4C6EA95F.1000106@gmail.com>
Message-ID: <1282325137.2492.11.camel@hbk>
I added the internalhostname profile like you suggested, but for some
reason I am still getting the exitcode and qdel errors. Is there
anything else I'm missing? The updated info is
in /home/davidk/modis/run.0020. Thanks.
3500
8
4
4
32
fast
3
10000
192.5.86.5
/home/davidk/swiftwork
On Fri, 2010-08-20 at 11:12 -0500, Jonathan Monette wrote:
> You must set the internalhostname parameter for coasters.
>
> namespace="globus">192.5.86.6. This is when you are on the
> login2 node for PADS.
> namespace="globus">192.5.86.5. This is when you are on the
> login1 node for PADS.
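(The XML tags in the lines above were stripped by the archive. For readers hitting the same issue, here is a minimal sketch of where that profile entry goes in sites.xml; the pool handle, execution URL, and jobmanager below are illustrative placeholders, and only the internalHostname value and work directory mirror the values discussed in this thread.)

```xml
<pool handle="pads">
  <!-- illustrative execution element; the real URL/jobmanager depend on the site -->
  <execution provider="coaster" url="login1.pads.ci.uchicago.edu" jobmanager="local:pbs"/>
  <!-- 192.5.86.5 when submitting from login1, 192.5.86.6 from login2 -->
  <profile namespace="globus" key="internalHostname">192.5.86.5</profile>
  <workdirectory>/home/davidk/swiftwork</workdirectory>
</pool>
```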
From aespinosa at cs.uchicago.edu Mon Aug 23 14:41:35 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 23 Aug 2010 14:41:35 -0500
Subject: [Swift-user] coaster maxtime
Message-ID:
Hi,
Can someone remind me what is the units of the maxtime parameter for
coasters? Table 13 of the user guide does not specify it.
Thanks,
-Allan
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From wilde at mcs.anl.gov Mon Aug 23 14:44:15 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 23 Aug 2010 13:44:15 -0600 (GMT-06:00)
Subject: [Swift-user] coaster maxtime
In-Reply-To:
Message-ID: <409785339.1210161282592655201.JavaMail.root@zimbra.anl.gov>
seconds.
- Mike
----- "Allan Espinosa" wrote:
> Hi,
>
> Can someone remind me what is the units of the maxtime parameter for
> coasters? Table 13 of the user guide does not specify it.
>
> Thanks,
> -Allan
>
> --
> Allan M. Espinosa
> PhD student, Computer Science
> University of Chicago
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
From hategan at mcs.anl.gov Mon Aug 23 14:46:42 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 23 Aug 2010 14:46:42 -0500
Subject: [Swift-user] coaster maxtime
In-Reply-To:
References:
Message-ID: <1282592802.2046.0.camel@blabla2.none>
Seconds. This should be changed to be the same as a walltime spec.
On Mon, 2010-08-23 at 14:41 -0500, Allan Espinosa wrote:
> Hi,
>
> Can someone remind me what is the units of the maxtime parameter for
> coasters? Table 13 of the user guide does not specify it.
>
> Thanks,
> -Allan
>
From wilde at mcs.anl.gov Mon Aug 23 14:49:55 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 23 Aug 2010 13:49:55 -0600 (GMT-06:00)
Subject: [Swift-user] coaster maxtime
In-Reply-To: <1282592802.2046.0.camel@blabla2.none>
Message-ID: <571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov>
I agree. How about, for backwards compatibility, accepting an entry without ":"s as seconds? That seems a reasonable convention for all time fields:
hh:mm:ss, or nnnn for seconds.
- Mike
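[The convention proposed above could be parsed along these lines (a hedged sketch in Python; parse_time is a hypothetical helper for illustration, not Swift's actual implementation):

```python
def parse_time(spec):
    """Parse a time spec: 'hh:mm:ss' -> seconds; a bare 'nnnn' also means seconds.

    Illustrates the backwards-compatible convention proposed on-list.
    """
    if ":" in spec:
        parts = [int(p) for p in spec.split(":")]
        if len(parts) != 3:
            raise ValueError("expected hh:mm:ss, got %r" % spec)
        h, m, s = parts
        return h * 3600 + m * 60 + s
    # backwards compatibility: a plain number is taken as seconds
    return int(spec)
```
]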
----- "Mihael Hategan" wrote:
> Seconds. This should be changed to be the same as a walltime spec.
>
> On Mon, 2010-08-23 at 14:41 -0500, Allan Espinosa wrote:
> > Hi,
> >
> > Can someone remind me what is the units of the maxtime parameter
> for
> > coasters? Table 13 of the user guide does not specify it.
> >
> > Thanks,
> > -Allan
> >
>
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
From aespinosa at cs.uchicago.edu Mon Aug 23 14:54:31 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 23 Aug 2010 14:54:31 -0500
Subject: [Swift-user] coaster maxtime
In-Reply-To: <571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov>
References: <1282592802.2046.0.camel@blabla2.none>
<571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov>
Message-ID:
In other fields like "maxwalltime", a bare nnnn value is interpreted as minutes by default.
2010/8/23 Michael Wilde :
> I agree. How about for backwards compatibility, accept an entry without ":"s as seconds. That seems a reasonable convention for all time fields:
> hh:mm:ss or nnnn seconds.
>
> - Mike
>
> ----- "Mihael Hategan" wrote:
>
>> Seconds. This should be changed to be the same as a walltime spec.
>>
>> On Mon, 2010-08-23 at 14:41 -0500, Allan Espinosa wrote:
>> > Hi,
>> >
>> > Can someone remind me what is the units of the maxtime parameter
>> for
>> > coasters? ?Table 13 of the user guide does not specify it.
>> >
>> > Thanks,
>> > -Allan
From aespinosa at cs.uchicago.edu Mon Aug 23 14:55:05 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 23 Aug 2010 14:55:05 -0500
Subject: [Swift-user] coaster maxtime
In-Reply-To: <1282592802.2046.0.camel@blabla2.none>
References:
<1282592802.2046.0.camel@blabla2.none>
Message-ID:
Thanks guys. Looking at my old sites.xml files, they make sense now :)
-Allan
2010/8/23 Mihael Hategan :
> Seconds. This should be changed to be the same as a walltime spec.
>
> On Mon, 2010-08-23 at 14:41 -0500, Allan Espinosa wrote:
>> Hi,
>>
>> Can someone remind me what is the units of the maxtime parameter for
>> coasters? ?Table 13 of the user guide does not specify it.
>>
>> Thanks,
>> -Allan
From hategan at mcs.anl.gov Mon Aug 23 15:01:56 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 23 Aug 2010 15:01:56 -0500
Subject: [Swift-user] coaster maxtime
In-Reply-To: <571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov>
References: <571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov>
Message-ID: <1282593716.2228.2.camel@blabla2.none>
On Mon, 2010-08-23 at 13:49 -0600, Michael Wilde wrote:
> I agree. How about for backwards compatibility, accept an entry without ":"s as seconds. That seems a reasonable convention for all time fields:
> hh:mm:ss or nnnn seconds.
Except that for walltimes in general that would mean minutes, which is
exactly the cause of some funny troubles. Justin and I stared at
(maxtime="3600", walltime="3600") for quite a while before figuring out
that one was minutes and the other seconds.
Mihael
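[The confusion described above can be sketched in sites.xml terms (a hedged example; profile keys as in typical coaster configurations, values illustrative):

```xml
<pool handle="example">
  <!-- coaster block lifetime: interpreted as 3600 SECONDS (1 hour) -->
  <profile namespace="globus" key="maxtime">3600</profile>
  <!-- job wall time: a bare number is interpreted as 3600 MINUTES (60 hours) -->
  <profile namespace="globus" key="maxwalltime">3600</profile>
</pool>
```
]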
From wilde at mcs.anl.gov Thu Aug 26 23:11:55 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 26 Aug 2010 22:11:55 -0600 (GMT-06:00)
Subject: [Swift-user] Errors in 13-site OSG run: lazy error question
In-Reply-To:
Message-ID: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
Glen, I wonder if what's happening here is that Swift will retry and lazily run past *job* errors, but the error below (a mapping error) is perhaps being treated as an error in Swift's interpretation of the script itself, causing an immediate halt to execution?
Can anyone confirm that this is what's happening, and whether it is the expected behavior?
Also, Glen, 2 questions:
1) Isn't the error below the one that was fixed by Mihael in a recent revision - the same one I looked at earlier in the week?
2) Do you know what errors the "Failed but can retry:8" message is referring to?
Where is the log/run directory for this run? How long did it take to get the 589 jobs finished? It would be good to start plotting these large multi-site runs to get a sense of how the scheduler is doing.
- Mike
----- "Glen Hocky" wrote:
> here's the result of my 13 site run that ran while i was out this
> evening. It did pretty well!
> but seems to have that problem of not quite lazy errors
> ........
> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3
> Stage out:1 Finished successfully:586
> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4
> Stage out:2 Finished successfully:587
> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished
> successfully:587 Failed but can retry:6
> Progress: Submitting:3 Submitted:262 Active:140 Finished
> successfully:589 Failed but can retry:8
> Failed to transfer wrapper log from
> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on
> UCHC_CBG_vdgateway.vcell.uchc.edu
> Execution failed:
> org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile)
> for org.griphyn.vdl.mapping.DataNode identifier
> tag:benc at ci.uchicago.edu
> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut
> with no value at dataset=modelOut path=[3][1][11] (not closed)
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
From hategan at mcs.anl.gov Thu Aug 26 23:15:44 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 26 Aug 2010 23:15:44 -0500
Subject: [Swift-user] Errors in 13-site OSG run: lazy error question
In-Reply-To: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
References: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
Message-ID: <1282882544.17690.0.camel@blabla2.none>
Wait, wait, wait. Is this a new "invalid path (..logfile)" error?
On Thu, 2010-08-26 at 22:11 -0600, Michael Wilde wrote:
> Glen, I wonder if whats happening here is that Swift will retry and lazily run past *job* errors, but the error below (a mapping error) is maybe being treated as an error in Swift's interpretation of the script itself, and this causes an immediate halt to execution?
>
> Can anyone confirm that this is whats happening, and if it is the expected behavior?
>
> Also, Glen, 2 questions:
>
> 1) Isn't the error below the one that was fixed by Mihael in a recent revision - the same one I looked at earlier in the week?
>
> 2) Do you know what errors the "Failed but can retry:8" message is referring to?
>
> Where is the log/run directory for this run? How long did it take to get the 589 jobs finished? It would be good to start plotting these large multi-site runs to get a sense of how the scheduler is doing.
>
> - Mike
>
>
> ----- "Glen Hocky" wrote:
>
> > here's the result of my 13 site run that ran while i was out this
> > evening. It did pretty well!
> > but seems to have that problem of not quite lazy errors
> > ........
> > Progress: Submitting:3 Submitted:262 Active:147 Checking status:3
> > Stage out:1 Finished successfully:586
> > Progress: Submitting:3 Submitted:262 Active:144 Checking status:4
> > Stage out:2 Finished successfully:587
> > Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished
> > successfully:587 Failed but can retry:6
> > Progress: Submitting:3 Submitted:262 Active:140 Finished
> > successfully:589 Failed but can retry:8
> > Failed to transfer wrapper log from
> > glassRunCavities-20100826-1718-7gi0dzs1/info/5 on
> > UCHC_CBG_vdgateway.vcell.uchc.edu
> > Execution failed:
> > org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile)
> > for org.griphyn.vdl.mapping.DataNode identifier
> > tag:benc at ci.uchicago.edu
> > ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut
> > with no value at dataset=modelOut path=[3][1][11] (not closed)
>
From hategan at mcs.anl.gov Thu Aug 26 23:27:38 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 26 Aug 2010 23:27:38 -0500
Subject: [Swift-user] Errors in 13-site OSG run: lazy error question
In-Reply-To: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
References: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
Message-ID: <1282883258.17811.2.camel@blabla2.none>
On Thu, 2010-08-26 at 22:11 -0600, Michael Wilde wrote:
> Glen, I wonder if whats happening here is that Swift will retry and
> lazily run past *job* errors, but the error below (a mapping error) is
> maybe being treated as an error in Swift's interpretation of the
> script itself, and this causes an immediate halt to execution?
>
> Can anyone confirm that this is whats happening, and if it is the expected behavior?
Right. Some errors are retriable. Jobs get retried in the hope that
the errors will go away, which means they don't get reported until the
last round (and currently only the last error is reported).
Some errors, such as the ones considered to be internal inconsistencies,
will cause everything to fail immediately.
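[The retry behavior described above is tunable through swift.properties entries along these lines (a sketch; property names per the Swift user guide, values illustrative):

```
# number of times a retriable job failure is retried before it is fatal
execution.retries=3
# keep executing independent branches after a job ultimately fails
lazy.errors=true
```
]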
From hockyg at uchicago.edu Thu Aug 26 23:54:22 2010
From: hockyg at uchicago.edu (Glen Hocky)
Date: Fri, 27 Aug 2010 00:54:22 -0400
Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question
In-Reply-To: <-3390454574925164218@unknownmsgid>
References: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
<-3390454574925164218@unknownmsgid>
Message-ID:
log is on engage-submit
/home/hockyg/swift_logs/glassRunCavities-20100826-1718-7gi0dzs1.log
On Fri, Aug 27, 2010 at 12:35 AM, Glen Hocky wrote:
> Yes nominally the same error but it's not at the beginning but in the
> middle now for some reason. I think it's a mid-stated error message.
> I'll attach the log soon
>
> On Aug 27, 2010, at 12:11 AM, Michael Wilde wrote:
>
> > Glen, I wonder if whats happening here is that Swift will retry and
> lazily run past *job* errors, but the error below (a mapping error) is maybe
> being treated as an error in Swift's interpretation of the script itself,
> and this causes an immediate halt to execution?
> >
> > Can anyone confirm that this is whats happening, and if it is the
> expected behavior?
> >
> > Also, Glen, 2 questions:
> >
> > 1) Isn't the error below the one that was fixed by Mihael in a recent
> revision - the same one I looked at earlier in the week?
> >
> > 2) Do you know what errors the "Failed but can retry:8" message is
> referring to?
> >
> > Where is the log/run directory for this run? How long did it take to get
> the 589 jobs finished? It would be good to start plotting these large
> multi-site runs to get a sense of how the scheduler is doing.
> >
> > - Mike
> >
> >
> > ----- "Glen Hocky" wrote:
> >
> >> here's the result of my 13 site run that ran while i was out this
> >> evening. It did pretty well!
> >> but seems to have that problem of not quite lazy errors
> >> ........
> >> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3
> >> Stage out:1 Finished successfully:586
> >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4
> >> Stage out:2 Finished successfully:587
> >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished
> >> successfully:587 Failed but can retry:6
> >> Progress: Submitting:3 Submitted:262 Active:140 Finished
> >> successfully:589 Failed but can retry:8
> >> Failed to transfer wrapper log from
> >> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on
> >> UCHC_CBG_vdgateway.vcell.uchc.edu
> >> Execution failed:
> >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile)
> >> for org.griphyn.vdl.mapping.DataNode identifier
> >> tag:benc at ci.uchicago.edu
> >> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut
> >> with no value at dataset=modelOut path=[3][1][11] (not closed)
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
>
From wilde at mcs.anl.gov Fri Aug 27 10:06:03 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 27 Aug 2010 09:06:03 -0600 (GMT-06:00)
Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question
In-Reply-To: <-3390454574925164218@unknownmsgid>
Message-ID: <1099728118.1372671282921563325.JavaMail.root@zimbra.anl.gov>
Glen, as I recall, in the previous incident of this error we re-created with a simpler script, using only the "cat" app(), correct?
Is it possible to re-create this similar error in a similar test script?
Mihael, any thoughts on whether it's likely that the prior fix did not address all cases?
Thanks,
- Mike
----- "Glen Hocky" wrote:
> Yes nominally the same error but it's not at the beginning but in the
> middle now for some reason. I think it's a mid-stated error message.
> I'll attach the log soon
>
> On Aug 27, 2010, at 12:11 AM, Michael Wilde
> wrote:
>
> > Glen, I wonder if whats happening here is that Swift will retry and
> lazily run past *job* errors, but the error below (a mapping error) is
> maybe being treated as an error in Swift's interpretation of the
> script itself, and this causes an immediate halt to execution?
> >
> > Can anyone confirm that this is whats happening, and if it is the
> expected behavior?
> >
> > Also, Glen, 2 questions:
> >
> > 1) Isn't the error below the one that was fixed by Mihael in a
> recent revision - the same one I looked at earlier in the week?
> >
> > 2) Do you know what errors the "Failed but can retry:8" message is
> referring to?
> >
> > Where is the log/run directory for this run? How long did it take
> to get the 589 jobs finished? It would be good to start plotting
> these large multi-site runs to get a sense of how the scheduler is
> doing.
> >
> > - Mike
> >
> >
> > ----- "Glen Hocky" wrote:
> >
> >> here's the result of my 13 site run that ran while i was out this
> >> evening. It did pretty well!
> >> but seems to have that problem of not quite lazy errors
> >> ........
> >> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3
> >> Stage out:1 Finished successfully:586
> >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4
> >> Stage out:2 Finished successfully:587
> >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2
> Finished
> >> successfully:587 Failed but can retry:6
> >> Progress: Submitting:3 Submitted:262 Active:140 Finished
> >> successfully:589 Failed but can retry:8
> >> Failed to transfer wrapper log from
> >> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on
> >> UCHC_CBG_vdgateway.vcell.uchc.edu
> >> Execution failed:
> >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path
> (..logfile)
> >> for org.griphyn.vdl.mapping.DataNode identifier
> >> tag:benc at ci.uchicago.edu
> >> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type
> GlassOut
> >> with no value at dataset=modelOut path=[3][1][11] (not closed)
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
From hategan at mcs.anl.gov Fri Aug 27 11:34:05 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 27 Aug 2010 11:34:05 -0500
Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question
In-Reply-To: <1099728118.1372671282921563325.JavaMail.root@zimbra.anl.gov>
References: <1099728118.1372671282921563325.JavaMail.root@zimbra.anl.gov>
Message-ID: <1282926845.19454.6.camel@blabla2.none>
Or if you can find the stack trace of that specific error in the log,
that might be useful.
On Fri, 2010-08-27 at 09:06 -0600, Michael Wilde wrote:
> Glen, as I recall, in the previous incident of this error we re-created with a simpler script, using only the "cat" app(), correct?
>
> Is it possible to re-create this similar error in a similar test script?
>
> Mihael, any thoughts on whether its likely that the prior fix did not address all cases?
>
> Thanks,
>
> - Mike
>
>
> ----- "Glen Hocky" wrote:
>
> > Yes nominally the same error but it's not at the beginning but in the
> > middle now for some reason. I think it's a mid-stated error message.
> > I'll attach the log soon
> >
> > On Aug 27, 2010, at 12:11 AM, Michael Wilde
> > wrote:
> >
> > > Glen, I wonder if whats happening here is that Swift will retry and
> > lazily run past *job* errors, but the error below (a mapping error) is
> > maybe being treated as an error in Swift's interpretation of the
> > script itself, and this causes an immediate halt to execution?
> > >
> > > Can anyone confirm that this is whats happening, and if it is the
> > expected behavior?
> > >
> > > Also, Glen, 2 questions:
> > >
> > > 1) Isn't the error below the one that was fixed by Mihael in a
> > recent revision - the same one I looked at earlier in the week?
> > >
> > > 2) Do you know what errors the "Failed but can retry:8" message is
> > referring to?
> > >
> > > Where is the log/run directory for this run? How long did it take
> > to get the 589 jobs finished? It would be good to start plotting
> > these large multi-site runs to get a sense of how the scheduler is
> > doing.
> > >
> > > - Mike
> > >
> > >
> > > ----- "Glen Hocky" wrote:
> > >
> > >> here's the result of my 13 site run that ran while i was out this
> > >> evening. It did pretty well!
> > >> but seems to have that problem of not quite lazy errors
> > >> ........
> > >> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3
> > >> Stage out:1 Finished successfully:586
> > >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4
> > >> Stage out:2 Finished successfully:587
> > >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2
> > Finished
> > >> successfully:587 Failed but can retry:6
> > >> Progress: Submitting:3 Submitted:262 Active:140 Finished
> > >> successfully:589 Failed but can retry:8
> > >> Failed to transfer wrapper log from
> > >> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on
> > >> UCHC_CBG_vdgateway.vcell.uchc.edu
> > >> Execution failed:
> > >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path
> > (..logfile)
> > >> for org.griphyn.vdl.mapping.DataNode identifier
> > >> tag:benc at ci.uchicago.edu
> > >> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type
> > GlassOut
> > >> with no value at dataset=modelOut path=[3][1][11] (not closed)
> > >
> > > --
> > > Michael Wilde
> > > Computation Institute, University of Chicago
> > > Mathematics and Computer Science Division
> > > Argonne National Laboratory
> > >
>
From hategan at mcs.anl.gov Fri Aug 27 11:41:07 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 27 Aug 2010 11:41:07 -0500
Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question
In-Reply-To: <1282926845.19454.6.camel@blabla2.none>
References: <1099728118.1372671282921563325.JavaMail.root@zimbra.anl.gov>
<1282926845.19454.6.camel@blabla2.none>
Message-ID: <1282927267.19454.7.camel@blabla2.none>
Or even the log itself, because I don't think I have access to
engage-submit.
On Fri, 2010-08-27 at 11:34 -0500, Mihael Hategan wrote:
> Or if you can find the stack trace of that specific error in the log,
> that might be useful.
>
> On Fri, 2010-08-27 at 09:06 -0600, Michael Wilde wrote:
> > Glen, as I recall, in the previous incident of this error we re-created with a simpler script, using only the "cat" app(), correct?
> >
> > Is it possible to re-create this similar error in a similar test script?
> >
> > Mihael, any thoughts on whether its likely that the prior fix did not address all cases?
> >
> > Thanks,
> >
> > - Mike
> >
> >
> > ----- "Glen Hocky" wrote:
> >
> > > Yes nominally the same error but it's not at the beginning but in the
> > > middle now for some reason. I think it's a mid-stated error message.
> > > I'll attach the log soon
> > >
> > > On Aug 27, 2010, at 12:11 AM, Michael Wilde
> > > wrote:
> > >
> > > > Glen, I wonder if whats happening here is that Swift will retry and
> > > lazily run past *job* errors, but the error below (a mapping error) is
> > > maybe being treated as an error in Swift's interpretation of the
> > > script itself, and this causes an immediate halt to execution?
> > > >
> > > > Can anyone confirm that this is whats happening, and if it is the
> > > expected behavior?
> > > >
> > > > Also, Glen, 2 questions:
> > > >
> > > > 1) Isn't the error below the one that was fixed by Mihael in a
> > > recent revision - the same one I looked at earlier in the week?
> > > >
> > > > 2) Do you know what errors the "Failed but can retry:8" message is
> > > referring to?
> > > >
> > > > Where is the log/run directory for this run? How long did it take
> > > to get the 589 jobs finished? It would be good to start plotting
> > > these large multi-site runs to get a sense of how the scheduler is
> > > doing.
> > > >
> > > > - Mike
> > > >
> > > >
> > > > ----- "Glen Hocky" wrote:
> > > >
> > > >> here's the result of my 13 site run that ran while i was out this
> > > >> evening. It did pretty well!
> > > >> but seems to have that problem of not quite lazy errors
> > > >> ........
> > > >> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3
> > > >> Stage out:1 Finished successfully:586
> > > >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4
> > > >> Stage out:2 Finished successfully:587
> > > >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2
> > > Finished
> > > >> successfully:587 Failed but can retry:6
> > > >> Progress: Submitting:3 Submitted:262 Active:140 Finished
> > > >> successfully:589 Failed but can retry:8
> > > >> Failed to transfer wrapper log from
> > > >> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on
> > > >> UCHC_CBG_vdgateway.vcell.uchc.edu
> > > >> Execution failed:
> > > >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path
> > > (..logfile)
> > > >> for org.griphyn.vdl.mapping.DataNode identifier
> > > >> tag:benc at ci.uchicago.edu
> > > >> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type
> > > GlassOut
> > > >> with no value at dataset=modelOut path=[3][1][11] (not closed)
> > > >
> > > > --
> > > > Michael Wilde
> > > > Computation Institute, University of Chicago
> > > > Mathematics and Computer Science Division
> > > > Argonne National Laboratory
> > > >
> >
>
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
From matthew.woitaszek at gmail.com Fri Aug 27 14:10:54 2010
From: matthew.woitaszek at gmail.com (Matthew Woitaszek)
Date: Fri, 27 Aug 2010 13:10:54 -0600
Subject: [Swift-user] Deleting no longer necessary anonymous files in
_concurrent
Message-ID:
Good afternoon,
I'm working with a script that creates arrays of intermediate files
using the anonymous concurrent mapper, such as:
file wgt_file[];
As I expect, all of these files get generated in the remote swift
temporary directory and are then returned to the _concurrent directory
on the host executing Swift. However, in this particular application,
they're then immediately consumed by a subsequent procedure and never
needed again.
Is there a way to configure Swift or the file mapper declaration to
delete these files after the remaining script "consumes" them? (That
is, after all procedures relying on them as inputs have been
executed?) Or can (should?) that be done manually?
More speculatively, is there a way to keep files like these on the
execution host and not even bring them back to _concurrent? (With loss
of generality, I'm executing on a single site, and don't really ever
need the file locally, for restarts or staging to another site.)
Any advice about managing copies of large intermediate data files in
the Swift execution context would be appreciated!
Matthew
From wozniak at mcs.anl.gov Mon Aug 30 16:54:29 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 30 Aug 2010 16:54:29 -0500 (CDT)
Subject: [Swift-user] Deleting no longer necessary anonymous files in
_concurrent
In-Reply-To:
References:
Message-ID:
Hi Matthew
Deleting files is out of the scope of the Swift language. You can
of course remove them yourself in your scripts, and as long as Swift does
not try to stage them out you should be fine.
You may want to look at external variables as another way to
approach this (manual 2.5). Using external variables you can manage the
files in your scripts while maintaining the Swift progress model.
Justin
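[A minimal SwiftScript sketch of the external-variable approach Justin mentions (apps and names are illustrative, not taken from Matthew's script): the external value carries only the ordering dependency, so the intermediate file can live and die on the execution site without Swift staging it back.

```
type file;

// produce an intermediate on the site; signal completion via an external
app (external done) make_wgt (file cfg) {
  gen_wgt @cfg;
}

// consume it: depends on the external, not on a Swift-mapped file
app (file out) consume (external done) {
  use_wgt stdout=@out;
}

file cfg <"params.in">;
external w;
w = make_wgt(cfg);
file result <"result.out">;
result = consume(w);
```
]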
On Fri, 27 Aug 2010, Matthew Woitaszek wrote:
> Good afternoon,
>
> I'm working with a script that creates arrays of intermediate files
> using the anonymous concurrent mapper, such as:
>
> file wgt_file[];
>
> As I expect, all of these files get generated in the remote swift
> temporary directory and are then returned to the _concurrent directory
> on the host executing Swift. However, in this particular application,
> they're then immediately consumed by a subsequent procedure and never
> needed again.
>
> Is there a way to configure Swift or the file mapper declaration to
> delete these files after the remaining script "consumes" them? (That
> is, after all procedures relying on them as inputs have been
> executed?) Or can (should?) that be done manually?
>
> More speculatively, is there a way to keep files like these on the
> execution host and not even bring them back to _concurrent? (With loss
> of generality, I'm executing on a single site, and don't really ever
> need the file locally, for restarts or staging to another site.)
>
> Any advice about managing copies of large intermediate data files in
> the Swift execution context would be appreciated!
>
> Matthew
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>
--
Justin M Wozniak
From hockyg at gmail.com Thu Aug 26 23:36:23 2010
From: hockyg at gmail.com (Glen Hocky)
Date: Fri, 27 Aug 2010 04:36:23 -0000
Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question
In-Reply-To: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
References: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov>
Message-ID: <-3390454574925164218@unknownmsgid>
Yes, nominally the same error, but it's not at the beginning but in the
middle now for some reason. I think it's a mis-stated error message.
I'll attach the log soon.
On Aug 27, 2010, at 12:11 AM, Michael Wilde wrote:
> Glen, I wonder if whats happening here is that Swift will retry and lazily run past *job* errors, but the error below (a mapping error) is maybe being treated as an error in Swift's interpretation of the script itself, and this causes an immediate halt to execution?
>
> Can anyone confirm that this is whats happening, and if it is the expected behavior?
>
> Also, Glen, 2 questions:
>
> 1) Isn't the error below the one that was fixed by Mihael in a recent revision - the same one I looked at earlier in the week?
>
> 2) Do you know what errors the "Failed but can retry:8" message is referring to?
>
> Where is the log/run directory for this run? How long did it take to get the 589 jobs finished? It would be good to start plotting these large multi-site runs to get a sense of how the scheduler is doing.
>
> - Mike
>
>
> ----- "Glen Hocky" wrote:
>
>> here's the result of my 13 site run that ran while i was out this
>> evening. It did pretty well!
>> but seems to have that problem of not quite lazy errors
>> ........
>> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3
>> Stage out:1 Finished successfully:586
>> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4
>> Stage out:2 Finished successfully:587
>> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished
>> successfully:587 Failed but can retry:6
>> Progress: Submitting:3 Submitted:262 Active:140 Finished
>> successfully:589 Failed but can retry:8
>> Failed to transfer wrapper log from
>> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on
>> UCHC_CBG_vdgateway.vcell.uchc.edu
>> Execution failed:
>> org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile)
>> for org.griphyn.vdl.mapping.DataNode identifier
>> tag:benc at ci.uchicago.edu
>> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut
>> with no value at dataset=modelOut path=[3][1][11] (not closed)
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>