From jon.monette at gmail.com Tue Aug 3 21:12:52 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 03 Aug 2010 21:12:52 -0500 Subject: [Swift-user] Montage wrapper error Message-ID: <4C58CCA4.7030805@gmail.com> Hello, Has anyone ever run into this error: Failed to transfer wrapper log from m101_montage-20100803-2101-4ihqvdv9/info/s on teraport Execution failed: Exception in mProjectPP: Arguments: [-X, raw_dir/2mass-atlas-990524n-j0320044.fits, proj_dir/proj_2mass-atlas-990524n-j0320044.fits, template.hdr] Host: teraport Directory: m101_montage-20100803-2101-4ihqvdv9/jobs/s/mProjectPP-sz57orvj stderr.txt: stdout.txt: ---- Caused by: org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: Cannot determine the existence of the file Caused by: The connection has been closed [Unnamed Channel] Cleaning up... Shutting down service at https://128.135.125.117:52276 Got channel MetaChannel: 1867624887[1293086287: {}] -> GSSSChannel-11234326669(1)[1293086287: {}] + Done I am testing my wrappers for Montage on a larger scale and I keep getting this error. There are about 640 images, but it only projects about 142 images before this error pops up. If my run will help, it exists at "/home/jonmon/Workspace/Swift/Montage/m101_j_4x4/runs/m101_montage_Aug-03-2010_21-01-09" on the ci machines. Any help is much appreciated. -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From hategan at mcs.anl.gov Tue Aug 3 22:40:17 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 03 Aug 2010 22:40:17 -0500 Subject: [Swift-user] Montage wrapper error In-Reply-To: <4C58CCA4.7030805@gmail.com> References: <4C58CCA4.7030805@gmail.com> Message-ID: <1280893217.28553.7.camel@blabla2.none> On the remote site there should be something called ~/.globus/coasters/coasters.log. It tends to contain useful information. As usual, the swift log also tends to contain useful information. However, Mike has mentioned some problems when using the coaster filesystem provider. In the effort to implement provider staging for coasters, that may have broken. Is that what you are using? (i.e. post sites.xml). Mihael On Tue, 2010-08-03 at 21:12 -0500, Jonathan Monette wrote: > Hello, > Has anyone ever run into this error: > > Failed to transfer wrapper log from > m101_montage-20100803-2101-4ihqvdv9/info/s on teraport > Execution failed: > Exception in mProjectPP: > Arguments: [-X, raw_dir/2mass-atlas-990524n-j0320044.fits, > proj_dir/proj_2mass-atlas-990524n-j0320044.fits, template.hdr] > Host: teraport > Directory: m101_montage-20100803-2101-4ihqvdv9/jobs/s/mProjectPP-sz57orvj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > > org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: > Cannot determine the existence of the file > Caused by: > The connection has been closed [Unnamed Channel] > Cleaning up... > Shutting down service at https://128.135.125.117:52276 > Got channel MetaChannel: 1867624887[1293086287: {}] -> > GSSSChannel-11234326669(1)[1293086287: {}] > + Done > > I am testing my wrappers for Montage on a larger scale and I keep getting > this error. There are about 640 images, but it only projects about 142 > images before this error pops up. If my run will help, it exists at > "/home/jonmon/Workspace/Swift/Montage/m101_j_4x4/runs/m101_montage_Aug-03-2010_21-01-09" > on the ci machines. Any help is much appreciated. 
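A note on the configuration Mihael is asking about: a sites.xml pool of the kind discussed in this thread (coaster execution over ssh:pbs plus coaster file staging) generally looks something like the sketch below. This is a rough illustration only; the handle, host name, profile values, and paths are placeholders rather than the poster's actual settings, and the element and profile keys shown are common ones from the Swift sites.xml schema of this period.

  <pool handle="teraport">
    <!-- run jobs through the coaster service, which submits to PBS over SSH -->
    <execution provider="coaster" url="tp-login.example.edu" jobmanager="ssh:pbs"/>
    <!-- stage files through the same service: this is the "coaster filesystem provider" -->
    <filesystem provider="coaster" url="tp-login.example.edu"/>
    <profile namespace="globus" key="maxtime">3600</profile>
    <profile namespace="globus" key="workersPerNode">8</profile>
    <profile namespace="globus" key="queue">short</profile>
    <profile namespace="karajan" key="jobThrottle">0.7</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <workdirectory>/home/username/swiftwork</workdirectory>
  </pool>

With both the execution and filesystem elements pointing at the coaster provider, job submission and data movement share one coaster channel instead of going over a separate SSH or GridFTP connection, which is the setup being asked about here.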
> From jon.monette at gmail.com Tue Aug 3 22:49:56 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 03 Aug 2010 22:49:56 -0500 Subject: [Swift-user] Montage wrapper error In-Reply-To: <1280893492.28553.13.camel@blabla2.none> References: <4C58CCA4.7030805@gmail.com> <1280893217.28553.7.camel@blabla2.none> <4C58E15D.3040705@gmail.com> <1280893492.28553.13.camel@blabla2.none> Message-ID: <4C58E364.7040207@gmail.com> to use the coaster filesystem i should use local:pbs in the jobmanager? On 8/3/10 10:44 PM, Mihael Hategan wrote: > Ok. Maybe not. Looks like an SSH issue. Can you try the coaster fs > provider instead? > > On Tue, 2010-08-03 at 22:41 -0500, Jonathan Monette wrote: > >> >> >> >> /home/jonmon/Library/Swift/work/localhost >> .05 >> >> >> >> > jobmanager="ssh:pbs" /> >> 3000 >> 8 >> 1 >> 1 >> 10 >> short >> 0.7 >> 10000 >> >> /home/jonmon/Library/swift/work/teraport >> >> >> This is my sites file >> >> On 8/3/10 10:40 PM, Mihael Hategan wrote: >> >>> On the remote site there should be something called >>> ~/.globus/coasters/coasters.log. It tends to contain useful information. >>> As usual, the swift log also tends to contain useful information. >>> >>> However, Mike has mentioned some problems when using the coaster >>> filesystem provider. In the effort to implement provider staging for >>> coasters, that may have broke. Is that what you are using? (i.e. post >>> sites.xml). >>> >>> Mihael >>> >>> On Tue, 2010-08-03 at 21:12 -0500, Jonathan Monette wrote: >>> >>> >>>> Hello, >>>> Has anyone ever ran into this error: >>>> >>>> Failed to transfer wrapper log from >>>> m101_montage-20100803-2101-4ihqvdv9/info/s on teraport >>>> Execution failed: >>>> Exception in mProjectPP: >>>> Arguments: [-X, raw_dir/2mass-atlas-990524n-j0320044.fits, >>>> proj_dir/proj_2mass-atlas-990524n-j0320044.fits, template.hdr] >>>> Host: teraport >>>> Directory: m101_montage-20100803-2101-4ihqvdv9/jobs/s/mProjectPP-sz57orvj >>>> stderr.txt: >>>> >>>> stdout.txt: >>>> >>>> ---- >>>> >>>> Caused by: >>>> >>>> org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: >>>> Cannot determine the existence of the file >>>> Caused by: >>>> The connection has been closed [Unnamed Channel] >>>> Cleaning up... >>>> Shutting down service at https://128.135.125.117:52276 >>>> Got channel MetaChannel: 1867624887[1293086287: {}] -> >>>> GSSSChannel-11234326669(1)[1293086287: {}] >>>> + Done >>>> >>>> I am testing my wrappers to Montage on a larger scale and I keep getting >>>> this error. There is about 640 images but it only projects about 142 >>>> images before this error pops up. If my run will help my run exists at >>>> "/home/jonmon/Workspace/Swift/Montage/m101_j_4x4/runs/m101_montage_Aug-03-2010_21-01-09" >>>> on the ci machines. Any help is much appreciated. >>>> >>>> >>>> >>> >>> >> > > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. 
- Albert Einstein From hategan at mcs.anl.gov Tue Aug 3 22:55:01 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 03 Aug 2010 22:55:01 -0500 Subject: [Swift-user] Montage wrapper error In-Reply-To: <4C58E364.7040207@gmail.com> References: <4C58CCA4.7030805@gmail.com> <1280893217.28553.7.camel@blabla2.none> <4C58E15D.3040705@gmail.com> <1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com> Message-ID: <1280894101.28882.1.camel@blabla2.none> On Tue, 2010-08-03 at 22:49 -0500, Jonathan Monette wrote: > to use the coaster filesystem i should use local:pbs in the jobmanager? > > On 8/3/10 10:44 PM, Mihael Hategan wrote: > > Ok. Maybe not. Looks like an SSH issue. Can you try the coaster fs > > provider instead? > > > > On Tue, 2010-08-03 at 22:41 -0500, Jonathan Monette wrote: > > > >> > >> > >> > >> /home/jonmon/Library/Swift/work/localhost > >> .05 > >> > >> > >> > >> >> jobmanager="ssh:pbs" /> > >> 3000 > >> 8 > >> 1 > >> 1 > >> 10 > >> short > >> 0.7 > >> 10000 > >> > >> /home/jonmon/Library/swift/work/teraport > >> > >> > >> This is my sites file > >> > >> On 8/3/10 10:40 PM, Mihael Hategan wrote: > >> > >>> On the remote site there should be something called > >>> ~/.globus/coasters/coasters.log. It tends to contain useful information. > >>> As usual, the swift log also tends to contain useful information. > >>> > >>> However, Mike has mentioned some problems when using the coaster > >>> filesystem provider. In the effort to implement provider staging for > >>> coasters, that may have broke. Is that what you are using? (i.e. post > >>> sites.xml). > >>> > >>> Mihael > >>> > >>> On Tue, 2010-08-03 at 21:12 -0500, Jonathan Monette wrote: > >>> > >>> > >>>> Hello, > >>>> Has anyone ever ran into this error: > >>>> > >>>> Failed to transfer wrapper log from > >>>> m101_montage-20100803-2101-4ihqvdv9/info/s on teraport > >>>> Execution failed: > >>>> Exception in mProjectPP: > >>>> Arguments: [-X, raw_dir/2mass-atlas-990524n-j0320044.fits, > >>>> proj_dir/proj_2mass-atlas-990524n-j0320044.fits, template.hdr] > >>>> Host: teraport > >>>> Directory: m101_montage-20100803-2101-4ihqvdv9/jobs/s/mProjectPP-sz57orvj > >>>> stderr.txt: > >>>> > >>>> stdout.txt: > >>>> > >>>> ---- > >>>> > >>>> Caused by: > >>>> > >>>> org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: > >>>> Cannot determine the existence of the file > >>>> Caused by: > >>>> The connection has been closed [Unnamed Channel] > >>>> Cleaning up... > >>>> Shutting down service at https://128.135.125.117:52276 > >>>> Got channel MetaChannel: 1867624887[1293086287: {}] -> > >>>> GSSSChannel-11234326669(1)[1293086287: {}] > >>>> + Done > >>>> > >>>> I am testing my wrappers to Montage on a larger scale and I keep getting > >>>> this error. There is about 640 images but it only projects about 142 > >>>> images before this error pops up. If my run will help my run exists at > >>>> "/home/jonmon/Workspace/Swift/Montage/m101_j_4x4/runs/m101_montage_Aug-03-2010_21-01-09" > >>>> on the ci machines. Any help is much appreciated. 
> >>>> > >>>> > >>>> > >>> > >>> > >> > > > > > From jon.monette at gmail.com Tue Aug 3 22:59:53 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 03 Aug 2010 22:59:53 -0500 Subject: [Swift-user] Montage wrapper error In-Reply-To: <1280894101.28882.1.camel@blabla2.none> References: <4C58CCA4.7030805@gmail.com> <1280893217.28553.7.camel@blabla2.none> <4C58E15D.3040705@gmail.com> <1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com> <1280894101.28882.1.camel@blabla2.none> Message-ID: <4C58E5B9.80607@gmail.com> 3000 8 1 1 10 short 0.7 10000 /home/jonmon/Library/swift/work/teraport here is the new sites entry. I tried running my code and got this error. class org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws exception in doStuff. Fix it! java.lang.NullPointerException at org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79) at org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77) at org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36) at org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122) On 8/3/10 10:55 PM, Mihael Hategan wrote: > > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From hategan at mcs.anl.gov Tue Aug 3 23:04:19 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 03 Aug 2010 23:04:19 -0500 Subject: [Swift-user] Montage wrapper error In-Reply-To: <4C58E5B9.80607@gmail.com> References: <4C58CCA4.7030805@gmail.com> <1280893217.28553.7.camel@blabla2.none> <4C58E15D.3040705@gmail.com> <1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com> <1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com> Message-ID: <1280894659.29101.2.camel@blabla2.none> Eh. I'll see if I can fix those. Though if you are using coasters, and given that I do think I fixed the existing issues, there is one more thing you could try: in swift.properties, at the end, say use.provider.staging=true Mihael On Tue, 2010-08-03 at 22:59 -0500, Jonathan Monette wrote: > > jobmanager="ssh:pbs" /> > > 3000 > 8 > 1 > 1 > 10 > short > 0.7 > 10000 > > /home/jonmon/Library/swift/work/teraport > > > here is the new sites entry. I tried running my code and got this error. > > class > org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws > exception in doStuff. Fix it! 
> java.lang.NullPointerException > at > org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79) > at > org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77) > at > org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36) > at > org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122) > > > On 8/3/10 10:55 PM, Mihael Hategan wrote: > > > > > From jon.monette at gmail.com Tue Aug 3 23:15:52 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 03 Aug 2010 23:15:52 -0500 Subject: [Swift-user] Montage wrapper error In-Reply-To: <1280894659.29101.2.camel@blabla2.none> References: <4C58CCA4.7030805@gmail.com> <1280893217.28553.7.camel@blabla2.none> <4C58E15D.3040705@gmail.com> <1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com> <1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com> <1280894659.29101.2.camel@blabla2.none> Message-ID: <4C58E978.5080208@gmail.com> with use.provider.staging=true i get Execution failed: Exception in mProjectPP: Arguments: [-X, raw_dir/2mass-atlas-990214n-j1210091.fits, proj_dir/proj_2mass-atlas-990214n-j1210091.fits, template.hdr] Host: teraport Directory: m101_montage-20100803-2312-ex8oc5ff/jobs/y/mProjectPP-yt0gtrvjTODO: outs ---- Caused by: Job failed with an exit code of 254 and with that line commented out i get a script printed to the screen saying it doesn't know what #!/BIN?BASH Execution failed: Could not initialize shared directory on teraport Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH On 8/3/10 11:04 PM, Mihael Hategan wrote: > Eh. I'll see if I can fix those. > > Though if you are using coasters, and given that I do think I fixed the > existing issues, there is one more thing you could try: > in swift.properties, at the end, say use.provider.staging=true > > Mihael > > On Tue, 2010-08-03 at 22:59 -0500, Jonathan Monette wrote: > >> >> > jobmanager="ssh:pbs" /> >> >> 3000 >> 8 >> 1 >> 1 >> 10 >> short >> 0.7 >> 10000 >> >> /home/jonmon/Library/swift/work/teraport >> >> >> here is the new sites entry. I tried running my code and got this error. >> >> class >> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws >> exception in doStuff. Fix it! >> java.lang.NullPointerException >> at >> org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79) >> at >> org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77) >> at >> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36) >> at >> org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122) >> >> >> On 8/3/10 10:55 PM, Mihael Hategan wrote: >> >>> >>> >>> >> > > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. 
- Albert Einstein From hategan at mcs.anl.gov Tue Aug 3 23:20:31 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 03 Aug 2010 23:20:31 -0500 Subject: [Swift-user] Montage wrapper error In-Reply-To: <4C58E978.5080208@gmail.com> References: <4C58CCA4.7030805@gmail.com> <1280893217.28553.7.camel@blabla2.none> <4C58E15D.3040705@gmail.com> <1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com> <1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com> <1280894659.29101.2.camel@blabla2.none> <4C58E978.5080208@gmail.com> Message-ID: <1280895631.29340.0.camel@blabla2.none> Ok. I'll need to take a look at these. On Tue, 2010-08-03 at 23:15 -0500, Jonathan Monette wrote: > with use.provider.staging=true i get > Execution failed: > Exception in mProjectPP: > Arguments: [-X, raw_dir/2mass-atlas-990214n-j1210091.fits, > proj_dir/proj_2mass-atlas-990214n-j1210091.fits, template.hdr] > Host: teraport > Directory: > m101_montage-20100803-2312-ex8oc5ff/jobs/y/mProjectPP-yt0gtrvjTODO: outs > ---- > > Caused by: > Job failed with an exit code of 254 > > and with that line commented out i get a script printed to the screen > saying it doesn't know what #!/BIN?BASH > > Execution failed: > Could not initialize shared directory on teraport > Caused by: > org.globus.cog.abstraction.impl.file.FileResourceException: > org.globus.cog.karajan.workflow.service.ProtocolException: Unknown > command: #!/BIN/BASH > > > On 8/3/10 11:04 PM, Mihael Hategan wrote: > > Eh. I'll see if I can fix those. > > > > Though if you are using coasters, and given that I do think I fixed the > > existing issues, there is one more thing you could try: > > in swift.properties, at the end, say use.provider.staging=true > > > > Mihael > > > > On Tue, 2010-08-03 at 22:59 -0500, Jonathan Monette wrote: > > > >> > >> >> jobmanager="ssh:pbs" /> > >> > >> 3000 > >> 8 > >> 1 > >> 1 > >> 10 > >> short > >> 0.7 > >> 10000 > >> > >> /home/jonmon/Library/swift/work/teraport > >> > >> > >> here is the new sites entry. I tried running my code and got this error. > >> > >> class > >> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws > >> exception in doStuff. Fix it! > >> java.lang.NullPointerException > >> at > >> org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79) > >> at > >> org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77) > >> at > >> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36) > >> at > >> org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122) > >> > >> > >> On 8/3/10 10:55 PM, Mihael Hategan wrote: > >> > >>> > >>> > >>> > >> > > > > > From jon.monette at gmail.com Tue Aug 3 23:29:22 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 03 Aug 2010 23:29:22 -0500 Subject: [Swift-user] Montage wrapper error In-Reply-To: <1280895631.29340.0.camel@blabla2.none> References: <4C58CCA4.7030805@gmail.com> <1280893217.28553.7.camel@blabla2.none> <4C58E15D.3040705@gmail.com> <1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com> <1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com> <1280894659.29101.2.camel@blabla2.none> <4C58E978.5080208@gmail.com> <1280895631.29340.0.camel@blabla2.none> Message-ID: <4C58ECA2.1010305@gmail.com> Ok. Let me know if you need more of my input files or configurations. On 8/3/10 11:20 PM, Mihael Hategan wrote: > Ok. 
I'll need to take a look at these. > > On Tue, 2010-08-03 at 23:15 -0500, Jonathan Monette wrote: > >> with use.provider.staging=true i get >> Execution failed: >> Exception in mProjectPP: >> Arguments: [-X, raw_dir/2mass-atlas-990214n-j1210091.fits, >> proj_dir/proj_2mass-atlas-990214n-j1210091.fits, template.hdr] >> Host: teraport >> Directory: >> m101_montage-20100803-2312-ex8oc5ff/jobs/y/mProjectPP-yt0gtrvjTODO: outs >> ---- >> >> Caused by: >> Job failed with an exit code of 254 >> >> and with that line commented out i get a script printed to the screen >> saying it doesn't know what #!/BIN?BASH >> >> Execution failed: >> Could not initialize shared directory on teraport >> Caused by: >> org.globus.cog.abstraction.impl.file.FileResourceException: >> org.globus.cog.karajan.workflow.service.ProtocolException: Unknown >> command: #!/BIN/BASH >> >> >> On 8/3/10 11:04 PM, Mihael Hategan wrote: >> >>> Eh. I'll see if I can fix those. >>> >>> Though if you are using coasters, and given that I do think I fixed the >>> existing issues, there is one more thing you could try: >>> in swift.properties, at the end, say use.provider.staging=true >>> >>> Mihael >>> >>> On Tue, 2010-08-03 at 22:59 -0500, Jonathan Monette wrote: >>> >>> >>>> >>>> >>> jobmanager="ssh:pbs" /> >>>> >>>> 3000 >>>> 8 >>>> 1 >>>> 1 >>>> 10 >>>> short >>>> 0.7 >>>> 10000 >>>> >>>> /home/jonmon/Library/swift/work/teraport >>>> >>>> >>>> here is the new sites entry. I tried running my code and got this error. >>>> >>>> class >>>> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws >>>> exception in doStuff. Fix it! >>>> java.lang.NullPointerException >>>> at >>>> org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79) >>>> at >>>> org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77) >>>> at >>>> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36) >>>> at >>>> org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122) >>>> >>>> >>>> On 8/3/10 10:55 PM, Mihael Hategan wrote: >>>> >>>> >>>>> >>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> > > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From hategan at mcs.anl.gov Tue Aug 3 23:33:55 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 03 Aug 2010 23:33:55 -0500 Subject: [Swift-user] Montage wrapper error In-Reply-To: <4C58ECA2.1010305@gmail.com> References: <4C58CCA4.7030805@gmail.com> <1280893217.28553.7.camel@blabla2.none> <4C58E15D.3040705@gmail.com> <1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com> <1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com> <1280894659.29101.2.camel@blabla2.none> <4C58E978.5080208@gmail.com> <1280895631.29340.0.camel@blabla2.none> <4C58ECA2.1010305@gmail.com> Message-ID: <1280896435.29512.1.camel@blabla2.none> It may be useful for quickly reproducing it. You already know what I need. Config files, input files, table files, and scripts if they changed. Mihael On Tue, 2010-08-03 at 23:29 -0500, Jonathan Monette wrote: > Ok. Let me know if you need more of my input files or configurations. > > On 8/3/10 11:20 PM, Mihael Hategan wrote: > > Ok. I'll need to take a look at these. 
> > > > On Tue, 2010-08-03 at 23:15 -0500, Jonathan Monette wrote: > > > >> with use.provider.staging=true i get > >> Execution failed: > >> Exception in mProjectPP: > >> Arguments: [-X, raw_dir/2mass-atlas-990214n-j1210091.fits, > >> proj_dir/proj_2mass-atlas-990214n-j1210091.fits, template.hdr] > >> Host: teraport > >> Directory: > >> m101_montage-20100803-2312-ex8oc5ff/jobs/y/mProjectPP-yt0gtrvjTODO: outs > >> ---- > >> > >> Caused by: > >> Job failed with an exit code of 254 > >> > >> and with that line commented out i get a script printed to the screen > >> saying it doesn't know what #!/BIN?BASH > >> > >> Execution failed: > >> Could not initialize shared directory on teraport > >> Caused by: > >> org.globus.cog.abstraction.impl.file.FileResourceException: > >> org.globus.cog.karajan.workflow.service.ProtocolException: Unknown > >> command: #!/BIN/BASH > >> > >> > >> On 8/3/10 11:04 PM, Mihael Hategan wrote: > >> > >>> Eh. I'll see if I can fix those. > >>> > >>> Though if you are using coasters, and given that I do think I fixed the > >>> existing issues, there is one more thing you could try: > >>> in swift.properties, at the end, say use.provider.staging=true > >>> > >>> Mihael > >>> > >>> On Tue, 2010-08-03 at 22:59 -0500, Jonathan Monette wrote: > >>> > >>> > >>>> > >>>> >>>> jobmanager="ssh:pbs" /> > >>>> > >>>> 3000 > >>>> 8 > >>>> 1 > >>>> 1 > >>>> 10 > >>>> short > >>>> 0.7 > >>>> 10000 > >>>> > >>>> /home/jonmon/Library/swift/work/teraport > >>>> > >>>> > >>>> here is the new sites entry. I tried running my code and got this error. > >>>> > >>>> class > >>>> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws > >>>> exception in doStuff. Fix it! > >>>> java.lang.NullPointerException > >>>> at > >>>> org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79) > >>>> at > >>>> org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77) > >>>> at > >>>> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36) > >>>> at > >>>> org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122) > >>>> > >>>> > >>>> On 8/3/10 10:55 PM, Mihael Hategan wrote: > >>>> > >>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>> > >>>> > >>> > >>> > >> > > > > > From jon.monette at gmail.com Tue Aug 3 23:40:37 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 03 Aug 2010 23:40:37 -0500 Subject: [Swift-user] Montage wrapper error In-Reply-To: <1280896435.29512.1.camel@blabla2.none> References: <4C58CCA4.7030805@gmail.com> <1280893217.28553.7.camel@blabla2.none> <4C58E15D.3040705@gmail.com> <1280893492.28553.13.camel@blabla2.none> <4C58E364.7040207@gmail.com> <1280894101.28882.1.camel@blabla2.none> <4C58E5B9.80607@gmail.com> <1280894659.29101.2.camel@blabla2.none> <4C58E978.5080208@gmail.com> <1280895631.29340.0.camel@blabla2.none> <4C58ECA2.1010305@gmail.com> <1280896435.29512.1.camel@blabla2.none> Message-ID: <4C58EF45.3010506@gmail.com> all files used for this run and created (log files and such) are in $HOME/Workspace/Swift/Montage/m101_j_4x4/runs on the ci machines. If you would like I can tar up one of the runs. Not sure which you would prefer. On 8/3/10 11:33 PM, Mihael Hategan wrote: > It may be useful for quickly reproducing it. You already know what I > need. Config files, input files, table files, and scripts if they > changed. 
> > Mihael > > On Tue, 2010-08-03 at 23:29 -0500, Jonathan Monette wrote: > >> Ok. Let me know if you need more of my input files or configurations. >> >> On 8/3/10 11:20 PM, Mihael Hategan wrote: >> >>> Ok. I'll need to take a look at these. >>> >>> On Tue, 2010-08-03 at 23:15 -0500, Jonathan Monette wrote: >>> >>> >>>> with use.provider.staging=true i get >>>> Execution failed: >>>> Exception in mProjectPP: >>>> Arguments: [-X, raw_dir/2mass-atlas-990214n-j1210091.fits, >>>> proj_dir/proj_2mass-atlas-990214n-j1210091.fits, template.hdr] >>>> Host: teraport >>>> Directory: >>>> m101_montage-20100803-2312-ex8oc5ff/jobs/y/mProjectPP-yt0gtrvjTODO: outs >>>> ---- >>>> >>>> Caused by: >>>> Job failed with an exit code of 254 >>>> >>>> and with that line commented out i get a script printed to the screen >>>> saying it doesn't know what #!/BIN?BASH >>>> >>>> Execution failed: >>>> Could not initialize shared directory on teraport >>>> Caused by: >>>> org.globus.cog.abstraction.impl.file.FileResourceException: >>>> org.globus.cog.karajan.workflow.service.ProtocolException: Unknown >>>> command: #!/BIN/BASH >>>> >>>> >>>> On 8/3/10 11:04 PM, Mihael Hategan wrote: >>>> >>>> >>>>> Eh. I'll see if I can fix those. >>>>> >>>>> Though if you are using coasters, and given that I do think I fixed the >>>>> existing issues, there is one more thing you could try: >>>>> in swift.properties, at the end, say use.provider.staging=true >>>>> >>>>> Mihael >>>>> >>>>> On Tue, 2010-08-03 at 22:59 -0500, Jonathan Monette wrote: >>>>> >>>>> >>>>> >>>>>> >>>>>> >>>>> jobmanager="ssh:pbs" /> >>>>>> >>>>>> 3000 >>>>>> 8 >>>>>> 1 >>>>>> 1 >>>>>> 10 >>>>>> short >>>>>> 0.7 >>>>>> 10000 >>>>>> >>>>>> /home/jonmon/Library/swift/work/teraport >>>>>> >>>>>> >>>>>> here is the new sites entry. I tried running my code and got this error. >>>>>> >>>>>> class >>>>>> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer throws >>>>>> exception in doStuff. Fix it! >>>>>> java.lang.NullPointerException >>>>>> at >>>>>> org.globus.cog.abstraction.impl.file.coaster.commands.PutFileCommand.dataRead(PutFileCommand.java:79) >>>>>> at >>>>>> org.globus.cog.abstraction.impl.file.coaster.buffers.ReadBuffer.bufferRead(ReadBuffer.java:77) >>>>>> at >>>>>> org.globus.cog.abstraction.impl.file.coaster.buffers.NIOChannelReadBuffer.doStuff(NIOChannelReadBuffer.java:36) >>>>>> at >>>>>> org.globus.cog.abstraction.impl.file.coaster.buffers.Buffers.run(Buffers.java:122) >>>>>> >>>>>> >>>>>> On 8/3/10 10:55 PM, Mihael Hategan wrote: >>>>>> >>>>>> >>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> > > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. 
- Albert Einstein From iraicu at cs.uchicago.edu Wed Aug 4 03:15:03 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 04 Aug 2010 03:15:03 -0500 Subject: [Swift-user] CFP: The 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2010, co-located with Supercomputing 2010 -- November 15th, 2010 Message-ID: <4C592187.6050702@cs.uchicago.edu> Call for Papers ------------------------------------------------------------------------------------------------ The 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2010 http://dsl.cs.uchicago.edu/MTAGS10/ ------------------------------------------------------------------------------------------------ November 15th, 2010 New Orleans, Louisiana, USA Co-located with IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC10) ================================================================================================ The 3rd workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) will provide the scientific community a dedicated forum for presenting new research, development, and deployment efforts of large-scale many-task computing (MTC) applications on large scale clusters, Grids, Supercomputers, and Cloud Computing infrastructure. MTC, the theme of the workshop, encompasses loosely coupled applications, which are generally composed of many tasks (both independent and dependent tasks) to achieve some larger application goal. This workshop will cover challenges that can hamper efficiency and utilization in running applications on large-scale systems, such as local resource manager scalability and granularity, efficient utilization of raw hardware, parallel file system contention and scalability, data management, I/O management, reliability at scale, and application scalability. We welcome paper submissions on all topics related to MTC on large scale systems. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the IEEE digital library (pending approval). The workshop will be co-located with the IEEE/ACM Supercomputing 2010 Conference in New Orleans, Louisiana on November 15th, 2010. For more information, please see http://dsl.cs.uchicago.edu/MTAGS010/. Topics ------------------------------------------------------------------------------------------------ We invite the submission of original work that is related to the topics below. The papers can be either short (5 pages) position papers, or long (10 pages) research papers. 
Topics of interest include (in the context of Many-Task Computing): * Compute Resource Management * Scheduling * Job execution frameworks * Local resource manager extensions * Performance evaluation of resource managers in use on large scale systems * Dynamic resource provisioning * Techniques to manage many-core resources and/or GPUs * Challenges and opportunities in running many-task workloads on HPC systems * Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure * Storage architectures and implementations * Distributed file systems * Parallel file systems * Distributed meta-data management * Content distribution systems for large data * Data caching frameworks and techniques * Data management within and across data centers * Data-aware scheduling * Data-intensive computing applications * Eventual-consistency storage usage and management * Programming models and tools * Map-reduce and its generalizations * Many-task computing middleware and applications * Parallel programming frameworks * Ensemble MPI techniques and frameworks * Service-oriented science applications * Large-Scale Workflow Systems * Workflow system performance and scalability analysis * Scalability of workflow systems * Workflow infrastructure and e-Science middleware * Programming Paradigms and Models * Large-Scale Many-Task Applications * High-throughput computing (HTC) applications * Data-intensive applications * Quasi-supercomputing applications, deployments, and experiences * Performance Evaluation * Performance evaluation * Real systems * Simulations * Reliability of large systems Paper Submission and Publication ------------------------------------------------------------------------------------------------ Authors are invited to submit papers with unpublished, original work of not more than 10 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per IEEE 8.5 x 11 manuscript guidelines; document templates can be found at ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct8.5x11.pdf and ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct8.5x11.doc. We are also seeking position papers of no more than 5 pages in length. A 250 word abstract (PDF format) must be submitted online at https://cmt.research.microsoft.com/MTAGS2010/ before the deadline of August 25th, 2010 at 11:59PM PST; the final 5/10 page papers in PDF format will be due on September 1st, 2010 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the IEEE digital library (pending approval). Notifications of the paper decisions will be sent out by October 1st, 2010. Selected excellent work may be eligible for additional post-conference publication as journal articles or book chapters; see this year's ongoing special issue in the IEEE Transactions on Parallel and Distributed Systems (TPDS) at http://dsl.cs.uchicago.edu/TPDS_MTC/. Submission implies the willingness of at least one of the authors to register and present the paper. For more information, please visit http://dsl.cs.uchicago.edu/MTAGS10/. 
Important Dates ------------------------------------------------------------------------------------------------ * Abstract Due: August 25th, 2010 * Papers Due: September 1st, 2010 * Notification of Acceptance: October 1st, 2010 * Camera Ready Papers Due: November 1st, 2010 * Workshop Date: November 15th, 2010 Committee Members ------------------------------------------------------------------------------------------------ Workshop Chairs * Ioan Raicu, Illinois Institute of Technology * Ian Foster, University of Chicago & Argonne National Laboratory * Yong Zhao, Microsoft Steering Committee * David Abramson, Monash University, Australia * Alok Choudhary, Northwestern University, USA * Jack Dongarra, University of Tennessee, USA * Geoffrey Fox, Indiana University, USA * Robert Grossman, University of Illinois at Chicago, USA * Arthur Maccabe, Oak Ridge National Labs, USA * Dan Reed, Microsoft Research, USA * Marc Snir, University of Illinois at Urbana Champaign, USA * Xian-He Sun, Illinois Institute of Technology, USA * Manish Parashar, Rutgers University, USA Technical Committee * Roger Barga, Microsoft Research, USA * Mihai Budiu, Microsoft Research, USA * Rajkumar Buyya, University of Melbourne, Australia * Henri Casanova, University of Hawaii at Manoa, USA * Jeff Chase, Duke University, USA * Peter Dinda, Northwestern University, USA * Catalin Dumitrescu, Fermi National Labs, USA * Evangelinos Constantinos, Massachusetts Institute of Technology, USA * Indranil Gupta, University of Illinois at Urbana Champaign, USA * Alexandru Iosup, Delft University of Technology, Netherlands * Florin Isaila, Universidad Carlos III de Madrid, Spain * Michael Isard, Microsoft Research, USA * Kamil Iskra, Argonne National Laboratory, USA * Daniel Katz, University of Chicago, USA * Tevfik Kosar, Louisiana State University, USA * Zhiling Lan, Illinois Institute of Technology, USA * Ignacio Llorente, Universidad Complutense de Madrid, Spain * Reagan Moore, University of North Carolina, Chapel Hill, USA * Jose Moreira, IBM Research, USA * Marlon Pierce, Indiana University, USA * Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory, USA * Matei Ripeanu, University of British Columbia, Canada * Alain Roy, University of Wisconsin Madison, USA * Edward Walker, Texas Advanced Computing Center, USA * Mike Wilde, University of Chicago & Argonne National Laboratory, USA * Matthew Woitaszek, The University Corporation for Atmospheric Research, USA * Justin Wozniak, Argonne National Laboratory, USA * Ken Yocum, University of California San Diego, USA -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor ================================================================= Computer Science Department Illinois Institute of Technology 10 W. 31st Street Stuart Building, Room 237D Chicago, IL 60616 ================================================================= Cel: 1-847-722-0876 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= From jon.monette at gmail.com Thu Aug 5 13:36:52 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Thu, 05 Aug 2010 13:36:52 -0500 Subject: [Swift-user] Pads submit problem Message-ID: <4C5B04C4.9030508@gmail.com> Has anyone ever seen this error and know what causes it? 
Caused by: Exitcode file not found 5 queue polls after the job was reported done -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From jon.monette at gmail.com Thu Aug 5 14:11:23 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Thu, 05 Aug 2010 14:11:23 -0500 Subject: [Swift-user] Coaster error Message-ID: <4C5B0CDB.1010404@gmail.com> Hello again, Also has anyone seen this error and know what it means? Worker task failed: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 29 at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.processCompleted(AbstractJobSubmissionTaskHandler.java:95) at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.processCompleted(AbstractExecutor.java:240) at org.globus.cog.abstraction.impl.scheduler.common.Job.processExitCode(Job.java:104) at org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:81) at org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177) at org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126) at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169) at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82) at java.lang.Thread.run(Thread.java:619) -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From hategan at mcs.anl.gov Thu Aug 5 15:33:45 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Aug 2010 15:33:45 -0500 Subject: [Swift-user] Coaster error In-Reply-To: <4C5B0CDB.1010404@gmail.com> References: <4C5B0CDB.1010404@gmail.com> Message-ID: <1281040425.8893.3.camel@blabla2.none> It means that qstat failed for some reason. Can you reproduce it? On Thu, 2010-08-05 at 14:11 -0500, Jonathan Monette wrote: > Hello again, > Also has anyone seen this error and know what it means? 
> > Worker task failed: > org.globus.cog.abstraction.impl.common.execution.JobException: Job > failed with an exit code of 29 > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.processCompleted(AbstractJobSubmissionTaskHandler.java:95) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.processCompleted(AbstractExecutor.java:240) > at > org.globus.cog.abstraction.impl.scheduler.common.Job.processExitCode(Job.java:104) > at > org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:81) > at > org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177) > at > org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82) > at java.lang.Thread.run(Thread.java:619) > From hategan at mcs.anl.gov Thu Aug 5 15:34:58 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Aug 2010 15:34:58 -0500 Subject: [Swift-user] Coaster error In-Reply-To: <1281040425.8893.3.camel@blabla2.none> References: <4C5B0CDB.1010404@gmail.com> <1281040425.8893.3.camel@blabla2.none> Message-ID: <1281040498.8893.4.camel@blabla2.none> Ignore the message I just sent. On Thu, 2010-08-05 at 15:33 -0500, Mihael Hategan wrote: > It means that qstat failed for some reason. > > Can you reproduce it? > > On Thu, 2010-08-05 at 14:11 -0500, Jonathan Monette wrote: > > Hello again, > > Also has anyone seen this error and know what it means? > > > > Worker task failed: > > org.globus.cog.abstraction.impl.common.execution.JobException: Job > > failed with an exit code of 29 > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.processCompleted(AbstractJobSubmissionTaskHandler.java:95) > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.processCompleted(AbstractExecutor.java:240) > > at > > org.globus.cog.abstraction.impl.scheduler.common.Job.processExitCode(Job.java:104) > > at > > org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:81) > > at > > org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177) > > at > > org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126) > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169) > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82) > > at java.lang.Thread.run(Thread.java:619) > > > From iraicu at cs.uchicago.edu Sat Aug 7 14:28:18 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 07 Aug 2010 14:28:18 -0500 Subject: [Swift-user] CFP: The 5th Workshop on Workflows in Support of Large-Scale Science 2010 Message-ID: <4C5DB3D2.2040703@cs.uchicago.edu> Call for Papers The 5th Workshop on Workflows in Support of Large-Scale Science in conjunction with SC?10 New Orleans, LA November 14, 2010 http://www.isi.edu/works10 Scientific workflows are a key technology that enables large-scale computations and service management on distributed resources. Workflows enable scientists to design complex analysis that are composed of individual application components or services and often such components and services are designed, developed, and tested collaboratively. 
The size of the data and the complexity of the analysis often lead to large amounts of shared resources, such as clusters and storage systems, being used to store the data sets and execute the workflows. The process of workflow design and execution in a distributed environment can be very complex and can involve multiple stages including their textual or graphical specification, the mapping of the high-level workflow descriptions onto the available resources, as well as monitoring and debugging of the subsequent execution. Further, since computations and data access operations are performed on shared resources, there is an increased interest in managing the fair allocation and management of those resources at the workflow level. Large-scale scientific applications pose several requirements on the workflow systems. Besides the magnitude of data processed by the workflow components, the intermediate and resulting data needs to be annotated with provenance and other information to evaluate the quality of the data and support the repeatability of the analysis. Further, adequate workflow descriptions are needed to support the complex workflow management process which includes workflow creation, workflow reuse, and modifications made to the workflow over time?for example modifications to the individual workflow components. Additional workflow annotations may provide guidelines and requirements for resource mapping and execution. The Fifth Workshop on Workflows in Support of Large-Scale Science focuses on the entire workflow lifecycle including the workflow composition, mapping, robust execution and the recording of provenance information. The workshop also welcomes contributions in the applications area, where the requirements on the workflow management systems can be derived. Special attention will be paid to Bio-Computing applications which are the theme for SC10. The topics of the workshop include but are not limited to: * Workflow applications and their requirements with special emphasis on Bio-Computing applications. * Workflow composition, tools and languages. * Workflow user environments, including portals. * Workflow refinement tools that can manage the workflow mapping process. * Workflow execution in distributed environments. * Workflow fault-tolerance and recovery techniques. * Data-driven workflow processing. * Adaptive workflows. * Workflow monitoring. * Workflow optimizations. * Performance analysis of workflows * Workflow debugging. * Workflow provenance. * Interactive workflows. * Workflow interoperability * Mashups and workflows * Workflows on the cloud. Important Dates: Papers due September 3, 2010 Notifications of acceptance September 30, 2010 Final papers due October 8, 2010 We will accept both short (6pages) and long (10 page) papers. The papers should be in IEEE format. To submit the papers, please submit to EasyChair at http://www.easychair.org/conferences/?conf=works10 If you have questions, please email works10 at isi.edu -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor ================================================================= Computer Science Department Illinois Institute of Technology 10 W. 
31st Street Stuart Building, Room 237D Chicago, IL 60616 ================================================================= Cel: 1-847-722-0876 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= From iraicu at cs.uchicago.edu Tue Aug 10 20:03:02 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 10 Aug 2010 20:03:02 -0500 Subject: [Swift-user] CFP: The 20th International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC) 2011 Message-ID: <4C61F6C6.3020203@cs.uchicago.edu> Call For Papers The 20th International ACM Symposium on High-Performance Parallel and Distributed Computing http://www.hpdc.org/2011/ San Jose, California, June 8-11, 2011 The ACM International Symposium on High-Performance Parallel and Distributed Computing is the premier conference for presenting the latest research on the design, implementation, evaluation, and use of parallel and distributed systems for high end computing. The 20th installment of HPDC will take place in San Jose, California, in the heart of Silicon Valley. This year, HPDC is affiliated with the ACM Federated Computing Research Conference, consisting of fifteen leading ACM conferences all in one week. HPDC will be held on June 9-11 (Thursday through Saturday) with affiliated workshops taking place on June 8th (Wednesday). Submissions are welcomed on all forms of high performance parallel and distributed computing, including but not limited to clusters, clouds, grids, utility computing, data-intensive computing, multicore and parallel computing. All papers will be reviewed by a distinguished program committee, with a strong preference for rigorous results obtained in operational parallel and distributed systems. All papers will be evaluated for correctness, originality, potential impact, quality of presentation, and interest and relevance to the conference. In addition to traditional technical papers, we also invite experience papers. Such papers should present operational details of a production high end system or application, and draw out conclusions gained from operating the system or application. The evaluation of experience papers will place a greater weight on the real-world impact of the system and the value of conclusions to future system designs. Topics of interest include, but are not limited to: ------------------------------------------------------------------------------- # Applications of parallel and distributed computing. # Systems, networks, and architectures for high end computing. # Parallel and multicore issues and opportunities. # Virtualization of machines, networks, and storage. # Programming languages and environments. # I/O, file systems, and data management. # Data intensive computing. # Resource management, scheduling, and load-balancing. # Performance modeling, simulation, and prediction. # Fault tolerance, reliability and availability. # Security, configuration, policy, and management issues. # Models and use cases for utility, grid, and cloud computing. Authors are invited to submit technical papers of at most 12 pages in PDF format, including all figures and references. Papers should be formatted in the ACM Proceedings Style and submitted via the conference web site. Accepted papers will appear in the conference proceedings, and will be incorporated into the ACM Digital Library. 
Papers must be self-contained and provide the technical substance required for the program committee to evaluate the paper's contribution. Papers should thoughtfully address all related work, particularly work presented at previous HPDC events. Submitted papers must be original work that has not appeared in and is not under consideration for another conference or a journal. See the ACM Prior Publication Policy for more details. Workshops ------------------------------------------------------------------------------- We invite proposals for workshops affiliated with HPDC to be held on Wednesday, June 8th. For more information, see the Call for Workshops at http://www.hpdc.org/2011/cfw.php. Important Dates ------------------------------------------------------------------------------- Workshop Proposals Due 1 October 2010 Technical Papers Due: 17 January 2011 PAPER DEADLINE EXTENDED: 24 January 2011 (No further extensions!) Author Notifications: 28 February 2011 Final Papers Due: 24 March 2011 Conference Dates: 8-11 June 2011 Organization ------------------------------------------------------------------------------- General Chair Barney Maccabe, Oak Ridge National Laboratory Program Chair Douglas Thain, University of Notre Dame Workshops Chair Mike Lewis, Binghamton University Local Arrangements Chair Nick Wright, Lawrence Berkeley National Laboratory Publicity Chairs Alexandru Iosup, Delft University John Lange, University of Pittsburgh Ioan Raicu, Illinois Institute of Technology Yong Zhao, Microsoft Program Committee Kento Aida, National Institute of Informatics Henri Bal, Vrije Universiteit Roger Barga, Microsoft Jim Basney, NCSA John Bent, Los Alamos National Laboratory Ron Brightwell, Sandia National Laboratories Shawn Brown, Pittsburgh Supercomputer Center Claris Castillo, IBM Andrew A. Chien, UC San Diego and SDSC Ewa Deelman, USC Information Sciences Institute Peter Dinda, Northwestern University Scott Emrich, University of Notre Dame Dick Epema, TU-Delft Gilles Fedak, INRIA Renato Figuierdo, University of Florida Ian Foster, University of Chicago and Argonne National Laboratory Gabriele Garzoglio, Fermi National Accelerator Laboratory Rong Ge, Marquette University Sebastien Goasguen, Clemson University Kartik Gopalan, Binghamton University Dean Hildebrand, IBM Almaden Adriana Iamnitchi, University of South Florida Alexandru Iosup, TU-Delft Keith Jackson, Lawrence Berkeley Shantenu Jha, Louisiana State University Daniel S. Katz, University of Chicago and Argonne National Laboratory Thilo Kielmann, Vrije Universiteit Charles Killian, Purdue University Tevfik Kosar, Louisiana State University John Lange, University of Pittsburgh Mike Lewis, Binghamton University Barney Maccabe, Oak Ridge National Laboratory Grzegorz Malewicz, Google Satoshi Matsuoka, Tokyo Institute of Technology Jarek Nabrzyski, University of Notre Dame Manish Parashar, Rutgers University Beth Plale, Indiana University Ioan Raicu, Illinois Institute of Technology Philip Rhodes, University of Mississippi Philip Roth, Oak Ridge National Laboratory Karsten Schwan, Georgia Tech Martin Swany, University of Delaware Jon Weissman, University of Minnesota Dongyan Xu, Purdue University Ken Yocum, UCSD Yong Zhao, Microsoft Steering Committee Henri Bal, Vrije Universiteit Andrew A. Chien, UC San Diego and SDSC Peter Dinda, Northwestern University Ian Foster, Argonne National Laboratory and University of Chicago Dennis Gannon, Microsoft Salim Hariri, University of Arizona Dieter Kranzlmueller, Ludwig-Maximilians-Univ. 
Muenchen Manish Parashar, Rutgers University Karsten Schwan, Georgia Tech Jon Weissman, University of Minnesota (Chair) -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor ================================================================= Computer Science Department Illinois Institute of Technology 10 W. 31st Street Stuart Building, Room 237D Chicago, IL 60616 ================================================================= Cel: 1-847-722-0876 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Tue Aug 10 20:47:53 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 10 Aug 2010 20:47:53 -0500 Subject: [Swift-user] Call for Workshops at ACM HPDC 2011 Message-ID: <4C620149.3060609@cs.uchicago.edu> Call for Workshops The 20th International ACM Symposium on High-Performance Parallel and Distributed Computing http://www.hpdc.org/2011/ San Jose, California, June 8-11, 2011 ------------------------------------------------------------------------------- The ACM Symposium on High Performance Distributed Computing (HPDC) conference organizers invite proposals for Workshops to be held with HPDC in San Jose, California in June 2011. Workshops will run on June 8, preceding the main conference sessions June 9-11. HPDC 2011 is the 20th anniversary of HPDC, a preeminent conference in high performance computing, including cloud and grid computing. This year's conference will be held in conjunction with the Federated Computing Research Conference (FCRC), which includes high profile conferences in complementary research areas, providing a unique opportunity for a broader technical audience and wider impact for successful workshops. Workshops provide forums for discussion among researchers and practitioners on focused topics, emerging research areas, or both. Organizers may structure workshops as they see fit, possibly including invited talks, panel discussions, presentations of work in progress, fully peer-reviewed papers, or some combination. Workshops could be scheduled for a half day or a full day, depending on interest, space constraints, and organizer preference. Organizers should design workshops for approximately 20-40 participants, to balance impact and effective discussion. A workshop proposal must be made in writing, sent to Mike Lewis at mlewis at cs.binghamton.edu, and should include: # The name of the workshop # Several paragraphs describing the theme of the workshop and how it relates to the HPDC conference # Data about previous offerings of the workshop (if any), including attendance, number of papers, or presentations submitted and accepted # Names and affiliations of the workshop organizers, and if applicable, a significant portion of the program committee # A plan for attracting submissions and attendees Due to publication deadlines, workshops must operate within roughly the following timeline: papers due in early February (2-3 weeks after the HPDC deadline) and selected and sent to the publisher by late February. IMPORTANT DATES: # Workshop Proposals Deadline: October 1, 2010 # Notification: October 25, 2010 # Workshop CFPs Online and Distributed: November 8, 2010 -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor ================================================================= Computer Science Department Illinois Institute of Technology 10 W. 31st Street Stuart Building, Room 237D Chicago, IL 60616 ================================================================= Cel: 1-847-722-0876 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From dk0966 at cs.ship.edu Fri Aug 20 11:04:30 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Fri, 20 Aug 2010 12:04:30 -0400 Subject: [Swift-user] Exitcode file not found Message-ID: Hello, While running Mike's MODIS demo on PADS with pbs and coasters, I receive the following error: Worker task failed: org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Exitcode file not found 5 queue polls after the job was reported done at org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:66) at org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177) at org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126) at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169) at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82) at java.lang.Thread.run(Thread.java:619) I also receive errors relating to qdel: Canceling job Failed to shut down block org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Failed to cancel task. qdel returned with an exit code of 1 at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:159) at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:101) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:90) at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:44) at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:293) at org.globus.cog.abstraction.coaster.service.job.manager.Block.shutdown(Block.java:274) at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.shutdownBlocks(BlockQueueProcessor.java:518) at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.shutdown(BlockQueueProcessor.java:510) at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.shutdown(JobQueue.java:108) at org.globus.cog.abstraction.coaster.service.CoasterService.shutdown(CoasterService.java:249) at org.globus.cog.abstraction.coaster.service.ServiceShutdownHandler.requestComplete(ServiceShutdownHandler.java:28) at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:387) at org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel.actualSend(AbstractPipedChannel.java:86) at 
org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:115) Canceling job Checking through the mailing list archives, I found an instance where this was happening when the work directory was /var/tmp and not consistent across all nodes. The work directory in my configuration is /home/davidk/swiftwork, so I'm not sure what's causing it. Attached are the sites.xml, tc.data, swift.properties and the script I'm using. The full log can be found in /home/davidk/modis/run.0019. Thanks, David -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: modis.swift Type: application/octet-stream Size: 1270 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: sites.xml Type: text/xml Size: 730 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: swift.properties Type: application/octet-stream Size: 11600 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: tc.data Type: application/octet-stream Size: 2026 bytes Desc: not available URL: From dk0966 at cs.ship.edu Fri Aug 20 12:25:37 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Fri, 20 Aug 2010 13:25:37 -0400 Subject: [Swift-user] Exitcode file not found In-Reply-To: <4C6EA95F.1000106@gmail.com> References: <4C6EA95F.1000106@gmail.com> Message-ID: <1282325137.2492.11.camel@hbk> I added the internalhostname profile like you suggested, but for some reason I am still getting the exitcode and qdel errors. Is there anything else I'm missing? The updated info is in /home/davidk/modis/run.0020. Thanks. 3500 8 4 4 32 fast 3 10000 192.5.86.5 /home/davidk/swiftwork On Fri, 2010-08-20 at 11:12 -0500, Jonathan Monette wrote: > You must set the interanlhostname parameter for coasters. > > namespace="globus">192.5.86.6. This is when you are on the > login2 node for PADS. > namespace="globus">192.5.86.5. This is when you are on the > login1 node for PADS. From aespinosa at cs.uchicago.edu Mon Aug 23 14:41:35 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 23 Aug 2010 14:41:35 -0500 Subject: [Swift-user] coaster maxtime Message-ID: Hi, Can someone remind me what is the units of the maxtime parameter for coasters? Table 13 of the user guide does not specify it. Thanks, -Allan -- Allan M. Espinosa PhD student, Computer Science University of Chicago From wilde at mcs.anl.gov Mon Aug 23 14:44:15 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 23 Aug 2010 13:44:15 -0600 (GMT-06:00) Subject: [Swift-user] coaster maxtime In-Reply-To: Message-ID: <409785339.1210161282592655201.JavaMail.root@zimbra.anl.gov> seconds. - Mike ----- "Allan Espinosa" wrote: > Hi, > > Can someone remind me what is the units of the maxtime parameter for > coasters? Table 13 of the user guide does not specify it. > > Thanks, > -Allan > > -- > Allan M. 
Espinosa > PhD student, Computer Science > University of Chicago > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Mon Aug 23 14:46:42 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 23 Aug 2010 14:46:42 -0500 Subject: [Swift-user] coaster maxtime In-Reply-To: References: Message-ID: <1282592802.2046.0.camel@blabla2.none> Seconds. This should be changed to be the same as a walltime spec. On Mon, 2010-08-23 at 14:41 -0500, Allan Espinosa wrote: > Hi, > > Can someone remind me what is the units of the maxtime parameter for > coasters? Table 13 of the user guide does not specify it. > > Thanks, > -Allan > From wilde at mcs.anl.gov Mon Aug 23 14:49:55 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 23 Aug 2010 13:49:55 -0600 (GMT-06:00) Subject: [Swift-user] coaster maxtime In-Reply-To: <1282592802.2046.0.camel@blabla2.none> Message-ID: <571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov> I agree. How about for backwards compatibility, accept an entry without ":"s as seconds. That seems a reasonable convention for all time fields: hh:mm:ss or nnnn seconds. - Mike ----- "Mihael Hategan" wrote: > Seconds. This should be changed to be the same as a walltime spec. > > On Mon, 2010-08-23 at 14:41 -0500, Allan Espinosa wrote: > > Hi, > > > > Can someone remind me what is the units of the maxtime parameter > for > > coasters? Table 13 of the user guide does not specify it. > > > > Thanks, > > -Allan > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From aespinosa at cs.uchicago.edu Mon Aug 23 14:54:31 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 23 Aug 2010 14:54:31 -0500 Subject: [Swift-user] coaster maxtime In-Reply-To: <571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov> References: <1282592802.2046.0.camel@blabla2.none> <571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov> Message-ID: In other fields like "maxwalltime", the default in nnnn format is minutes 2010/8/23 Michael Wilde : > I agree. How about for backwards compatibility, accept an entry without ":"s as seconds. That seems a reasonable convention for all time fields: > hh:mm:ss or nnnn seconds. > > - Mike > > ----- "Mihael Hategan" wrote: > >> Seconds. This should be changed to be the same as a walltime spec. >> >> On Mon, 2010-08-23 at 14:41 -0500, Allan Espinosa wrote: >> > Hi, >> > >> > Can someone remind me what is the units of the maxtime parameter >> for >> > coasters? ?Table 13 of the user guide does not specify it. >> > >> > Thanks, >> > -Allan From aespinosa at cs.uchicago.edu Mon Aug 23 14:55:05 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 23 Aug 2010 14:55:05 -0500 Subject: [Swift-user] coaster maxtime In-Reply-To: <1282592802.2046.0.camel@blabla2.none> References: <1282592802.2046.0.camel@blabla2.none> Message-ID: Thanks guys. Looking at my old sites.xml files make sense now :) -Allan 2010/8/23 Mihael Hategan : > Seconds. This should be changed to be the same as a walltime spec. 
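To make the two time conventions in this thread concrete: maxtime is read as seconds (the lifetime requested for a coaster block), while maxwalltime in plain nnnn form is read as minutes (the per-job limit). The pool entry below is only a sketch, not a copy of anyone's configuration: the execution element and the 3600/60 values are placeholders, while the internalhostname address, queue and work directory are the ones quoted earlier in the PADS exchange.

<pool handle="pads">
  <!-- placeholder; use the jobmanager/url appropriate for your site -->
  <execution provider="coaster" jobmanager="local:pbs" url="localhost"/>
  <!-- address the workers use to call back to the coaster service (login1 on PADS) -->
  <profile namespace="globus" key="internalhostname">192.5.86.5</profile>
  <!-- maxtime is in seconds: 3600 means a one-hour coaster block -->
  <profile namespace="globus" key="maxtime">3600</profile>
  <!-- maxwalltime as a plain number is in minutes: 60 also means one hour per job -->
  <profile namespace="globus" key="maxwalltime">60</profile>
  <profile namespace="globus" key="queue">fast</profile>
  <workdirectory>/home/davidk/swiftwork</workdirectory>
</pool>

Setting both keys to the same literal, for example 3600 and 3600, therefore asks for a one-hour block but a sixty-hour per-job limit, which is exactly the kind of mismatch described a little further down in this thread.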
> > On Mon, 2010-08-23 at 14:41 -0500, Allan Espinosa wrote: >> Hi, >> >> Can someone remind me what is the units of the maxtime parameter for >> coasters? ?Table 13 of the user guide does not specify it. >> >> Thanks, >> -Allan From hategan at mcs.anl.gov Mon Aug 23 15:01:56 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 23 Aug 2010 15:01:56 -0500 Subject: [Swift-user] coaster maxtime In-Reply-To: <571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov> References: <571057521.1210591282592995544.JavaMail.root@zimbra.anl.gov> Message-ID: <1282593716.2228.2.camel@blabla2.none> On Mon, 2010-08-23 at 13:49 -0600, Michael Wilde wrote: > I agree. How about for backwards compatibility, accept an entry without ":"s as seconds. That seems a reasonable convention for all time fields: > hh:mm:ss or nnnn seconds. Except that for walltimes in general that would mean minutes. Which is exactly the cause for some funny troubles. Justin an I stared at (maxtime="3600" walltime="3600" ) for quite a while before figuring out that one was minutes and the other seconds. Mihael From wilde at mcs.anl.gov Thu Aug 26 23:11:55 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 26 Aug 2010 22:11:55 -0600 (GMT-06:00) Subject: [Swift-user] Errors in 13-site OSG run: lazy error question In-Reply-To: Message-ID: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov> Glen, I wonder if whats happening here is that Swift will retry and lazily run past *job* errors, but the error below (a mapping error) is maybe being treated as an error in Swift's interpretation of the script itself, and this causes an immediate halt to execution? Can anyone confirm that this is whats happening, and if it is the expected behavior? Also, Glen, 2 questions: 1) Isn't the error below the one that was fixed by Mihael in a recent revision - the same one I looked at earlier in the week? 2) Do you know what errors the "Failed but can retry:8" message is referring to? Where is the log/run directory for this run? How long did it take to get the 589 jobs finished? It would be good to start plotting these large multi-site runs to get a sense of how the scheduler is doing. - Mike ----- "Glen Hocky" wrote: > here's the result of my 13 site run that ran while i was out this > evening. It did pretty well! > but seems to have that problem of not quite lazy errors > ........ 
> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3 > Stage out:1 Finished successfully:586 > Progress: Submitting:3 Submitted:262 Active:144 Checking status:4 > Stage out:2 Finished successfully:587 > Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished > successfully:587 Failed but can retry:6 > Progress: Submitting:3 Submitted:262 Active:140 Finished > successfully:589 Failed but can retry:8 > Failed to transfer wrapper log from > glassRunCavities-20100826-1718-7gi0dzs1/info/5 on > UCHC_CBG_vdgateway.vcell.uchc.edu > Execution failed: > org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile) > for org.griphyn.vdl.mapping.DataNode identifier > tag:benc at ci.uchicago.edu > ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut > with no value at dataset=modelOut path=[3][1][11] (not closed) -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Aug 26 23:15:44 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 26 Aug 2010 23:15:44 -0500 Subject: [Swift-user] Errors in 13-site OSG run: lazy error question In-Reply-To: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov> References: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov> Message-ID: <1282882544.17690.0.camel@blabla2.none> Wait, wait, wait. Is this a new "invalid path (..logfile)" error? On Thu, 2010-08-26 at 22:11 -0600, Michael Wilde wrote: > Glen, I wonder if whats happening here is that Swift will retry and lazily run past *job* errors, but the error below (a mapping error) is maybe being treated as an error in Swift's interpretation of the script itself, and this causes an immediate halt to execution? > > Can anyone confirm that this is whats happening, and if it is the expected behavior? > > Also, Glen, 2 questions: > > 1) Isn't the error below the one that was fixed by Mihael in a recent revision - the same one I looked at earlier in the week? > > 2) Do you know what errors the "Failed but can retry:8" message is referring to? > > Where is the log/run directory for this run? How long did it take to get the 589 jobs finished? It would be good to start plotting these large multi-site runs to get a sense of how the scheduler is doing. > > - Mike > > > ----- "Glen Hocky" wrote: > > > here's the result of my 13 site run that ran while i was out this > > evening. It did pretty well! > > but seems to have that problem of not quite lazy errors > > ........ 
> > Progress: Submitting:3 Submitted:262 Active:147 Checking status:3 > > Stage out:1 Finished successfully:586 > > Progress: Submitting:3 Submitted:262 Active:144 Checking status:4 > > Stage out:2 Finished successfully:587 > > Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished > > successfully:587 Failed but can retry:6 > > Progress: Submitting:3 Submitted:262 Active:140 Finished > > successfully:589 Failed but can retry:8 > > Failed to transfer wrapper log from > > glassRunCavities-20100826-1718-7gi0dzs1/info/5 on > > UCHC_CBG_vdgateway.vcell.uchc.edu > > Execution failed: > > org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile) > > for org.griphyn.vdl.mapping.DataNode identifier > > tag:benc at ci.uchicago.edu > > ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut > > with no value at dataset=modelOut path=[3][1][11] (not closed) > From hategan at mcs.anl.gov Thu Aug 26 23:27:38 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 26 Aug 2010 23:27:38 -0500 Subject: [Swift-user] Errors in 13-site OSG run: lazy error question In-Reply-To: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov> References: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov> Message-ID: <1282883258.17811.2.camel@blabla2.none> On Thu, 2010-08-26 at 22:11 -0600, Michael Wilde wrote: > Glen, I wonder if whats happening here is that Swift will retry and > lazily run past *job* errors, but the error below (a mapping error) is > maybe being treated as an error in Swift's interpretation of the > script itself, and this causes an immediate halt to execution? > > Can anyone confirm that this is whats happening, and if it is the expected behavior? Right. Some errors are re-triable. Jobs get retried in the hope that they will go away. Which means that they don't get reported until the last round (and currently only the last error is reported). Some errors, such as the ones considered to be internal inconsistencies, will cause everything to fail immediately. From hockyg at uchicago.edu Thu Aug 26 23:54:22 2010 From: hockyg at uchicago.edu (Glen Hocky) Date: Fri, 27 Aug 2010 00:54:22 -0400 Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question In-Reply-To: <-3390454574925164218@unknownmsgid> References: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov> <-3390454574925164218@unknownmsgid> Message-ID: log is on engage-submit /home/hockyg/swift_logs/glassRunCavities-20100826-1718-7gi0dzs1.log On Fri, Aug 27, 2010 at 12:35 AM, Glen Hocky wrote: > Yes nominally the same error but it's not at the beginning but in the > middle now for some reason. I think it's a mid-stated error message. > I'll attach the log soon > > On Aug 27, 2010, at 12:11 AM, Michael Wilde wrote: > > > Glen, I wonder if whats happening here is that Swift will retry and > lazily run past *job* errors, but the error below (a mapping error) is maybe > being treated as an error in Swift's interpretation of the script itself, > and this causes an immediate halt to execution? > > > > Can anyone confirm that this is whats happening, and if it is the > expected behavior? > > > > Also, Glen, 2 questions: > > > > 1) Isn't the error below the one that was fixed by Mihael in a recent > revision - the same one I looked at earlier in the week? > > > > 2) Do you know what errors the "Failed but can retry:8" message is > referring to? > > > > Where is the log/run directory for this run? How long did it take to get > the 589 jobs finished? 
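For anyone trying to reproduce the behaviour Mihael describes here, the retry-and-keep-going mode is controlled from swift.properties. A minimal sketch, assuming the standard lazy.errors and execution.retries settings and purely illustrative values:

# keep running past individual job failures and report them at the end of the run
lazy.errors=true

# number of times a failed app invocation is resubmitted before it counts as failed;
# the "Failed but can retry:N" entries in the progress lines quoted above are jobs in this state
execution.retries=3

As explained above, errors that Swift treats as internal inconsistencies, such as the InvalidPathException mapping error, abort the run immediately regardless of these settings.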
It would be good to start plotting these large > multi-site runs to get a sense of how the scheduler is doing. > > > > - Mike > > > > > > ----- "Glen Hocky" wrote: > > > >> here's the result of my 13 site run that ran while i was out this > >> evening. It did pretty well! > >> but seems to have that problem of not quite lazy errors > >> ........ > >> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3 > >> Stage out:1 Finished successfully:586 > >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4 > >> Stage out:2 Finished successfully:587 > >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished > >> successfully:587 Failed but can retry:6 > >> Progress: Submitting:3 Submitted:262 Active:140 Finished > >> successfully:589 Failed but can retry:8 > >> Failed to transfer wrapper log from > >> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on > >> UCHC_CBG_vdgateway.vcell.uchc.edu > >> Execution failed: > >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile) > >> for org.griphyn.vdl.mapping.DataNode identifier > >> tag:benc at ci.uchicago.edu > >> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut > >> with no value at dataset=modelOut path=[3][1][11] (not closed) > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Fri Aug 27 10:06:03 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 27 Aug 2010 09:06:03 -0600 (GMT-06:00) Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question In-Reply-To: <-3390454574925164218@unknownmsgid> Message-ID: <1099728118.1372671282921563325.JavaMail.root@zimbra.anl.gov> Glen, as I recall, in the previous incident of this error we re-created with a simpler script, using only the "cat" app(), correct? Is it possible to re-create this similar error in a similar test script? Mihael, any thoughts on whether its likely that the prior fix did not address all cases? Thanks, - Mike ----- "Glen Hocky" wrote: > Yes nominally the same error but it's not at the beginning but in the > middle now for some reason. I think it's a mid-stated error message. > I'll attach the log soon > > On Aug 27, 2010, at 12:11 AM, Michael Wilde > wrote: > > > Glen, I wonder if whats happening here is that Swift will retry and > lazily run past *job* errors, but the error below (a mapping error) is > maybe being treated as an error in Swift's interpretation of the > script itself, and this causes an immediate halt to execution? > > > > Can anyone confirm that this is whats happening, and if it is the > expected behavior? > > > > Also, Glen, 2 questions: > > > > 1) Isn't the error below the one that was fixed by Mihael in a > recent revision - the same one I looked at earlier in the week? > > > > 2) Do you know what errors the "Failed but can retry:8" message is > referring to? > > > > Where is the log/run directory for this run? How long did it take > to get the 589 jobs finished? It would be good to start plotting > these large multi-site runs to get a sense of how the scheduler is > doing. > > > > - Mike > > > > > > ----- "Glen Hocky" wrote: > > > >> here's the result of my 13 site run that ran while i was out this > >> evening. It did pretty well! > >> but seems to have that problem of not quite lazy errors > >> ........ 
> >> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3 > >> Stage out:1 Finished successfully:586 > >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4 > >> Stage out:2 Finished successfully:587 > >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 > Finished > >> successfully:587 Failed but can retry:6 > >> Progress: Submitting:3 Submitted:262 Active:140 Finished > >> successfully:589 Failed but can retry:8 > >> Failed to transfer wrapper log from > >> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on > >> UCHC_CBG_vdgateway.vcell.uchc.edu > >> Execution failed: > >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path > (..logfile) > >> for org.griphyn.vdl.mapping.DataNode identifier > >> tag:benc at ci.uchicago.edu > >> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type > GlassOut > >> with no value at dataset=modelOut path=[3][1][11] (not closed) > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Fri Aug 27 11:34:05 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 27 Aug 2010 11:34:05 -0500 Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question In-Reply-To: <1099728118.1372671282921563325.JavaMail.root@zimbra.anl.gov> References: <1099728118.1372671282921563325.JavaMail.root@zimbra.anl.gov> Message-ID: <1282926845.19454.6.camel@blabla2.none> Or if you can find the stack trace of that specific error in the log, that might be useful. On Fri, 2010-08-27 at 09:06 -0600, Michael Wilde wrote: > Glen, as I recall, in the previous incident of this error we re-created with a simpler script, using only the "cat" app(), correct? > > Is it possible to re-create this similar error in a similar test script? > > Mihael, any thoughts on whether its likely that the prior fix did not address all cases? > > Thanks, > > - Mike > > > ----- "Glen Hocky" wrote: > > > Yes nominally the same error but it's not at the beginning but in the > > middle now for some reason. I think it's a mid-stated error message. > > I'll attach the log soon > > > > On Aug 27, 2010, at 12:11 AM, Michael Wilde > > wrote: > > > > > Glen, I wonder if whats happening here is that Swift will retry and > > lazily run past *job* errors, but the error below (a mapping error) is > > maybe being treated as an error in Swift's interpretation of the > > script itself, and this causes an immediate halt to execution? > > > > > > Can anyone confirm that this is whats happening, and if it is the > > expected behavior? > > > > > > Also, Glen, 2 questions: > > > > > > 1) Isn't the error below the one that was fixed by Mihael in a > > recent revision - the same one I looked at earlier in the week? > > > > > > 2) Do you know what errors the "Failed but can retry:8" message is > > referring to? > > > > > > Where is the log/run directory for this run? How long did it take > > to get the 589 jobs finished? It would be good to start plotting > > these large multi-site runs to get a sense of how the scheduler is > > doing. > > > > > > - Mike > > > > > > > > > ----- "Glen Hocky" wrote: > > > > > >> here's the result of my 13 site run that ran while i was out this > > >> evening. It did pretty well! > > >> but seems to have that problem of not quite lazy errors > > >> ........ 
> > >> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3 > > >> Stage out:1 Finished successfully:586 > > >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4 > > >> Stage out:2 Finished successfully:587 > > >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 > > Finished > > >> successfully:587 Failed but can retry:6 > > >> Progress: Submitting:3 Submitted:262 Active:140 Finished > > >> successfully:589 Failed but can retry:8 > > >> Failed to transfer wrapper log from > > >> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on > > >> UCHC_CBG_vdgateway.vcell.uchc.edu > > >> Execution failed: > > >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path > > (..logfile) > > >> for org.griphyn.vdl.mapping.DataNode identifier > > >> tag:benc at ci.uchicago.edu > > >> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type > > GlassOut > > >> with no value at dataset=modelOut path=[3][1][11] (not closed) > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > > From hategan at mcs.anl.gov Fri Aug 27 11:41:07 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 27 Aug 2010 11:41:07 -0500 Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question In-Reply-To: <1282926845.19454.6.camel@blabla2.none> References: <1099728118.1372671282921563325.JavaMail.root@zimbra.anl.gov> <1282926845.19454.6.camel@blabla2.none> Message-ID: <1282927267.19454.7.camel@blabla2.none> Or even the log itself, because I don't think I have access to engage-submit. On Fri, 2010-08-27 at 11:34 -0500, Mihael Hategan wrote: > Or if you can find the stack trace of that specific error in the log, > that might be useful. > > On Fri, 2010-08-27 at 09:06 -0600, Michael Wilde wrote: > > Glen, as I recall, in the previous incident of this error we re-created with a simpler script, using only the "cat" app(), correct? > > > > Is it possible to re-create this similar error in a similar test script? > > > > Mihael, any thoughts on whether its likely that the prior fix did not address all cases? > > > > Thanks, > > > > - Mike > > > > > > ----- "Glen Hocky" wrote: > > > > > Yes nominally the same error but it's not at the beginning but in the > > > middle now for some reason. I think it's a mid-stated error message. > > > I'll attach the log soon > > > > > > On Aug 27, 2010, at 12:11 AM, Michael Wilde > > > wrote: > > > > > > > Glen, I wonder if whats happening here is that Swift will retry and > > > lazily run past *job* errors, but the error below (a mapping error) is > > > maybe being treated as an error in Swift's interpretation of the > > > script itself, and this causes an immediate halt to execution? > > > > > > > > Can anyone confirm that this is whats happening, and if it is the > > > expected behavior? > > > > > > > > Also, Glen, 2 questions: > > > > > > > > 1) Isn't the error below the one that was fixed by Mihael in a > > > recent revision - the same one I looked at earlier in the week? > > > > > > > > 2) Do you know what errors the "Failed but can retry:8" message is > > > referring to? > > > > > > > > Where is the log/run directory for this run? How long did it take > > > to get the 589 jobs finished? It would be good to start plotting > > > these large multi-site runs to get a sense of how the scheduler is > > > doing. 
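For anyone digging through the run log mentioned above, one simple way to pull out the stack trace Mihael is asking for is a grep with trailing context; the log name is the one Glen posted and the amount of context is arbitrary:

grep -n -A 30 "InvalidPathException" glassRunCavities-20100826-1718-7gi0dzs1.log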
> > > > > > > > - Mike > > > > > > > > > > > > ----- "Glen Hocky" wrote: > > > > > > > >> here's the result of my 13 site run that ran while i was out this > > > >> evening. It did pretty well! > > > >> but seems to have that problem of not quite lazy errors > > > >> ........ > > > >> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3 > > > >> Stage out:1 Finished successfully:586 > > > >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4 > > > >> Stage out:2 Finished successfully:587 > > > >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 > > > Finished > > > >> successfully:587 Failed but can retry:6 > > > >> Progress: Submitting:3 Submitted:262 Active:140 Finished > > > >> successfully:589 Failed but can retry:8 > > > >> Failed to transfer wrapper log from > > > >> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on > > > >> UCHC_CBG_vdgateway.vcell.uchc.edu > > > >> Execution failed: > > > >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path > > > (..logfile) > > > >> for org.griphyn.vdl.mapping.DataNode identifier > > > >> tag:benc at ci.uchicago.edu > > > >> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type > > > GlassOut > > > >> with no value at dataset=modelOut path=[3][1][11] (not closed) > > > > > > > > -- > > > > Michael Wilde > > > > Computation Institute, University of Chicago > > > > Mathematics and Computer Science Division > > > > Argonne National Laboratory > > > > > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From matthew.woitaszek at gmail.com Fri Aug 27 14:10:54 2010 From: matthew.woitaszek at gmail.com (Matthew Woitaszek) Date: Fri, 27 Aug 2010 13:10:54 -0600 Subject: [Swift-user] Deleting no longer necessary anonymous files in _concurrent Message-ID: Good afternoon, I'm working with a script that creates arrays of intermediate files using the anonymous concurrent mapper, such as: file wgt_file[]; As I expect, all of these files get generated in the remote swift temporary directory and are then returned to the _concurrent directory on the host executing Swift. However, in this particular application, they're then immediately consumed by a subsequent procedure and never needed again. Is there a way to configure Swift or the file mapper declaration to delete these files after the remaining script "consumes" them? (That is, after all procedures relying on them as inputs have been executed?) Or can (should?) that be done manually? More speculatively, is there a way to keep files like these on the execution host and not even bring them back to _concurrent? (With loss of generality, I'm executing on a single site, and don't really ever need the file locally, for restarts or staging to another site.) Any advice about managing copies of large intermediate data files in the Swift execution context would be appreciated! Matthew From wozniak at mcs.anl.gov Mon Aug 30 16:54:29 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Mon, 30 Aug 2010 16:54:29 -0500 (CDT) Subject: [Swift-user] Deleting no longer necessary anonymous files in _concurrent In-Reply-To: References: Message-ID: Hi Matthew Deleting files is out of the scope of the Swift language. You can of course remove them yourself in your scripts, and as long as Swift does not try to stage them out you should be fine. You may want to look at external variables as another way to approach this (manual 2.5). 
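A rough sketch of the external-variable idiom Justin points to, with made-up procedure names and a placeholder /scratch path; only the general pattern comes from section 2.5 of the user guide. The intermediate file is written and read at a path the two apps agree on, on storage visible to all jobs on the site, and Swift only tracks an external token that carries the ordering, so nothing is staged back to _concurrent and the consumer may delete the file once it is done with it.

type file;

// make_weights writes its output itself to a shared path on the execution site
// and hands Swift only a completion token; the weights file never comes back
// to _concurrent.
app (external wgt_done) make_weights (file obs) {
  make_weights @obs "/scratch/shared/wgt.tmp";
}

// The external input gives Swift the dependency: this runs only after
// make_weights has finished. Its wrapper reads the temporary file and may
// remove it afterwards, since Swift never tries to stage it out.
app (file out) consume_weights (external wgt_done) {
  consume_weights "/scratch/shared/wgt.tmp" @out;
}

file obs <"input.dat">;
file result <"result.dat">;
external wgt_done;

wgt_done = make_weights(obs);
result = consume_weights(wgt_done);

The same token could equally be returned by the consumer and handed to a small cleanup app, which is the "remove them yourself in your scripts" route while still keeping Swift's progress model intact.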
Using external variables you can manage the files in your scripts while maintaining the Swift progress model. Justin On Fri, 27 Aug 2010, Matthew Woitaszek wrote: > Good afternoon, > > I'm working with a script that creates arrays of intermediate files > using the anonymous concurrent mapper, such as: > > file wgt_file[]; > > As I expect, all of these files get generated in the remote swift > temporary directory and are then returned to the _concurrent directory > on the host executing Swift. However, in this particular application, > they're then immediately consumed by a subsequent procedure and never > needed again. > > Is there a way to configure Swift or the file mapper declaration to > delete these files after the remaining script "consumes" them? (That > is, after all procedures relying on them as inputs have been > executed?) Or can (should?) that be done manually? > > More speculatively, is there a way to keep files like these on the > execution host and not even bring them back to _concurrent? (With loss > of generality, I'm executing on a single site, and don't really ever > need the file locally, for restarts or staging to another site.) > > Any advice about managing copies of large intermediate data files in > the Swift execution context would be appreciated! > > Matthew > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > -- Justin M Wozniak From hockyg at gmail.com Thu Aug 26 23:36:23 2010 From: hockyg at gmail.com (Glen Hocky) Date: Fri, 27 Aug 2010 04:36:23 -0000 Subject: [Swift-user] Re: Errors in 13-site OSG run: lazy error question In-Reply-To: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov> References: <862050724.1362391282882315252.JavaMail.root@zimbra.anl.gov> Message-ID: <-3390454574925164218@unknownmsgid> Yes nominally the same error but it's not at the beginning but in the middle now for some reason. I think it's a mid-stated error message. I'll attach the log soon On Aug 27, 2010, at 12:11 AM, Michael Wilde wrote: > Glen, I wonder if whats happening here is that Swift will retry and lazily run past *job* errors, but the error below (a mapping error) is maybe being treated as an error in Swift's interpretation of the script itself, and this causes an immediate halt to execution? > > Can anyone confirm that this is whats happening, and if it is the expected behavior? > > Also, Glen, 2 questions: > > 1) Isn't the error below the one that was fixed by Mihael in a recent revision - the same one I looked at earlier in the week? > > 2) Do you know what errors the "Failed but can retry:8" message is referring to? > > Where is the log/run directory for this run? How long did it take to get the 589 jobs finished? It would be good to start plotting these large multi-site runs to get a sense of how the scheduler is doing. > > - Mike > > > ----- "Glen Hocky" wrote: > >> here's the result of my 13 site run that ran while i was out this >> evening. It did pretty well! >> but seems to have that problem of not quite lazy errors >> ........ 
>> Progress: Submitting:3 Submitted:262 Active:147 Checking status:3 >> Stage out:1 Finished successfully:586 >> Progress: Submitting:3 Submitted:262 Active:144 Checking status:4 >> Stage out:2 Finished successfully:587 >> Progress: Submitting:3 Submitted:262 Active:142 Stage out:2 Finished >> successfully:587 Failed but can retry:6 >> Progress: Submitting:3 Submitted:262 Active:140 Finished >> successfully:589 Failed but can retry:8 >> Failed to transfer wrapper log from >> glassRunCavities-20100826-1718-7gi0dzs1/info/5 on >> UCHC_CBG_vdgateway.vcell.uchc.edu >> Execution failed: >> org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile) >> for org.griphyn.vdl.mapping.DataNode identifier >> tag:benc at ci.uchicago.edu >> ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 type GlassOut >> with no value at dataset=modelOut path=[3][1][11] (not closed) > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory >