[Swift-devel] Cant get auto-coasters to run from midway to beagle

Michael Wilde wilde at mcs.anl.gov
Sat Mar 9 16:24:17 CST 2013


I forgot to paste the error, sorry. It's below now, for real. When I dial down the throttle to 48 and start only 2 beagle nodes, I get further and the app calls reach the active state. The 317 files being staged in here are 17MB each.
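
For a sense of scale, the staging volume implied by those numbers can be worked out with a rough sketch like the one below. The file count, the per-file size, and the throttle of 48 come from the message above; treating 17MB as decimal megabytes, assuming every file is the same size, and the function name are assumptions made for illustration.

```python
def staging_volume_mb(num_files: int, mb_per_file: float) -> float:
    """Total data staged in, in MB, assuming every file has the same size."""
    return num_files * mb_per_file


# All 317 apps staging in at once (the unthrottled run).
full_run = staging_volume_mb(317, 17)
# Only 48 apps admitted at a time (the throttled run).
throttled = staging_volume_mb(48, 17)

print(f"full run:  {full_run:.0f} MB (~{full_run / 1000:.1f} GB)")
print(f"throttled: {throttled:.0f} MB in flight at once")
```

So the unthrottled run tries to push several gigabytes through the coaster channel in one burst, while the throttled run keeps well under one gigabyte in flight at a time.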

The Swift progress output and error are below:

RunID: 20130309-2204-qu9ck076
Progress:  time: Sat, 09 Mar 2013 22:04:34 +0000
Progress:  time: Sat, 09 Mar 2013 22:04:45 +0000  Submitting:316  Submitted:1
Progress:  time: Sat, 09 Mar 2013 22:04:51 +0000  Stage in:1  Submitted:316
Progress:  time: Sat, 09 Mar 2013 22:04:52 +0000  Stage in:25  Submitted:292
Progress:  time: Sat, 09 Mar 2013 22:04:53 +0000  Stage in:68  Submitted:249
Progress:  time: Sat, 09 Mar 2013 22:04:55 +0000  Stage in:113  Submitted:204
Progress:  time: Sat, 09 Mar 2013 22:04:56 +0000  Stage in:165  Submitted:152
Progress:  time: Sat, 09 Mar 2013 22:04:58 +0000  Stage in:177  Submitted:140
Progress:  time: Sat, 09 Mar 2013 22:05:00 +0000  Stage in:225  Submitted:92
Progress:  time: Sat, 09 Mar 2013 22:05:04 +0000  Stage in:241  Submitted:76
Progress:  time: Sat, 09 Mar 2013 22:05:05 +0000  Stage in:289  Submitted:28
Progress:  time: Sat, 09 Mar 2013 22:05:09 +0000  Stage in:305  Submitted:12
Progress:  time: Sat, 09 Mar 2013 22:05:34 +0000  Stage in:317
Progress:  time: Sat, 09 Mar 2013 22:06:04 +0000  Stage in:317
Progress:  time: Sat, 09 Mar 2013 22:06:34 +0000  Stage in:317
Progress:  time: Sat, 09 Mar 2013 22:07:04 +0000  Stage in:317
Progress:  time: Sat, 09 Mar 2013 22:07:34 +0000  Stage in:317
Progress:  time: Sat, 09 Mar 2013 22:08:04 +0000  Stage in:317
Progress:  time: Sat, 09 Mar 2013 22:08:34 +0000  Stage in:317
Channels: {null at https://192.5.86.107:50000=MetaChannel[https://192.5.86.107:50000] -> GSSCChannel-https://192.5.86.107:50000(2)[https://192.5.86.107:50000], /C=US/O=JavaCoG/OU=AutoCA/CN=User at https://192.5.86.107:50000=MetaChannel[service-60640] -> BufferingChannel, null at id://u-23c37c02-13d512f435d--7fff-u66598f98-13d512f434d--7fffC=MetaChannel[https://192.5.86.107:50000] -> GSSCChannel-https://192.5.86.107:50000(2)[https://192.5.86.107:50000], null at id://u66598f98-13d512f434d--8000-u-23c37c02-13d512f435d--8000S=MetaChannel[service-60640] -> BufferingChannel}
Context: service-60822
Meta context: service-60640
Channels: {null at https://192.5.86.107:50000=MetaChannel[https://192.5.86.107:50000] -> GSSCChannel-https://192.5.86.107:50000(2)[https://192.5.86.107:50000], /C=US/O=JavaCoG/OU=AutoCA/CN=User at https://192.5.86.107:50000=MetaChannel[service-60640] -> BufferingChannel, null at id://u-23c37c02-13d512f435d--7fff-u66598f98-13d512f434d--7fffC=MetaChannel[https://192.5.86.107:50000] -> GSSCChannel-https://192.5.86.107:50000(2)[https://192.5.86.107:50000], null at id://u66598f98-13d512f434d--8000-u-23c37c02-13d512f435d--8000S=MetaChannel[service-60640] -> BufferingChannel}
Context: service-60116
Meta context: service-60640
Channels: {null at https://192.5.86.107:50000=MetaChannel[https://192.5.86.107:50000] -> GSSCChannel-https://192.5.86.107:50000(2)[https://192.5.86.107:50000], /C=US/O=JavaCoG/OU=AutoCA/CN=User at https://192.5.86.107:50000=MetaChannel[service-60640] -> BufferingChannel, null at id://u-23c37c02-13d512f435d--7fff-u66598f98-13d512f434d--7fffC=MetaChannel[https://192.5.86.107:50000] -> GSSCChannel-https://192.5.86.107:50000(2)[https://192.5.86.107:50000], null at id://u66598f98-13d512f434d--8000-u-23c37c02-13d512f435d--8000S=MetaChannel[service-60640] -> BufferingChannel}
Context: service-60598
Meta context: service-60640
Progress:  time: Sat, 09 Mar 2013 22:09:04 +0000  Stage in:317
Progress:  time: Sat, 09 Mar 2013 22:09:08 +0000  Stage in:316  Active:1
Execution failed:
	Exception in getlanduse:
    Arguments: [home/wilde/osgdemo/modis/svn/data/modis/2002/h15v02.rgb]
    Host: beagle
    Directory: modis02-20130309-2204-qu9ck076/jobs/b/getlanduse-bmscjd6l

Caused by:
	Shutting down worker
	getLandUse, modis02.swift, line 20
error null

real	4m36.777s
user	2m55.240s
sys	0m3.837s
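
The progress lines above can be tallied mechanically to see how long a run sits in one state. The following is a minimal sketch, not part of Swift itself: the function name and regular expression are hypothetical, and it assumes every timestamp ends in "+0000" as in this log.

```python
import re

# A state name (letters and spaces) followed by ":" and a count, e.g. "Stage in:317".
STATE_RE = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")


def parse_progress(line):
    """Return {state: count} for one Swift 'Progress:' line."""
    # Everything after the "+0000" timestamp suffix holds the state counters.
    _, _, tail = line.partition("+0000")
    return {name.strip(): int(count) for name, count in STATE_RE.findall(tail)}


sample = "Progress:  time: Sat, 09 Mar 2013 22:09:08 +0000  Stage in:316  Active:1"
print(parse_progress(sample))
```

Applied to the run above, this shows all 317 jobs sitting in "Stage in" from 22:05:34 until 22:09:04, with the first job going active only just before the failure.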


---

With a throttle of 48 (.47) and 2 beagle nodes, I see:

Swift 0.94RC4 swift-r6284 cog-r3607 (cog modified locally)

RunID: 20130309-2214-1oi3rvea
Progress:  time: Sat, 09 Mar 2013 22:14:06 +0000
Progress:  time: Sat, 09 Mar 2013 22:14:17 +0000  Selecting site:269  Submitting:47  Submitted:1
Progress:  time: Sat, 09 Mar 2013 22:14:22 +0000  Selecting site:269  Stage in:1  Submitted:47
Progress:  time: Sat, 09 Mar 2013 22:14:28 +0000  Selecting site:269  Stage in:25  Submitted:23
Progress:  time: Sat, 09 Mar 2013 22:14:36 +0000  Selecting site:269  Stage in:48
Progress:  time: Sat, 09 Mar 2013 22:15:06 +0000  Selecting site:269  Stage in:48
Progress:  time: Sat, 09 Mar 2013 22:15:36 +0000  Selecting site:269  Stage in:48
Progress:  time: Sat, 09 Mar 2013 22:16:06 +0000  Selecting site:269  Stage in:48
Progress:  time: Sat, 09 Mar 2013 22:16:26 +0000  Selecting site:269  Stage in:47  Active:1
Progress:  time: Sat, 09 Mar 2013 22:16:27 +0000  Selecting site:269  Stage in:36  Active:12
Progress:  time: Sat, 09 Mar 2013 22:16:29 +0000  Selecting site:269  Stage in:24  Active:24
Progress:  time: Sat, 09 Mar 2013 22:16:34 +0000  Selecting site:269  Stage in:24  Active:23  Stage out:1
Progress:  time: Sat, 09 Mar 2013 22:16:35 +0000  Selecting site:269  Stage in:14  Active:33  Stage out:1
Execution failed:
	Exception in getlanduse:
    Arguments: [home/wilde/osgdemo/modis/svn/data/modis/2002/h08v04.rgb]
    Host: beagle
    Directory: modis02-20130309-2214-1oi3rvea/jobs/k/getlanduse-ko5qjd6l

Caused by:
	Application /lustre/beagle/davidk/modis/bin/getlanduse.sh failed with an exit code of 1
	getLandUse, modis02.swift, line 20

real	2m31.463s
user	1m33.238s
sys	0m2.160s
+ mv /home/wilde/.swift/runs/current/run024.1362867244 /home/wilde/.swift/runs/completed

This error is likely in the demo app code; just pasting here to show that with less concurrency it makes progress.
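
On the throttle numbers: assuming the mapping described in the Swift user guide, where a site's jobThrottle value allows roughly jobThrottle * 100 + 1 concurrent jobs, a setting of .47 corresponds to the 48 jobs seen in "Stage in:48" above. A small sketch of that arithmetic follows; the formula is treated as an assumption here, and both function names are made up for illustration.

```python
def concurrent_jobs(job_throttle: float) -> int:
    """Concurrent jobs allowed for a jobThrottle value (assumed: throttle * 100 + 1)."""
    return round(job_throttle * 100) + 1


def throttle_for(jobs: int) -> float:
    """Inverse mapping: the jobThrottle value that admits `jobs` concurrent jobs."""
    return (jobs - 1) / 100


print(concurrent_jobs(0.47))
print(throttle_for(48))
```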

----- Original Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Saturday, March 9, 2013 4:11:24 PM
> Subject: Re: [Swift-devel] Cant get auto-coasters to run from midway	to	beagle
> 
> Now I'm getting the error below (from running 317 simple MODIS apps
> concurrently).  I'm going to dial down the throttle first to see if
> the staging load is overwhelming either coasters or the
> midway-beagle path.
> 
> - Mike
> 
> 
> ----- Original Message -----
> > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > To: "Mihael Hategan" <hategan at mcs.anl.gov>
> > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > Sent: Saturday, March 9, 2013 3:59:22 PM
> > Subject: Re: [Swift-devel] Cant get auto-coasters to run from
> > midway to	beagle
> > 
> > I think we just got this working. Problems may have included the
> > need to pre-create the work directory and to specify a dotted IP
> > address on the external network for GLOBUS_HOSTNAME.  Will need to
> > experiment.  So it is likely that the proxy expiration time was not
> > a problem (although it's confusing).
> > 
> > Will report back on this once the needed steps are clear.
> > 
> > Thanks,
> > 
> > - Mike
> > 
> > ----- Original Message -----
> > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > Sent: Saturday, March 9, 2013 3:56:36 PM
> > > Subject: Re: Cant get auto-coasters to run from midway to beagle
> > > 
> > > Can you post .globus/coasters/coaster.log from beagle?
> > > 
> > > On Sat, 2013-03-09 at 15:46 -0600, Michael Wilde wrote:
> > > > Mihael, can you advise on this problem?
> > > > 
> > > > David and I are trying to run automatic coaster jobs from
> > > > midway login hosts and swift.rcc to beagle using ssh-cl:pbs.
> > > > 
> > > > My failed attempts are on midway under
> > > > /home/wilde/osgdemo/modis/svn; see e.g. run020 (which has
> > > > complete logs).
> > > > 
> > > > Quick question about the proxy files that get copied. Does this
> > > > look OK? :
> > > > 
> > > > 2013-03-09 21:24:46,895+0000 INFO  AutoCA Checking certificate /home/wilde/.globus/coasters/proxy.0.pem
> > > > 2013-03-09 21:24:46,967+0000 INFO  AutoCA Using certificate /home/wilde/.globus/coasters/proxy.0.pem with expiration date Sat Mar 23 19:25:53 GMT 2013
> > > > 
> > > > The proxy expiration time listed above is two hours *earlier*
> > > > than the current time (as seen in the message's UTC timestamp).
> > > > Is that correct, or a possible cause of this problem?
> > > > 
> > > > The main symptom seems to be this:
> > > > 
> > > > Execution failed:
> > > > 	Exception in getlanduse:
> > > >     Arguments: [../data/modis/2002/h00v09.rgb]
> > > >     Host: beagle
> > > >     Directory: modis01-20130309-2124-7ua3bde3/jobs/d/getlanduse-d24rhd6l
> > > > 
> > > > Caused by:
> > > > 	Could not submit job
> > > > Caused by:
> > > > 	Could not start coaster service
> > > > Caused by:
> > > > 	Task ended before registration was received.
> > > > Failed to download bootstrap jar from http://midway001.rcc.uchicago.edu:50001
> > > > ---
> > > > 
> > > > Yet I've verified that midway login4 (which is the target
> > > > system) can connect to this hostname and port (with nc -l and
> > > > telnet).
> > > > 
> > > > - Mike
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > 


