[Swift-devel] Montage workload

David Kelly davidk at ci.uchicago.edu
Thu Apr 12 09:27:10 CDT 2012


For what it's worth, I see the same hang checker messages early on in an unrelated script I am working on. They seem to be triggered by reading a large number of input files from a slower shared filesystem. In my case, once Swift has found all the input files, the hang checker messages stop and the job continues as normal.

[davidk at communicado scec-sim]$ swift -sites.file sites.grid-ps.xml -tc.file tc.data -config cf scec-sim.swift
Swift trunk swift-r5746 cog-r3371

RunID: 20120412-0914-0dsnyia7
No events in 10s.

Registered futures:
----

Waiting threads:
----

 (input): found 5938 files
Progress:  time: Thu, 12 Apr 2012 09:14:34 -0500
Progress:  time: Thu, 12 Apr 2012 09:14:40 -0500  Initializing:1
Find: http://localhost:50000
Find:  keepalive(120), reconnect - http://localhost:50000
Passive queue processor initialized. Callback URI is null
Progress:  time: Thu, 12 Apr 2012 09:14:42 -0500  Selecting site:25  Submitting:998  Submitted:1
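
If it helps, here is a rough Python sketch for checking both points in this thread outside of Swift: how long it takes just to enumerate the input files on the shared filesystem (compare against the 10s in the "No events in 10s." message), and whether any of the inputs are zero-length, as Jon suggests below. The function name scan_inputs and the directory passed to it are placeholders of mine, not anything from Swift or Montage. A zero-byte check only catches the crudest kind of bad input, so a clean result does not rule out corrupted data.

```python
import os
import time


def scan_inputs(directory):
    """Walk `directory` and return (file_count, empty_files, elapsed_seconds)."""
    start = time.monotonic()
    file_count = 0
    empty_files = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            file_count += 1
            # Zero bytes is only the most obvious sign of a bad input;
            # a non-empty but corrupted file is not detected here.
            if os.path.getsize(path) == 0:
                empty_files.append(path)
    elapsed = time.monotonic() - start
    return file_count, sorted(empty_files), elapsed
```

Calling scan_inputs("raw_dir") from the run directory and printing the three values should show whether the scan alone takes longer than the hang checker interval, and which files (if any) are empty.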


----- Original Message -----
> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
> To: "Emalayan Vairavanathan" <svemalayan at yahoo.com>
> Cc: swift-devel at ci.uchicago.edu, "MosaStore" <mosastore at googlegroups.com>
> Sent: Thursday, April 12, 2012 9:12:10 AM
> Subject: Re: [Swift-devel] Montage workload
> So this looks like a problem in the Swift code. The hang checker is
> activated at the start of the execution, which is not good. Could you
> point me to where you ran this? Was this on surveyor? If it was not on
> surveyor, I can give it a try. It looks like the projection phase is
> trying to project empty files. This could be because the files are
> actually empty (I sent corrupted data) or because Swift cannot find the
> files but ran mProjectPP anyway.
> 
> 
> 
> On Apr 12, 2012, at 12:44 AM, Emalayan Vairavanathan wrote:
> 
> Hi Jon,
> 
> 
> I tried to run the large Montage workload that I got from you
> recently on both PVFS and MosaStore. The workload failed on both
> systems (I copied the standard output messages below). I guess this is
> due to a problem with the workload, because the system works with
> the small workloads.
> 
> Do you have any idea? Did this workload work for you?
> 
> 
> 
> Thank you
> Emalayan
> 
> 
> 
> 
> 
> Swift trunk swift-r5704 (swift modified locally) cog-r3361 (cog
> modified locally)
> 
> RunID: 20120412-0530-vj96mfz5
> No events in 10s.
> 
> Registered futures:
> ----
> 
> Waiting threads:
> ----
> 
> No events in 10s.
> 
> Registered futures:
> ----
> 
> Waiting threads:
> ----
> 
> No events in 10s.
> 
> Registered futures:
> ----
> 
> Waiting threads:
> ----
> 
> No events in 10s.
> 
> Registered futures:
> ----
> 
> Waiting threads:
> ----
> 
> (input): found 4116 files
> No events in 10s.
> 
> Registered futures:
> ----
> 
> Waiting threads:
> ----
> 
> Failed to acquire exclusive lock on log file.
> Progress: time: Thu, 12 Apr 2012 05:31:02 +0000
> Progress: time: Progress: time: Thu, 12 Apr 2012 05:31:11 +0000Thu, 12
> Apr 2012 05:31:11 +0000 Initializing:2 Initializing:2
> 
> Progress: time: Thu, 12 Apr 2012 05:31:12 +0000 Initializing:1023
> Selecting site:1
> Progress: time: Thu, 12 Apr 2012 05:31:13 +0000 Selecting site:1020
> Initializing site shared directory:1 Stage in:3
> Progress: time: Thu, 12 Apr 2012 05:31:15 +0000 Selecting site:1018
> Stage in:5 Submitting:1
> Find: http://172.17.3.12:12346
> Find: keepalive(120), reconnect - http://172.17.3.12:12346
> Passive queue processor initialized. Callback URI is
> http://172.17.3.12:12345
> Progress: time: Thu, 12 Apr 2012 05:31:16 +0000 Selecting site:1018
> Active:6
> Progress: time: Thu, 12 Apr 2012 05:31:24 +0000 Selecting site:1018
> Active:5 Failed but can retry:1
> EXCEPTION Exception in mProjectPP_wrap:
> Arguments: [-X, raw_dir/2mass-atlas-991207s-j1130256.fits,
> proj_dir/proj_2mass-atlas-991207s-j1130256.fits, header.hdr]
> Host: persistent-coasters
> Directory:
> SwiftMontage-20120412-0530-vj96mfz5/jobs/e/mProjectPP_wrap-eozxvrpk
> stderr.txt:
> stdout.txt: [struct stat="ERROR", msg="All pixels are blank."]
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel


