[Swift-devel] Montage workload

Jonathan Monette jonmon at mcs.anl.gov
Thu Apr 12 09:29:42 CDT 2012


So that is my conclusion for the hang checker part (I think I saw this before when I ran on surveyor, but that was a long time ago). I am not sure about the app failing, though. I may have sent corrupted data when I sent the tarball; that is what I am checking right now.
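
A quick way to screen the input directory for obviously bad files could look like the sketch below. It is only a rough check under stated assumptions: the function name find_suspect_fits is hypothetical, the inputs are assumed to be plain uncompressed FITS files ending in ".fits", and a file that passes may still contain only blank pixels, so this cannot rule out the "All pixels are blank." error by itself.

```python
import os

# Every standard FITS file starts with the header card "SIMPLE  =".
FITS_MAGIC = b"SIMPLE  ="


def find_suspect_fits(directory):
    """Return (path, reason) pairs for .fits files that are empty
    or do not start with the FITS signature."""
    suspects = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".fits"):
            continue
        path = os.path.join(directory, name)
        if os.path.getsize(path) == 0:
            suspects.append((path, "empty"))
            continue
        with open(path, "rb") as handle:
            if handle.read(len(FITS_MAGIC)) != FITS_MAGIC:
                suspects.append((path, "bad header"))
    return suspects
```

Calling find_suspect_fits("raw_dir") on the directory named in the failing mProjectPP_wrap arguments and printing each returned pair would show whether any inputs are zero-length or truncated at the start.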

On Apr 12, 2012, at 9:27 AM, David Kelly wrote:

> For what it's worth, I see the same hang checker messages early on in an unrelated script I am working on. It seems to be triggered by reading a large number of input files from a slower shared filesystem. In my case, once it finds all the input files, the hang checker messages stop and the job continues as normal.
> 
> [davidk at communicado scec-sim]$ swift -sites.file sites.grid-ps.xml -tc.file tc.data -config cf scec-sim.swift
> Swift trunk swift-r5746 cog-r3371
> 
> RunID: 20120412-0914-0dsnyia7
> No events in 10s.
> 
> Registered futures:
> ----
> 
> Waiting threads:
> ----
> 
> (input): found 5938 files
> Progress:  time: Thu, 12 Apr 2012 09:14:34 -0500
> Progress:  time: Thu, 12 Apr 2012 09:14:40 -0500  Initializing:1
> Find: http://localhost:50000
> Find:  keepalive(120), reconnect - http://localhost:50000
> Passive queue processor initialized. Callback URI is null
> Progress:  time: Thu, 12 Apr 2012 09:14:42 -0500  Selecting site:25  Submitting:998  Submitted:1
> 
> 
> ----- Original Message -----
>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>> To: "Emalayan Vairavanathan" <svemalayan at yahoo.com>
>> Cc: swift-devel at ci.uchicago.edu, "MosaStore" <mosastore at googlegroups.com>
>> Sent: Thursday, April 12, 2012 9:12:10 AM
>> Subject: Re: [Swift-devel] Montage workload
>> So this looks like a problem in the Swift code. The hang checker is
>> activated at the start of the execution, which is not good. Could you
>> point me to where you ran this? Was this on surveyor? If it was not on
>> surveyor I can give it a try. It looks like the projection phase is
>> trying to project empty files. This could be because the files are
>> actually empty (I sent corrupted data) or because Swift cannot find the
>> files but ran mProjectPP anyway.
>> 
>> 
>> 
>> On Apr 12, 2012, at 12:44 AM, Emalayan Vairavanathan wrote:
>> 
>> 
>> 
>> 
>> 
>> Hi Jon,
>> 
>> 
>> I tried to run the large Montage workload that I got from you
>> recently on both PVFS and MosaStore. With both systems the workload
>> failed (I copied the standard output messages below). I guess this is
>> due to a problem with the workload itself (because both systems work
>> with the small workloads).
>> 
>> Do you have any idea? Did this workload work for you?
>> 
>> 
>> 
>> Thank you
>> Emalayan
>> 
>> 
>> 
>> 
>> 
>> Swift trunk swift-r5704 (swift modified locally) cog-r3361 (cog
>> modified locally)
>> 
>> RunID: 20120412-0530-vj96mfz5
>> No events in 10s.
>> 
>> Registered futures:
>> ----
>> 
>> Waiting threads:
>> ----
>> 
>> No events in 10s.
>> 
>> Registered futures:
>> ----
>> 
>> Waiting threads:
>> ----
>> 
>> No events in 10s.
>> 
>> Registered futures:
>> ----
>> 
>> Waiting threads:
>> ----
>> 
>> No events in 10s.
>> 
>> Registered futures:
>> ----
>> 
>> Waiting threads:
>> ----
>> 
>> (input): found 4116 files
>> No events in 10s.
>> 
>> Registered futures:
>> ----
>> 
>> Waiting threads:
>> ----
>> 
>> Failed to acquire exclusive lock on log file.
>> Progress: time: Thu, 12 Apr 2012 05:31:02 +0000
>> Progress: time: Progress: time: Thu, 12 Apr 2012 05:31:11 +0000Thu, 12
>> Apr 2012 05:31:11 +0000 Initializing:2 Initializing:2
>> 
>> Progress: time: Thu, 12 Apr 2012 05:31:12 +0000 Initializing:1023
>> Selecting site:1
>> Progress: time: Thu, 12 Apr 2012 05:31:13 +0000 Selecting site:1020
>> Initializing site shared directory:1 Stage in:3
>> Progress: time: Thu, 12 Apr 2012 05:31:15 +0000 Selecting site:1018
>> Stage in:5 Submitting:1
>> Find: http://172.17.3.12:12346
>> Find: keepalive(120), reconnect - http://172.17.3.12:12346
>> Passive queue processor initialized. Callback URI is
>> http://172.17.3.12:12345
>> Progress: time: Thu, 12 Apr 2012 05:31:16 +0000 Selecting site:1018
>> Active:6
>> Progress: time: Thu, 12 Apr 2012 05:31:24 +0000 Selecting site:1018
>> Active:5 Failed but can retry:1
>> EXCEPTION Exception in mProjectPP_wrap:
>> Arguments: [-X, raw_dir/2mass-atlas-991207s-j1130256.fits,
>> proj_dir/proj_2mass-atlas-991207s-j1130256.fits, header.hdr]
>> Host: persistent-coasters
>> Directory:
>> SwiftMontage-20120412-0530-vj96mfz5/jobs/e/mProjectPP_wrap-eozxvrpk
>> stderr.txt:
>> stdout.txt: [struct stat="ERROR", msg="All pixels are blank."]
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>> 
>> 



