[Swift-devel] Montage workload
Jonathan Monette
jonmon at mcs.anl.gov
Thu Apr 12 09:29:42 CDT 2012
So that is my conclusion for the hang checker part (I think I saw this before when I ran on surveyor, but that was a long time ago). I am not sure about the app failing, though. When I sent the tarball I may have sent corrupted data. That is what I am checking right now.
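A minimal sketch of one way to do that check, assuming the inputs are the .fits files under raw_dir (the directory name that appears in the mProjectPP arguments quoted below). The function name find_suspect_fits is hypothetical and is not part of Swift or Montage. It only flags files that are zero bytes long or that do not begin with the SIMPLE keyword that opens a standard FITS primary header. A file with a valid header but blank pixel data would still pass, so a clean result does not rule out corruption.

```python
from pathlib import Path

# A standard FITS primary header begins with the SIMPLE keyword.
FITS_MAGIC = b"SIMPLE"


def find_suspect_fits(directory):
    """Return (path, reason) pairs for .fits files that look empty or truncated.

    Hypothetical helper for illustration; it checks file size and the first
    header keyword only, not the pixel data.
    """
    suspects = []
    for path in sorted(Path(directory).glob("*.fits")):
        if path.stat().st_size == 0:
            suspects.append((path, "zero bytes"))
            continue
        with path.open("rb") as handle:
            if handle.read(len(FITS_MAGIC)) != FITS_MAGIC:
                suspects.append((path, "missing SIMPLE keyword"))
    return suspects
```

Calling find_suspect_fits("raw_dir") on the unpacked tarball and printing the result would list any input that is obviously damaged before mProjectPP ever sees it.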
On Apr 12, 2012, at 9:27 AM, David Kelly wrote:
> For what it's worth, I see the same hang checker messages early on in an unrelated script I am working on. It seems to be triggered by reading a large number of input files from a slower shared filesystem. In my case, once it finds all the input files, the hang checker messages stop and the job continues as normal.
>
> [davidk at communicado scec-sim]$ swift -sites.file sites.grid-ps.xml -tc.file tc.data -config cf scec-sim.swift
> Swift trunk swift-r5746 cog-r3371
>
> RunID: 20120412-0914-0dsnyia7
> No events in 10s.
>
> Registered futures:
> ----
>
> Waiting threads:
> ----
>
> (input): found 5938 files
> Progress: time: Thu, 12 Apr 2012 09:14:34 -0500
> Progress: time: Thu, 12 Apr 2012 09:14:40 -0500 Initializing:1
> Find: http://localhost:50000
> Find: keepalive(120), reconnect - http://localhost:50000
> Passive queue processor initialized. Callback URI is null
> Progress: time: Thu, 12 Apr 2012 09:14:42 -0500 Selecting site:25 Submitting:998 Submitted:1
>
>
> ----- Original Message -----
>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>> To: "Emalayan Vairavanathan" <svemalayan at yahoo.com>
>> Cc: swift-devel at ci.uchicago.edu, "MosaStore" <mosastore at googlegroups.com>
>> Sent: Thursday, April 12, 2012 9:12:10 AM
>> Subject: Re: [Swift-devel] Montage workload
>> So this looks like a problem in the Swift code. The hang checker is
>> activated at the start of the execution, which is not good. Could you
>> point me to where you ran this? Was this on surveyor? If it was not on
>> surveyor I can give it a try. It looks like the projection phase is
>> trying to project empty files. This could be because the files are
>> actually empty (I sent corrupted data) or because Swift cannot find
>> the files but ran mProjectPP anyway.
>>
>>
>>
>> On Apr 12, 2012, at 12:44 AM, Emalayan Vairavanathan wrote:
>>
>>
>>
>>
>>
>> Hi Jon,
>>
>>
>> I tried to run the large Montage workload which I got from you
>> recently on both PVFS and MosaStore. With both systems the workload
>> failed (I copied the standard output messages below). I guess this is
>> due to a problem with the workload (because the system works with
>> the small workloads).
>>
>> Do you have any idea? Did this workload work for you?
>>
>>
>>
>> Thank you
>> Emalayan
>>
>>
>>
>>
>>
>> Swift trunk swift-r5704 (swift modified locally) cog-r3361 (cog
>> modified locally)
>>
>> RunID: 20120412-0530-vj96mfz5
>> No events in 10s.
>>
>> Registered futures:
>> ----
>>
>> Waiting threads:
>> ----
>>
>> No events in 10s.
>>
>> Registered futures:
>> ----
>>
>> Waiting threads:
>> ----
>>
>> No events in 10s.
>>
>> Registered futures:
>> ----
>>
>> Waiting threads:
>> ----
>>
>> No events in 10s.
>>
>> Registered futures:
>> ----
>>
>> Waiting threads:
>> ----
>>
>> (input): found 4116 files
>> No events in 10s.
>>
>> Registered futures:
>> ----
>>
>> Waiting threads:
>> ----
>>
>> Failed to acquire exclusive lock on log file.
>> Progress: time: Thu, 12 Apr 2012 05:31:02 +0000
>> Progress: time: Progress: time: Thu, 12 Apr 2012 05:31:11 +0000Thu, 12
>> Apr 2012 05:31:11 +0000 Initializing:2 Initializing:2
>>
>> Progress: time: Thu, 12 Apr 2012 05:31:12 +0000 Initializing:1023
>> Selecting site:1
>> Progress: time: Thu, 12 Apr 2012 05:31:13 +0000 Selecting site:1020
>> Initializing site shared directory:1 Stage in:3
>> Progress: time: Thu, 12 Apr 2012 05:31:15 +0000 Selecting site:1018
>> Stage in:5 Submitting:1
>> Find: http://172.17.3.12:12346
>> Find: keepalive(120), reconnect - http://172.17.3.12:12346
>> Passive queue processor initialized. Callback URI is
>> http://172.17.3.12:12345
>> Progress: time: Thu, 12 Apr 2012 05:31:16 +0000 Selecting site:1018
>> Active:6
>> Progress: time: Thu, 12 Apr 2012 05:31:24 +0000 Selecting site:1018
>> Active:5 Failed but can retry:1
>> EXCEPTION Exception in mProjectPP_wrap:
>> Arguments: [-X, raw_dir/2mass-atlas-991207s-j1130256.fits,
>> proj_dir/proj_2mass-atlas-991207s-j1130256.fits, header.hdr]
>> Host: persistent-coasters
>> Directory:
>> SwiftMontage-20120412-0530-vj96mfz5/jobs/e/mProjectPP_wrap-eozxvrpk
>> stderr.txt:
>> stdout.txt: [struct stat="ERROR", msg="All pixels are blank."]
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel