[Swift-devel] Montage workload

Emalayan Vairavanathan svemalayan at yahoo.com
Thu Apr 12 09:39:25 CDT 2012


Hi Jon and David,

I ran this on Surveyor, using the workload I copied from your home directory. Could you please check the workload for data corruption (maybe using an MD5 checksum)?
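
A minimal sketch of such a check, assuming the original workload and the copy are both available as directory trees. The function names (md5_of_file, build_manifest, diff_manifests) are hypothetical helpers for illustration and are not part of Swift or Montage:

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(source, copy):
    """Return relative paths that are missing from, or differ in, the copy."""
    return sorted(name for name, digest in source.items() if copy.get(name) != digest)
```

Building one manifest on the source side and one on the copy, then passing both to diff_manifests, would list the files worth re-transferring. MD5 is used here only to detect accidental corruption, not as a security measure.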


Regarding the hang checker: it kicked in with the swift-pipeline benchmarks too. I agree with David; I also think this is due to a storage slowdown.

Thank you
Emalayan


________________________________
 From: Jonathan Monette <jonmon at mcs.anl.gov>
To: David Kelly <davidk at ci.uchicago.edu> 
Cc: swift-devel at ci.uchicago.edu; MosaStore <mosastore at googlegroups.com>; Emalayan Vairavanathan <svemalayan at yahoo.com> 
Sent: Thursday, 12 April 2012 7:29 AM
Subject: Re: [Swift-devel] Montage workload
 
So that is my conclusion for the hang checker part (I think I saw this before when I ran on Surveyor, but that was a long time ago). I am not sure about the app failure, though. When I sent the tarball, I may have sent corrupted data. That is what I am checking right now.
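
One quick way to test the "empty input files" possibility is to scan the unpacked input directory for zero-length files. This is only a sketch: find_empty_files is a hypothetical helper, the "*.fits" pattern is an assumption based on the file names in the log, and a file can still be corrupt (or contain only blank pixels) without being zero-length:

```python
from pathlib import Path


def find_empty_files(root, pattern="*.fits"):
    """Return relative paths of files under root that match pattern and have size 0."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )
```

An empty result rules out only truncation to zero bytes; a checksum comparison against the original data is still needed to rule out other corruption.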

On Apr 12, 2012, at 9:27 AM, David Kelly wrote:

> For what it's worth, I see the same hang checker messages early on in an unrelated script I am working on. It seems to be triggered by reading a large number of input files from a slower shared filesystem. In my case, once it finds all the input files, the hang checker messages stop and the job continues as normal.
> 
> [davidk at communicado scec-sim]$ swift -sites.file sites.grid-ps.xml -tc.file tc.data -config cf scec-sim.swift
> Swift trunk swift-r5746 cog-r3371
> 
> RunID: 20120412-0914-0dsnyia7
> No events in 10s.
> 
> Registered futures:
> ----
> 
> Waiting threads:
> ----
> 
> (input): found 5938 files
> Progress:  time: Thu, 12 Apr 2012 09:14:34 -0500
> Progress:  time: Thu, 12 Apr 2012 09:14:40 -0500  Initializing:1
> Find: http://localhost:50000
> Find:  keepalive(120), reconnect - http://localhost:50000
> Passive queue processor initialized. Callback URI is null
> Progress:  time: Thu, 12 Apr 2012 09:14:42 -0500  Selecting site:25  Submitting:998  Submitted:1
> 
> 
> ----- Original Message -----
>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>> To: "Emalayan Vairavanathan" <svemalayan at yahoo.com>
>> Cc: swift-devel at ci.uchicago.edu, "MosaStore" <mosastore at googlegroups.com>
>> Sent: Thursday, April 12, 2012 9:12:10 AM
>> Subject: Re: [Swift-devel] Montage workload
>> So this looks like a problem in the Swift code. The hang checker is
>> activated at the start of the execution which is not good. Could you
>> point me to where you ran this? Was this on surveyor? If it was not on
>> surveyor I can give it a try. It looks like the projection phase is
>> trying to project empty files. This could be due to the files actually
>> being empty (I sent corrupted data) or to Swift not finding the files
>> but running mProjectPP anyway.
>> 
>> 
>> 
>> On Apr 12, 2012, at 12:44 AM, Emalayan Vairavanathan wrote:
>> 
>> 
>> 
>> 
>> 
>> Hi Jon,
>> 
>> 
>> I tried to run the large Montage workload which I got from you
>> recently on both PVFS and MosaStore. With both systems the workload
>> failed (I copied the standard output messages below). I guess this is
>> due to the problem with the workload (because the system works with
>> the small workloads).
>> 
>> Do you have any idea? Did this workload work for you?
>> 
>> 
>> 
>> Thank you
>> Emalayan
>> 
>> 
>> 
>> 
>> 
>> Swift trunk swift-r5704 (swift modified locally) cog-r3361 (cog
>> modified locally)
>> 
>> RunID: 20120412-0530-vj96mfz5
>> No events in 10s.
>> 
>> Registered futures:
>> ----
>> 
>> Waiting threads:
>> ----
>> 
>> No events in 10s.
>> 
>> Registered futures:
>> ----
>> 
>> Waiting threads:
>> ----
>> 
>> No events in 10s.
>> 
>> Registered futures:
>> ----
>> 
>> Waiting threads:
>> ----
>> 
>> No events in 10s.
>> 
>> Registered futures:
>> ----
>> 
>> Waiting threads:
>> ----
>> 
>> (input): found 4116 files
>> No events in 10s.
>> 
>> Registered futures:
>> ----
>> 
>> Waiting threads:
>> ----
>> 
>> Failed to acquire exclusive lock on log file.
>> Progress: time: Thu, 12 Apr 2012 05:31:02 +0000
>> Progress: time: Progress: time: Thu, 12 Apr 2012 05:31:11 +0000Thu, 12
>> Apr 2012 05:31:11 +0000 Initializing:2 Initializing:2
>> 
>> Progress: time: Thu, 12 Apr 2012 05:31:12 +0000 Initializing:1023
>> Selecting site:1
>> Progress: time: Thu, 12 Apr 2012 05:31:13 +0000 Selecting site:1020
>> Initializing site shared directory:1 Stage in:3
>> Progress: time: Thu, 12 Apr 2012 05:31:15 +0000 Selecting site:1018
>> Stage in:5 Submitting:1
>> Find: http://172.17.3.12:12346
>> Find: keepalive(120), reconnect - http://172.17.3.12:12346
>> Passive queue processor initialized. Callback URI is
>> http://172.17.3.12:12345
>> Progress: time: Thu, 12 Apr 2012 05:31:16 +0000 Selecting site:1018
>> Active:6
>> Progress: time: Thu, 12 Apr 2012 05:31:24 +0000 Selecting site:1018
>> Active:5 Failed but can retry:1
>> EXCEPTION Exception in mProjectPP_wrap:
>> Arguments: [-X, raw_dir/2mass-atlas-991207s-j1130256.fits,
>> proj_dir/proj_2mass-atlas-991207s-j1130256.fits, header.hdr]
>> Host: persistent-coasters
>> Directory:
>> SwiftMontage-20120412-0530-vj96mfz5/jobs/e/mProjectPP_wrap-eozxvrpk
>> stderr.txt:
>> stdout.txt: [struct stat="ERROR", msg="All pixels are blank."]
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
