<html><body><div style="color:#000; background-color:#fff; font-family:times new roman, new york, times, serif;font-size:12pt"><div><span>Hi Jon and David,</span></div><div><br><span></span></div><div><span>I was running this on Surveyor, and I got the workload from your home directory. Could you please check the workload for data corruption (perhaps using MD5 checksums)?</span></div><div><br></div><div>Regarding the hang checker: it kicked in with the swift-pipeline benchmarks too. I agree with David; I also think this is due to a storage slowdown.</div><div><br></div><div>Thank you</div><div>Emalayan</div><div><br></div><div style="font-family: times new roman, new york, times, serif; font-size: 12pt;"> <div style="font-family: times new roman, new york, times, serif; font-size: 12pt;"> <div dir="ltr"> <font face="Arial" size="2"> <hr size="1"> <b><span style="font-weight:bold;">From:</span></b> Jonathan Monette
&lt;jonmon@mcs.anl.gov&gt;<br> <b><span style="font-weight: bold;">To:</span></b> David Kelly &lt;davidk@ci.uchicago.edu&gt; <br><b><span style="font-weight: bold;">Cc:</span></b> swift-devel@ci.uchicago.edu; MosaStore &lt;mosastore@googlegroups.com&gt;; Emalayan Vairavanathan &lt;svemalayan@yahoo.com&gt; <br> <b><span style="font-weight: bold;">Sent:</span></b> Thursday, 12 April 2012 7:29 AM<br> <b><span style="font-weight: bold;">Subject:</span></b> Re: [Swift-devel] Montage workload<br> </font> </div> <br>So that is my conclusion for the hang checker part (I think I saw this before when I ran on Surveyor, but that was a long time ago). I am not sure about the app failing, though. When I sent the tarball, I may have sent corrupted data. That is what I am checking right now.<br><br>On Apr 12, 2012, at 9:27 AM, David Kelly wrote:<br><br>> For what it's worth, I see the same hang checker messages early on in an unrelated script I am working
on. It seems to be triggered by reading a large number of input files from a slower shared filesystem. In my case, once it finds all the input files, the hang checker messages stop and the job continues as normal.<br>> <br>> [davidk@communicado scec-sim]$ swift -sites.file sites.grid-ps.xml -tc.file tc.data -config cf scec-sim.swift<br>> Swift trunk swift-r5746 cog-r3371<br>> <br>> RunID: 20120412-0914-0dsnyia7<br>> No events in 10s.<br>> <br>> Registered futures:<br>> ----<br>> <br>> Waiting threads:<br>> ----<br>> <br>> (input): found 5938 files<br>> Progress: time: Thu, 12 Apr 2012 09:14:34 -0500<br>> Progress: time: Thu, 12 Apr 2012 09:14:40 -0500 Initializing:1<br>> Find: <a href="http://localhost:50000" target="_blank">http://localhost:50000</a><br>> Find: keepalive(120), reconnect - <a href="http://localhost:50000" target="_blank">http://localhost:50000</a><br>>
Passive queue processor initialized. Callback URI is null<br>> Progress: time: Thu, 12 Apr 2012 09:14:42 -0500 Selecting site:25 Submitting:998 Submitted:1<br>> <br>> <br>> ----- Original Message -----<br>>> From: "Jonathan Monette" <<a ymailto="mailto:jonmon@mcs.anl.gov" href="mailto:jonmon@mcs.anl.gov">jonmon@mcs.anl.gov</a>><br>>> To: "Emalayan Vairavanathan" <<a ymailto="mailto:svemalayan@yahoo.com" href="mailto:svemalayan@yahoo.com">svemalayan@yahoo.com</a>><br>>> Cc: <a ymailto="mailto:swift-devel@ci.uchicago.edu" href="mailto:swift-devel@ci.uchicago.edu">swift-devel@ci.uchicago.edu</a>, "MosaStore" <<a ymailto="mailto:mosastore@googlegroups.com" href="mailto:mosastore@googlegroups.com">mosastore@googlegroups.com</a>><br>>> Sent: Thursday, April 12, 2012 9:12:10 AM<br>>> Subject: Re: [Swift-devel] Montage workload<br>>> So this looks like a problem in the
Swift code. The hang checker is<br>>> activated at the start of the execution, which is not good. Could you<br>>> point me to where you ran this? Was this on Surveyor? If it was not on<br>>> Surveyor I can give it a try. It looks like the projection phase is<br>>> trying to project empty files. This could be due to the files actually<br>>> being empty (I sent corrupted data) or Swift could not find the files but<br>>> ran mProjectPP anyway.<br>>> <br>>> <br>>> <br>>> On Apr 12, 2012, at 12:44 AM, Emalayan Vairavanathan wrote:<br>>> <br>>> <br>>> <br>>> <br>>> <br>>> Hi Jon,<br>>> <br>>> <br>>> I tried to run the large Montage workload which I got from you<br>>> recently on both PVFS and MosaStore. With both systems the workload<br>>> failed (I copied the standard output messages below). I guess this is<br>>> due to a problem
with the workload (because the system works with<br>>> the small workloads).<br>>> <br>>> Do you have any idea? Did this workload work for you?<br>>> <br>>> <br>>> <br>>> Thank you<br>>> Emalayan<br>>> <br>>> <br>>> <br>>> <br>>> <br>>> Swift trunk swift-r5704 (swift modified locally) cog-r3361 (cog<br>>> modified locally)<br>>> <br>>> RunID: 20120412-0530-vj96mfz5<br>>> No events in 10s.<br>>> <br>>> Registered futures:<br>>> ----<br>>> <br>>> Waiting threads:<br>>> ----<br>>> <br>>> No events in 10s.<br>>> <br>>> Registered futures:<br>>> ----<br>>> <br>>> Waiting threads:<br>>> ----<br>>> <br>>> No events in 10s.<br>>> <br>>> Registered futures:<br>>> ----<br>>> <br>>> Waiting threads:<br>>> ----<br>>>
<br>>> No events in 10s.<br>>> <br>>> Registered futures:<br>>> ----<br>>> <br>>> Waiting threads:<br>>> ----<br>>> <br>>> (input): found 4116 files<br>>> No events in 10s.<br>>> <br>>> Registered futures:<br>>> ----<br>>> <br>>> Waiting threads:<br>>> ----<br>>> <br>>> Failed to acquire exclusive lock on log file.<br>>> Progress: time: Thu, 12 Apr 2012 05:31:02 +0000<br>>> Progress: time: Progress: time: Thu, 12 Apr 2012 05:31:11 +0000Thu, 12<br>>> Apr 2012 05:31:11 +0000 Initializing:2 Initializing:2<br>>> <br>>> Progress: time: Thu, 12 Apr 2012 05:31:12 +0000 Initializing:1023<br>>> Selecting site:1<br>>> Progress: time: Thu, 12 Apr 2012 05:31:13 +0000 Selecting site:1020<br>>> Initializing site shared directory:1 Stage in:3<br>>> Progress: time: Thu, 12 Apr 2012 05:31:15 +0000 Selecting
site:1018<br>>> Stage in:5 Submitting:1<br>>> Find: <a href="http://172.17.3.12:12346" target="_blank" >http://172.17.3.12:12346</a><br>>> Find: keepalive(120), reconnect - <a href="http://172.17.3.12:12346" target="_blank" >http://172.17.3.12:12346</a><br>>> Passive queue processor initialized. Callback URI is<br>>> <a href="http://172.17.3.12:12345" target="_blank" >http://172.17.3.12:12345</a><br>>> Progress: time: Thu, 12 Apr 2012 05:31:16 +0000 Selecting site:1018<br>>> Active:6<br>>> Progress: time: Thu, 12 Apr 2012 05:31:24 +0000 Selecting site:1018<br>>> Active:5 Failed but can retry:1<br>>> EXCEPTION Exception in mProjectPP_wrap:<br>>> Arguments: [-X, raw_dir/2mass-atlas-991207s-j1130256.fits,<br>>> proj_dir/proj_2mass-atlas-991207s-j1130256.fits, header.hdr]<br>>> Host: persistent-coasters<br>>> Directory:<br>>>
SwiftMontage-20120412-0530-vj96mfz5/jobs/e/mProjectPP_wrap-eozxvrpk<br>>> stderr.txt:<br>>> stdout.txt: [struct stat="ERROR", msg="All pixels are blank."]<br>>> _______________________________________________<br>>> Swift-devel mailing list<br>>> <a ymailto="mailto:Swift-devel@ci.uchicago.edu" href="mailto:Swift-devel@ci.uchicago.edu">Swift-devel@ci.uchicago.edu</a><br>>> <a href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel" target="_blank">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel</a><br><br> </div> </div> </div></body></html>