From davidk at ci.uchicago.edu Mon Jul 2 16:49:48 2012
From: davidk at ci.uchicago.edu (David Kelly)
Date: Mon, 2 Jul 2012 16:49:48 -0500 (CDT)
Subject: [Swift-devel] Java memory strangeness
In-Reply-To: <729118651.14275.1341264731032.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <2071486739.14422.1341265788826.JavaMail.root@zimbra-mb2.anl.gov>

Hello,

I installed Sun Java on a new machine I am working on. When I try to run it I see this:

-bash-3.2$ java -version
Error occurred during initialization of VM
Could not reserve enough space for object heap
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

The machine has 27G of memory free.

When I specify a value of -Xmx it seems to work fine until I get somewhere around between 4 and 8 gigs. From what I've read, the default heap size is 1/4th of the total memory up to 1 gig, so I have no idea why it's failing here.

I can run Swift manually from this machine because the swift shell script explicitly sets the heap size.

But, I run into problems when I use ssh:pbs to the machine.

How is the heap size determined when using ssh/coasters/bootstrapping? I've tried setting SWIFT_MAX_HEAP and COG_OPTS in my bashrc but that didn't seem to help. I might be able to get around this by creating some kind of java wrapper that explicitly sets heap size.. just curious how it currently gets set in this situation.

Thanks,
David

From wozniak at mcs.anl.gov Mon Jul 2 16:56:17 2012
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 02 Jul 2012 16:56:17 -0500
Subject: [Swift-devel] Java memory strangeness
In-Reply-To: <2071486739.14422.1341265788826.JavaMail.root@zimbra-mb2.anl.gov>
References: <2071486739.14422.1341265788826.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <4FF21901.7000300@mcs.anl.gov>

Is this a 32-bit machine?

On 07/02/2012 04:49 PM, David Kelly wrote:
> Hello,
>
> I installed Sun Java on a new machine I am working on.
When I try to run it I see this: > > -bash-3.2$ java -version > Error occurred during initialization of VM > Could not reserve enough space for object heap > Error: Could not create the Java Virtual Machine. > Error: A fatal exception has occurred. Program will exit. > > The machine has 27G of memory free. > > When I specify a value of -Xmx it seems to work fine until I get somewhere around between 4 and 8 gigs. From what I've read, the default heap size is 1/4th of the total memory up to 1 gig, so I have no idea why it's failing here. > > I can run Swift manually from this machine because the swift shell script explicitly sets the heap size. > > But, I run into problems when I use ssh:pbs to the machine. > > How is the heap size determined when using ssh/coasters/bootstrapping? I've tried setting SWIFT_MAX_HEAP and COG_OPTS in my bashrc but that didn't seem to help. I might be able to get around this by creating some kind of java wrapper that explicitly sets heap size.. just curious how it currently gets set in this situation. > > Thanks, > David > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Justin M Wozniak From davidk at ci.uchicago.edu Mon Jul 2 18:34:10 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Mon, 2 Jul 2012 18:34:10 -0500 (CDT) Subject: [Swift-devel] Java memory strangeness In-Reply-To: <4FF21901.7000300@mcs.anl.gov> Message-ID: <2028772575.14836.1341272050613.JavaMail.root@zimbra-mb2.anl.gov> bash-3.2$ uname -a Linux makena.uchicago.edu 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux Seems to be 64 bit ----- Original Message ----- > From: "Justin M Wozniak" > To: "David Kelly" > Cc: "swift-devel Devel" > Sent: Monday, July 2, 2012 4:56:17 PM > Subject: Re: [Swift-devel] Java memory strangeness > Is this a 32-bit machine? 
> > On 07/02/2012 04:49 PM, David Kelly wrote: > > Hello, > > > > I installed Sun Java on a new machine I am working on. When I try to > > run it I see this: > > > > -bash-3.2$ java -version > > Error occurred during initialization of VM > > Could not reserve enough space for object heap > > Error: Could not create the Java Virtual Machine. > > Error: A fatal exception has occurred. Program will exit. > > > > The machine has 27G of memory free. > > > > When I specify a value of -Xmx it seems to work fine until I get > > somewhere around between 4 and 8 gigs. From what I've read, the > > default heap size is 1/4th of the total memory up to 1 gig, so I > > have no idea why it's failing here. > > > > I can run Swift manually from this machine because the swift shell > > script explicitly sets the heap size. > > > > But, I run into problems when I use ssh:pbs to the machine. > > > > How is the heap size determined when using > > ssh/coasters/bootstrapping? I've tried setting SWIFT_MAX_HEAP and > > COG_OPTS in my bashrc but that didn't seem to help. I might be able > > to get around this by creating some kind of java wrapper that > > explicitly sets heap size.. just curious how it currently gets set > > in this situation. > > > > Thanks, > > David > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > -- > Justin M Wozniak From lpesce at uchicago.edu Mon Jul 2 19:16:39 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Mon, 2 Jul 2012 19:16:39 -0500 Subject: [Swift-devel] Java memory strangeness In-Reply-To: <2028772575.14836.1341272050613.JavaMail.root@zimbra-mb2.anl.gov> References: <2028772575.14836.1341272050613.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <4CEB8755-662B-46D8-8E2B-F7B58C93A6D4@uchicago.edu> David, I have observed a similar behavior on Beagle when I was running IBM java. 
The guessed diagnosis is that the JVM was in some way part 64-bit and part 32-bit, consistent with a heap analysis we did with some software.

The solution was to reinstall Java, and that worked. You can try it yourself by using the default Java on Beagle and then doing module load java.

This is my 2 cents. Java is not my thing and never will be.

Lorenzo

On Jul 2, 2012, at 6:34 PM, David Kelly wrote:

> bash-3.2$ uname -a
> Linux makena.uchicago.edu 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
>
> Seems to be 64 bit
>
> ----- Original Message -----
>> From: "Justin M Wozniak"
>> To: "David Kelly"
>> Cc: "swift-devel Devel"
>> Sent: Monday, July 2, 2012 4:56:17 PM
>> Subject: Re: [Swift-devel] Java memory strangeness
>> Is this a 32-bit machine?
>>
>> On 07/02/2012 04:49 PM, David Kelly wrote:
>>> Hello,
>>>
>>> I installed Sun Java on a new machine I am working on. When I try to
>>> run it I see this:
>>>
>>> -bash-3.2$ java -version
>>> Error occurred during initialization of VM
>>> Could not reserve enough space for object heap
>>> Error: Could not create the Java Virtual Machine.
>>> Error: A fatal exception has occurred. Program will exit.
>>>
>>> The machine has 27G of memory free.
>>>
>>> When I specify a value of -Xmx it seems to work fine until I get
>>> somewhere around between 4 and 8 gigs. From what I've read, the
>>> default heap size is 1/4th of the total memory up to 1 gig, so I
>>> have no idea why it's failing here.
>>>
>>> I can run Swift manually from this machine because the swift shell
>>> script explicitly sets the heap size.
>>>
>>> But, I run into problems when I use ssh:pbs to the machine.
>>>
>>> How is the heap size determined when using
>>> ssh/coasters/bootstrapping? I've tried setting SWIFT_MAX_HEAP and
>>> COG_OPTS in my bashrc but that didn't seem to help. I might be able
>>> to get around this by creating some kind of java wrapper that
>>> explicitly sets heap size..
just curious how it currently gets set >>> in this situation. >>> >>> Thanks, >>> David >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> >> -- >> Justin M Wozniak > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From davidk at ci.uchicago.edu Tue Jul 3 12:27:03 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Tue, 3 Jul 2012 12:27:03 -0500 (CDT) Subject: [Swift-devel] Java memory strangeness In-Reply-To: <2071486739.14422.1341265788826.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <864075473.17881.1341336423938.JavaMail.root@zimbra-mb2.anl.gov> I tried with the default version of Java installed on that system, as well as a version I installed locally and saw the same behavior in both. I ended up just writing a wrapper shell script that calls the real Java binary with a specified heap size. This seems to work well enough for now. David ----- Original Message ----- > From: "David Kelly" > To: "swift-devel Devel" > Sent: Monday, July 2, 2012 4:49:48 PM > Subject: [Swift-devel] Java memory strangeness > Hello, > > I installed Sun Java on a new machine I am working on. When I try to > run it I see this: > > -bash-3.2$ java -version > Error occurred during initialization of VM > Could not reserve enough space for object heap > Error: Could not create the Java Virtual Machine. > Error: A fatal exception has occurred. Program will exit. > > The machine has 27G of memory free. > > When I specify a value of -Xmx it seems to work fine until I get > somewhere around between 4 and 8 gigs. From what I've read, the > default heap size is 1/4th of the total memory up to 1 gig, so I have > no idea why it's failing here. 
>
> I can run Swift manually from this machine because the swift shell
> script explicitly sets the heap size.
>
> But, I run into problems when I use ssh:pbs to the machine.
>
> How is the heap size determined when using ssh/coasters/bootstrapping?
> I've tried setting SWIFT_MAX_HEAP and COG_OPTS in my bashrc but that
> didn't seem to help. I might be able to get around this by creating
> some kind of java wrapper that explicitly sets heap size.. just
> curious how it currently gets set in this situation.
>
> Thanks,
> David
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From hategan at mcs.anl.gov Sat Jul 7 19:43:41 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 07 Jul 2012 17:43:41 -0700
Subject: [Swift-devel] TUI
Message-ID: <1341708221.4653.0.camel@blabla>

The TUI is back. Please test and let me know if there are problems or if you want new things in there.

Mihael

From wilde at mcs.anl.gov Sun Jul 8 10:02:25 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 8 Jul 2012 10:02:25 -0500 (CDT)
Subject: [Swift-devel] TUI
In-Reply-To: <1341708221.4653.0.camel@blabla>
Message-ID: <990860902.11481.1341759745936.JavaMail.root@zimbra.anl.gov>

Thanks, Mihael!

Eventually (maybe gradually) I'd like to explore a more top-like plain-text version (including top -b, which would be an enhanced version of the current progress output, perhaps with %-like formatting options); and an HTML version with visually attractive output equivalent to the current TUI screens. The mechanism could drive real-time plotting of run behavior.

But all this in due time. First we need to use and tune the current TUI as-is and get a better feeling for what's useful for users and compelling and meaningful for demos.

Then we need to learn the code; I'm assuming it's structured in a way that already collects all the data and makes many different renderings "easy" to do.

- Mike

----- Original Message -----
> From: "Mihael Hategan"
> To: "Swift Devel"
> Sent: Saturday, July 7, 2012 7:43:41 PM
> Subject: [Swift-devel] TUI
> The TUI is back. Please test and let me know if there are problems or
> if
> you want new things in there.
>
> Mihael
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Sun Jul 8 14:45:15 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 08 Jul 2012 12:45:15 -0700
Subject: [Swift-devel] TUI
In-Reply-To: <990860902.11481.1341759745936.JavaMail.root@zimbra.anl.gov>
References: <990860902.11481.1341759745936.JavaMail.root@zimbra.anl.gov>
Message-ID: <1341776715.11863.4.camel@blabla>

On Sun, 2012-07-08 at 10:02 -0500, Michael Wilde wrote:
> Thanks, Mihael!
>
> Eventually (maybe gradually) I'd like to explore a more top-like
> plain-text version (including top -b, which would be an enhanced
> version of the current progress output, perhaps with %-like formatting
> options);

Yeah. There is already some minimal code for different frontends, and a top-like thing is one of them.

> and an HTML version with visually attractive output equivalent to the
> current TUI screens. The mechanism could drive real-time plotting of
> run behavior.

Right. It's made so that it could also parse logs (instead of intercepting log calls), so this could also be used offline.

>
> But all this in due time. First we need to use and tune the current
> TUI as-is and get a better feeling for whats useful for users and
> compelling and meaningful for demos.

Right. Please give me feedback on that.
>
> Then we need to learn the code; I assuming its structured in a way
> that already collects all the data and makes many different renderings
> "easy" to do.

Hopefully it is. What it does is build a dynamic state out of a stream of events. That state can be interpreted and displayed in whatever ways are needed.

Mihael

From wilde at mcs.anl.gov Mon Jul 9 08:19:55 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 9 Jul 2012 08:19:55 -0500 (CDT)
Subject: [Swift-devel] Java hangs on new rcc hardware
In-Reply-To: <1157733320.11787.1341809347312.JavaMail.root@zimbra.anl.gov>
Message-ID: <698872.11969.1341839995322.JavaMail.root@zimbra.anl.gov>

Java is acting strange for me on the new RCC "midway" cluster. The symptom is that the JVM seems to go into a tight CPU loop across several (3 or more) cores.

I see this first in the polling loop in the local scheduler provider, which calls Thread.sleep() and seems not to return. But each time I suspend and resume the JVM with ^Z, fg, ^Z, bg, it progresses further. Doing this twice enables the JVM to successfully complete the Swift script it's running (which tests a single PBS job).

I see what appears to be similar behavior in the Swift build. The ant redist will hang somewhere around where Swift compiles the antlr output, then a similar suspend-resume sequence will cause it to continue.

I saw this first with the Java 1.7 that was installed on midway; then with the latest JDK 1.6, and also with what I think is a more recent/latest JDK 1.7.

I'm still debugging, but any help or suggestions would be most welcome.
Thanks, - Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Mon Jul 9 11:24:35 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 9 Jul 2012 11:24:35 -0500 (CDT) Subject: [Swift-devel] Java hangs on new rcc hardware In-Reply-To: <698872.11969.1341839995322.JavaMail.root@zimbra.anl.gov> Message-ID: <879865581.12569.1341851075702.JavaMail.root@zimbra.anl.gov> The problematic Swift script seems to run fine with 32-bit Java 1.6. - Mike ----- Original Message ----- > From: "Michael Wilde" > To: "Mihael Hategan" > Cc: "Swift Devel" > Sent: Monday, July 9, 2012 8:19:55 AM > Subject: [Swift-devel] Java hangs on new rcc hardware > Java is acting strange for me on the new RCC "midway" cluster. The > symptom is that the jvm seems to go into a tight cpu loop across > several (3 or more) cores. > > I see this first in the polling loop in the local scheduler provider, > which calls Thread.sleep() and seems to not return. But each time I > suspect and resume the jvm with ^Z, fg, ^Z, bg, it progresses further. > Doing this twice enables the jvm to successfully complete the Swift > script its running (which tests a single PBS job). > > I see what appears to be similar behavior in the Swift build. The ant > redist will hang somewhere around where Swift compiles the antlr > output, then a similar suspect-resume sequence will cause it to > continue. > > I saw this first with the Java 1.7 that was installed on midway; then > with the latest JDK 1.6, and also with what I think is a more > recent/latest JDK 1.7. > > Im still debugging, but any help or suggestions would be most welcome. 
>
> Thanks,
>
> - Mike
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Sat Jul 14 01:16:47 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Jul 2012 23:16:47 -0700
Subject: [Swift-devel] hang checker updates
Message-ID: <1342246607.6265.2.camel@blabla>

I think Mike requested Swift stack traces in the hang checker instead of cryptic thread ids. That's in now.

Also in is a dependency loop detector in the hang checker. It doesn't detect static cycles, but ones that actually cause a hang. I'm not sure how well it works for real life situations, but I can confirm it works for simple things like a = f(b); b = f(a);. Please give it a shot.

From hategan at mcs.anl.gov Sat Jul 14 01:22:08 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Jul 2012 23:22:08 -0700
Subject: [Swift-devel] Java hangs on new rcc hardware
In-Reply-To: <879865581.12569.1341851075702.JavaMail.root@zimbra.anl.gov>
References: <879865581.12569.1341851075702.JavaMail.root@zimbra.anl.gov>
Message-ID: <1342246928.6265.5.camel@blabla>

For reference, this was solved. The issue was a bug in the Linux kernel futex() implementation (which is what Thread.sleep() and Object.wait() use) that was triggered by the leap second introduced a while ago.

For details, see https://lkml.org/lkml/2012/6/30/122

Mihael

On Mon, 2012-07-09 at 11:24 -0500, Michael Wilde wrote:
> The problematic Swift script seems to run fine with 32-bit Java 1.6.
> > - Mike > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "Mihael Hategan" > > Cc: "Swift Devel" > > Sent: Monday, July 9, 2012 8:19:55 AM > > Subject: [Swift-devel] Java hangs on new rcc hardware > > Java is acting strange for me on the new RCC "midway" cluster. The > > symptom is that the jvm seems to go into a tight cpu loop across > > several (3 or more) cores. > > > > I see this first in the polling loop in the local scheduler provider, > > which calls Thread.sleep() and seems to not return. But each time I > > suspect and resume the jvm with ^Z, fg, ^Z, bg, it progresses further. > > Doing this twice enables the jvm to successfully complete the Swift > > script its running (which tests a single PBS job). > > > > I see what appears to be similar behavior in the Swift build. The ant > > redist will hang somewhere around where Swift compiles the antlr > > output, then a similar suspect-resume sequence will cause it to > > continue. > > > > I saw this first with the Java 1.7 that was installed on midway; then > > with the latest JDK 1.6, and also with what I think is a more > > recent/latest JDK 1.7. > > > > Im still debugging, but any help or suggestions would be most welcome. > > > > Thanks, > > > > - Mike > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Sat Jul 14 06:29:05 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 14 Jul 2012 06:29:05 -0500 (CDT) Subject: [Swift-devel] hang checker updates In-Reply-To: <1342246607.6265.2.camel@blabla> Message-ID: <944936261.20011.1342265345927.JavaMail.root@zimbra.anl.gov> This sounds great, Mihael. Im eager to try it. 

In the meantime, can you help diagnose the specific deadlock in the PNNL "SPH" script? The deadlock doesn't occur until several hours into a large run on their Hopper Cray system, using complex MPI applications, so it's not easy for us (or even them) to replicate. But it does deadlock on every run they've tried recently.

From the Swift .log we have of one such deadlock, we determined the variables that are the likely cause. Now we're trying to determine the deadlocking statements by analyzing the source code and the .kml file, which tells us where the partial array closes are and where the code waits on those closes.

One feature which I *think* would help in this debugging is a source code listing that annotates where these closes and waits are inserted. Would it be possible to generate that from the .kml file? (Or even a listing that gives the lines or expressions where these events take place?)

The files for this problem are on the CI net at:
/home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712

In that dir:

- I extracted the source, hybrid.swift, from the .log.

- open00 gives the open variables in the first hang-checker event in the log:
  egrep -i -w 'local_output|writeDataOut|sphOutArr|sphOutNameArr|tarfile' *.log

- Khushbu thinks that line 388 is not getting executed as expected when the hang occurs:
  (local_forward_dat, gpg_stdout, conca_dat, concb_dat, concc_dat) = gpg(writeDataOut, h5part_files, iter, plot);

This suggests that writeDataOut is the open variable blocking this statement.

Working backward, writeDataOut is declared and set at lines 356-358:

file writeDataOut ;
trace("file writeDataOut = ", writeDataOut);
writeDataOut = writeData(sphOutNameArr);

This in turn is blocking on the open variable sphOutNameArr, which in turn is possibly blocking on sphOutArr.

In a similar deadlock we debugged in ParVis code, the script made a reference to a full array (i.e., passed an array to a function that blocked on a complete close of the array) *within* a code block in which the array was still open. That is, a partial close could not execute because the block had not completed, and the block could not complete until the partial close was done. I'm not sure this is the same situation, but it's possible. The script has many conditionals, which could explain why it doesn't deadlock until long into execution.

If we could trace all the references to the open variables, including all partial closes and all waits on those closes, we might be able to identify and eliminate the deadlock.

We can clearly see the partial closes and the waits on these in the kml, but mapping the KML to source code lines, while possible, is tedious and manual as far as I can tell.

Ideally, we could have a tool that does this given the source, the kml, and the log. I'm hoping your new trace code either does this or comes close. Early next week I'll try to help Karen and Khushbu do this, unless you can help them sooner.

Thanks,

- Mike

----- Original Message -----
> From: "Mihael Hategan"
> To: "Swift Devel"
> Sent: Saturday, July 14, 2012 1:16:47 AM
> Subject: [Swift-devel] hang checker updates
> I think mike requested swift stack traces in the hang checker instead
> of
> cryptic thread ids. That's in now.
>
> Also in is a dependency loop detector in the hang checker. It doesn't
> detect static cycles, but ones that actually cause a hang. I'm not
> sure
> how well it works for real life situations, but I can confirm it works
> for simple things like a = f(b); b = f(a);. Please give it a shot.
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Sat Jul 14 06:41:21 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 14 Jul 2012 06:41:21 -0500 (CDT)
Subject: [Swift-devel] hang checker updates
In-Reply-To: <944936261.20011.1342265345927.JavaMail.root@zimbra.anl.gov>
Message-ID: <1870636456.20018.1342266081332.JavaMail.root@zimbra.anl.gov>

Mihael, thinking over the PNNL SPH hang, if your new code would indeed print the stack traces of the hanging Swift threads, that would likely identify the deadlock right away, essentially performing the logic that we're trying to do manually by deduction. So I'll try to get them to test with the new version asap.

In the meantime, can you test against the hang below? This is a simple re-creation of the ParVis deadlock I mentioned in my prior post.

One thing I noticed in the current PNNL incident is that the thread IDs which are listed by the hang checker are not found anywhere in the log. Often they are, which helps in the diagnosis. So I'm assuming these must be internal functions which are just not logged. I'm wondering if that will interfere with your new tracing, or not?

Here's the ParVis case.

- Mike

----- Forwarded Message -----
From: "Michael Wilde"
To: "Sheri Mickelson"
Sent: Saturday, February 18, 2012 12:05:16 PM
Subject: Re: No events in 10s.

Hi Sheri,

A quick update: good news is that I've been able to re-create what's causing the hang in a few very tiny Swift scripts that show what's happening. I'm trying to turn those into a "how to avoid this situation" example and suggest how to change your ocean script accordingly to get around this. If I can't give you a good solution very soon, I'll send some prelim info.

Basically, if you are setting an array's elements *inside* an if() statement, you can't process the array's contents as a whole (i.e., pass it to an app) inside the same if statement block. Instead you need to process it outside the if statement, so that Swift knows that the array is "closed", i.e., completely filled.

Here's an example. I'll try to work up an example in terms of your exact code, to show you a few ways to work around this. In the meantime, I'm sending you what I have in case you want to try something on your own sooner. Another approach I think works is to fill the array in a function that returns the array as a whole object.

Sorry that it took me so long to get to this. I'll also send something on your sites.xml question for Andy for PBS.

Regards,

- Mike

com$ swift acint.works.swift
no sites file specified, setting to default: /home/wilde/swift/rev/swift-0.93RC4/etc/sites.xml
Swift svn swift-r5277 cog-r3320

RunID: 20120218-1159-o08mzyd6
Progress: time: Sat, 18 Feb 2012 11:59:46 -0600
Final status: time: Sat, 18 Feb 2012 11:59:46 -0600 Finished successfully:1
com$ swift acint.hangs.swift
no sites file specified, setting to default: /home/wilde/swift/rev/swift-0.93RC4/etc/sites.xml
Swift svn swift-r5277 cog-r3320

RunID: 20120218-1159-8kgadrq6
Progress: time: Sat, 18 Feb 2012 11:59:55 -0600
No events in 10s.

Registered futures:
int[] out Open, 2 elements, 1 listeners
----

Waiting threads:
0-1-1
----

com$ cat acint.works.swift
type file;

app (file o) echo(int i[])
{
  echo i stdout=@filename(o);
}

int out[];
file f<"out.txt">;

if ( true ) {
  foreach j in [1:2] {
    out[j] = j;
  }
}
f = echo(out);
com$ cat acint.hangs.swift
type file;

app (file o) echo(int i[])
{
  echo i stdout=@filename(o);
}

int out[];
file f<"out.txt">;

if ( true ) {
  foreach j in [1:2] {
    out[j] = j;
  }
  f = echo(out);
}

com$ diff acint.works.swift acint.hangs.swift
14a15
>   f = echo(out);
16c17
< f = echo(out);
---
>
com$ cat out.txt
1 2
com$

----- Original Message -----
> From: "Sheri Mickelson"
> To: "Michael Wilde"
> Sent: Tuesday, February 14, 2012 9:29:26 AM
> Subject: No events in 10s.
> Hi Mike,
>
> I've been trying to sort out an issue that I've been having with my
> ocean Swift code for a couple of days now and I'm stuck. Would you
> have time to give it a quick look?
>
> I've attached both my Swift file and the log file. I'm running local
> with coasters using the Swift 0.93 from the release page.
>
> Here's the exact error I'm seeing:
>
> Progress: time: Tue, 14 Feb 2012 07:51:52 -0700 Finished
> successfully:126
> No events in 10s.
>
> Registered futures:
> file[] ncl_finished Open, 12 elements, 1 listeners
> file psFileList - F/psFileList:file - Open
> file[] mocmYearlyFiles Open, 2 elements, 1 listeners
> file moctsa - F/moctsa:file - Open
> string[] psFiles Open, 0 elements, 1 listeners
> file[] mocaYearlyFiles Open, 2 elements, 1 listeners
> ----
>
> Waiting threads:
> 0-66-1
> 0-68-1-5-1
> 0-68-1-10-1
> 0-77
> 0-78
> 0-68-1-4-1
> 0-68-1-11-1
> 0-68-1-1-1
> 0-68-1-8-1
> 0-68-1-3-1
> 0-68-1-6-1
> 0-68-1-9-1
> 0-68-1-2-1
> 0-68-1-0-1
> 0-68-1-7-1
> 0-64-1
> 0-76
> ----
>
> I think there are two sections that are having problems.
>
> The first one starts at line 327.
> > I checked my _concurrent directory and I have both of the files that > ncks_var produced > yearlyFile-8e533cf7-76aa-4110-ade1-857a13f77134-64-0-0 > yearlyFile-8e533cf7-76aa-4110-ade1-857a13f77134-64-0-1 > > I also tried changing line 338 to > mocaYearlyFiles[y] = create_blank_file_File(yearlyFile); > > and I get > _concurrent/mocaYearlyFiles-f872c3e5-7574-4e51-9d69-3a33b9802725--array/ > elt-0 elt-1 > > I'm only running this on two years of data so there are two files > produced - one for each year. > > The problem is that line 341 never executes > moctsa = Record_Cat(mocaYearlyFiles); > > I have a similar problem with line 368 never executing: > moctsm = Record_Cat(mocmYearlyFiles); > > With this section everything is created and is also in _concurrent. > > The code stalls because it's waiting for the above Record_Cat calls to > continue on and start running the ncl scripts. > > Does anything pop out at you? I'm at Argonne all day today if you > want to stop by. > > Thanks, Sheri -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory ----- Original Message ----- > From: "Michael Wilde" > To: "Mihael Hategan" > Cc: "Swift Devel" > Sent: Saturday, July 14, 2012 6:29:05 AM > Subject: Re: [Swift-devel] hang checker updates > This sounds great, Mihael. Im eager to try it. > > In the meantime, can you help diagnose the specific deadlock in the > PNNL "SPH" script? The deadlock doesnt occur until several hours into > a large run on their Hopper Cray system, using complex MPI > applications, so its not easy for us (or even them) to replicate. But > it does deadlock on every run they've tried recently. > > From the Swift .log we have of one such deadlock, we determined the > variables that are the likely cause. 
Now we're trying to determine the > deadlocking statements by analyzing the source code and the .kml file, > which tells us where the partial array closes are and where the code > waits on those closes. > > One feature which I *think* would help in this debugging is a source > code listing that annotates where these closes and waits are inserted. > Would it be possible to generate that from the .kml file? (Or even a > listing that gives the lines or expressions where these events take > place?) > > The files for this problem are on the CI net at: > /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712 > > In that dir: > > - I extracted the source, hybrid.swift, from the .log. > > - open00 gives the open variables in the first hang-checker event in > the log: > egrep -i -w > 'local_output|writeDataOut|sphOutArr|sphOutNameArr|tarfile' *.log > > - Khushbu thinks that line 388 is not getting executed as expected > when the hang occurs: > (local_forward_dat, gpg_stdout, conca_dat, concb_dat, concc_dat) = > gpg(writeDataOut, h5part_files, iter, plot); > > This suggests that writeDataOut is the open variable blocking this > statement. > > Working backward, writeDataOut is declared and set at lines 356-358: > > file writeDataOut "/writeData.out")>; > trace("file writeDataOut = ", writeDataOut); > writeDataOut = writeData(sphOutNameArr); > > This in turn is blocking on open var sphOutNameArr which in turn is > possibly blocking on sphOutArr. > > In a similar deadlock we debugged in ParVis code, the script made a > reference to a full array (ie, passed an array to a function that > blocked on a complete close of the array) *within* a code block in > which the array was still open. Ie, a partial close could not execute > because the block had not completed, and the block could not complete > until the partial close was done. Im not sure this is the same > situation, but its possible. 
The script has many conditionals, which > could explain why it doesnt deadlock until long into execution. > > If we could trace all the references to the open variables, including > all partial closes and all waits on those closes, we might be able to > identify and eliminate the deadlock. > > We can clearly see the partial closes and the waits on these in the > kml, but mapping the KML to source code lines, while possible, is > tedious and manual as far as I can tell. > > Ideally, we could have a tool that does this given the source, the > kml, and the log. Im hoping your new trace code either does this or > comes close. Early next week I'll try to help Karen and Khushbu do > this, unless you can help them sooner. > > Thanks, > > - Mike > > > > > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "Swift Devel" > > Sent: Saturday, July 14, 2012 1:16:47 AM > > Subject: [Swift-devel] hang checker updates > > I think mike requested swift stack traces in the hang checker > > instead > > of > > cryptic thread ids. That's in now. > > > > Also in is a dependency loop detector in the hang checker. It > > doesn't > > detect static cycles, but ones that actually cause a hang. I'm not > > sure > > how well it works for real life situations, but I can confirm it > > works > > for simple things like a = f(b); b = f(a);. Please give it a shot. 
> > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sat Jul 14 11:04:28 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 14 Jul 2012 09:04:28 -0700 Subject: [Swift-devel] hang checker updates In-Reply-To: <944936261.20011.1342265345927.JavaMail.root@zimbra.anl.gov> References: <944936261.20011.1342265345927.JavaMail.root@zimbra.anl.gov> Message-ID: <1342281868.8830.0.camel@blabla> On Sat, 2012-07-14 at 06:29 -0500, Michael Wilde wrote: > In the meantime, can you help diagnose the specific deadlock in the PNNL "SPH" script? I can try. > The files for this problem are on the CI net at: > /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712 scp: /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712: Permission denied From wilde at mcs.anl.gov Sat Jul 14 11:33:52 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 14 Jul 2012 11:33:52 -0500 (CDT) Subject: [Swift-devel] hang checker updates In-Reply-To: <1342281868.8830.0.camel@blabla> Message-ID: <901261046.20135.1342283632936.JavaMail.root@zimbra.anl.gov> Sorry, should be readable now. 
- Mike

----- Original Message -----
> From: "Mihael Hategan"
> To: "Michael Wilde"
> Cc: "Swift Devel"
> Sent: Saturday, July 14, 2012 11:04:28 AM
> Subject: Re: [Swift-devel] hang checker updates
> On Sat, 2012-07-14 at 06:29 -0500, Michael Wilde wrote:
> > In the meantime, can you help diagnose the specific deadlock in the
> > PNNL "SPH" script?
>
> I can try.
>
> > The files for this problem are on the CI net at:
> > /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712
>
> scp: /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712: Permission
> denied

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Sat Jul 14 13:38:17 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 14 Jul 2012 11:38:17 -0700
Subject: [Swift-devel] hang checker updates
In-Reply-To: <901261046.20135.1342283632936.JavaMail.root@zimbra.anl.gov>
References: <901261046.20135.1342283632936.JavaMail.root@zimbra.anl.gov>
Message-ID: <1342291097.10921.13.camel@blabla>

The waiting threads are as follows:

0-17-84-2-66 local_output.h5part = sphOutArr
0-17-84-2-73 foreach myfile in local_output.h5part
0-17-84-2-72 output = local_output
0-17-84-2-63 gpg(local_forward_dat, gpg_stdout, conca_dat, concb_dat, concc_dat, writeDataOut, h5part_files, iter, plot)
0-17-84-2-54 trace(writeDataOut)
0-17-84-2-55 writeDataOut = writeData(sphOutNameArr)
0-17-84-2-47-4-3-7 tarfiles[i] = tarfile

54 waits on writeDataOut which waits on sphOutNameArr
55 waits on sphOutNameArr
63 waits on writeDataOut which waits on sphOutNameArr
66 waits on sphOutArr
72 waits on local_output.h5part which waits on sphOutArr
73 waits on local_output which waits on sphOutArr

sphOutNameArr and sphOutArr wait on two partial closes: 88043 and 88075
Those are the if (n > NUM_SPH_RUNS) {} (line 250) and the iterate on line 313

The first one is the problem.
In particular:
0-17-84-2-47-4-3-7 tarfiles[4] = tarfile

For some reason tarfile is open. Since it should be closed by copySph
(and all other returns of copySph are closed), I can only conclude that
it's a Swift bug.

Do you have a different run (just the log file with hang checker
triggered will do) to confirm?

Mihael

On Sat, 2012-07-14 at 11:33 -0500, Michael Wilde wrote:
> Sorry, should be readable now.
>
> - Mike
>
> ----- Original Message -----
> > From: "Mihael Hategan"
> > To: "Michael Wilde"
> > Cc: "Swift Devel"
> > Sent: Saturday, July 14, 2012 11:04:28 AM
> > Subject: Re: [Swift-devel] hang checker updates
> > On Sat, 2012-07-14 at 06:29 -0500, Michael Wilde wrote:
> > > In the meantime, can you help diagnose the specific deadlock in the
> > > PNNL "SPH" script?
> >
> > I can try.
> >
> > > The files for this problem are on the CI net at:
> > > /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712
> >
> > scp: /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712: Permission
> > denied
>

From wilde at mcs.anl.gov Sat Jul 14 13:55:26 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 14 Jul 2012 13:55:26 -0500 (CDT)
Subject: [Swift-devel] hang checker updates
In-Reply-To: <1342291097.10921.13.camel@blabla>
Message-ID: <596466755.20210.1342292126851.JavaMail.root@zimbra.anl.gov>

Wow - great analysis! Is the logic you applied here embedded in the new
trace code? (I.e., if users and Swift support folks could get this right
off the bat, that would be excellent).

I'll forward this to the PNNL folks and see if they have more logs.
Everything I got from them so far is in the dir you already have.

I was slowly going down this chain, but had no clue where to get these
thread IDs. I only looked at the first hang checker trace, whose thread
IDs I could not find in the log. How did you get all the details below?
(Don't need to answer that now - might be good to put the technique in
both a page of debugging tips and/or the automated tracer...)

Thanks!
- Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "Michael Wilde" > Cc: "Swift Devel" > Sent: Saturday, July 14, 2012 1:38:17 PM > Subject: Re: [Swift-devel] hang checker updates > The waiting threads are as follows: > > 0-17-84-2-66 local_output.h5part = sphOutArr > 0-17-84-2-73 foreach myfile in local_output.h5part > 0-17-84-2-72 output = local_output > 0-17-84-2-63 gpg(local_forward_dat, gpg_stdout, conca_dat, concb_dat, > concc_dat, writeDataOut, h5part_files, iter, plot) > 0-17-84-2-54 trace(writeDataOut) > 0-17-84-2-55 writeDataOut = writeData(sphOutNameArr) > 0-17-84-2-47-4-3-7 tarfiles[i] = tarfile > > 54 waits on writeDataOut which waits on sphOutNameArr > 55 waits on sphOutNameArr > 63 waits on writeDataOut who waits in sphOutNameArr > 66 waits on sphOutArr > 72 waits on local_output.h5part who waits on sphOutArr > 73 waits on local_output who waits on sphOutArr > > sphOutNameArr and sphOutArr wait on two partial closes: 88043 and > 88075 > Those are the if (n > NUM_SPH_RUNS) {} (line 250) and the iterate on > line 313 > > The first one is the problem. In particular: > 0-17-84-2-47-4-3-7 tarfiles[4] = tarfile > > For some reason tarfile is open. Since it should be closed by copySph > (and all other returns of copySph are closed), I can only conclude > that > it's a swift bug. > > Do you have a different run (just the log file with hang checker > triggered will do) to confirm? > > Mihael > > On Sat, 2012-07-14 at 11:33 -0500, Michael Wilde wrote: > > Sorry, should be readable now. > > > > - Mike > > > > ----- Original Message ----- > > > From: "Mihael Hategan" > > > To: "Michael Wilde" > > > Cc: "Swift Devel" > > > Sent: Saturday, July 14, 2012 11:04:28 AM > > > Subject: Re: [Swift-devel] hang checker updates > > > On Sat, 2012-07-14 at 06:29 -0500, Michael Wilde wrote: > > > > In the meantime, can you help diagnose the specific deadlock in > > > > the > > > > PNNL "SPH" script? > > > > > > I can try. 
> > > > > > > The files for this problem are on the CI net at: > > > > /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712 > > > > > > scp: /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712: > > > Permission > > > denied > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sat Jul 14 14:07:41 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 14 Jul 2012 12:07:41 -0700 Subject: [Swift-devel] hang checker updates In-Reply-To: <596466755.20210.1342292126851.JavaMail.root@zimbra.anl.gov> References: <596466755.20210.1342292126851.JavaMail.root@zimbra.anl.gov> Message-ID: <1342292861.10921.18.camel@blabla> On Sat, 2012-07-14 at 13:55 -0500, Michael Wilde wrote: > Wow - great analysis! Is the logic you applied here embedded in the new trace code? (Ie if users and Swift support folks could get this right off the bat, that would be excellent). It doesn't deal with partial closes and array analysis. I'm working on that. > > I'll forward this to the PNNL folks and see if they have more logs. Everything I got from them so far is in the dir you already have. > > I was slowly going down this chain, but had no clue where to get these > thread IDs. You start with thread 0 if you have a parallel(), then each block inside that gets a new level and a sequential id: parallel( sequential(// happens in thread 0-0 ... ) foo(b); // happens in thread 0-1 ) Foreach loops also add their own level and use a sequential id for each iteration. It's a bit of a manual work to look at the kml structure and figure out where things are. You can speed up the process when you have a compound invocation by looking at the first few levels in the thread and compare with those in the log. That should not be needed with the new stack traces, though that might need some improvement. > I only looked at the first hang checker trace, whose thread IDs I > could not find in the log. 
How did you get all the details below? > (Dont need to answer that now - might be good to put the technique in > both a page of debuggging tips and/or the automated tracer...) > > Thanks! > > - Mike > > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "Michael Wilde" > > Cc: "Swift Devel" > > Sent: Saturday, July 14, 2012 1:38:17 PM > > Subject: Re: [Swift-devel] hang checker updates > > The waiting threads are as follows: > > > > 0-17-84-2-66 local_output.h5part = sphOutArr > > 0-17-84-2-73 foreach myfile in local_output.h5part > > 0-17-84-2-72 output = local_output > > 0-17-84-2-63 gpg(local_forward_dat, gpg_stdout, conca_dat, concb_dat, > > concc_dat, writeDataOut, h5part_files, iter, plot) > > 0-17-84-2-54 trace(writeDataOut) > > 0-17-84-2-55 writeDataOut = writeData(sphOutNameArr) > > 0-17-84-2-47-4-3-7 tarfiles[i] = tarfile > > > > 54 waits on writeDataOut which waits on sphOutNameArr > > 55 waits on sphOutNameArr > > 63 waits on writeDataOut who waits in sphOutNameArr > > 66 waits on sphOutArr > > 72 waits on local_output.h5part who waits on sphOutArr > > 73 waits on local_output who waits on sphOutArr > > > > sphOutNameArr and sphOutArr wait on two partial closes: 88043 and > > 88075 > > Those are the if (n > NUM_SPH_RUNS) {} (line 250) and the iterate on > > line 313 > > > > The first one is the problem. In particular: > > 0-17-84-2-47-4-3-7 tarfiles[4] = tarfile > > > > For some reason tarfile is open. Since it should be closed by copySph > > (and all other returns of copySph are closed), I can only conclude > > that > > it's a swift bug. > > > > Do you have a different run (just the log file with hang checker > > triggered will do) to confirm? > > > > Mihael > > > > On Sat, 2012-07-14 at 11:33 -0500, Michael Wilde wrote: > > > Sorry, should be readable now. 
> > > > > > - Mike > > > > > > ----- Original Message ----- > > > > From: "Mihael Hategan" > > > > To: "Michael Wilde" > > > > Cc: "Swift Devel" > > > > Sent: Saturday, July 14, 2012 11:04:28 AM > > > > Subject: Re: [Swift-devel] hang checker updates > > > > On Sat, 2012-07-14 at 06:29 -0500, Michael Wilde wrote: > > > > > In the meantime, can you help diagnose the specific deadlock in > > > > > the > > > > > PNNL "SPH" script? > > > > > > > > I can try. > > > > > > > > > The files for this problem are on the CI net at: > > > > > /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712 > > > > > > > > scp: /home/wilde/swift/support/PNNL.SPH.deadlock.2012.0712: > > > > Permission > > > > denied > > > > From hategan at mcs.anl.gov Sat Jul 14 19:11:42 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 14 Jul 2012 17:11:42 -0700 Subject: [Swift-devel] hang checker updates In-Reply-To: <1342292861.10921.18.camel@blabla> References: <596466755.20210.1342292126851.JavaMail.root@zimbra.anl.gov> <1342292861.10921.18.camel@blabla> Message-ID: <1342311102.26932.3.camel@blabla> On Sat, 2012-07-14 at 12:07 -0700, Mihael Hategan wrote: > On Sat, 2012-07-14 at 13:55 -0500, Michael Wilde wrote: > > Wow - great analysis! Is the logic you applied here embedded in the new trace code? (Ie if users and Swift support folks could get this right off the bat, that would be excellent). > > It doesn't deal with partial closes and array analysis. I'm working on > that. Hmm, that was harder and I'm not quite sure how it will behave for large scripts (there are plenty of > O(N) things there). But I committed it if you want to try. 
It now detects things like:

foreach i in [0:4] {
    c[i] = cat(d);
}

foreach i in [5:9] {
    c[i] = gen();
}

d = mcat(c);

Dependency loop found:
    d (declared on line 21) is needed by:
        cat, many-cat.swift, line 26
        foreach, many-cat.swift, line 24

    the above must complete before the block below can complete:
        foreach, many-cat.swift, line 24
    which produces c (declared on line 20)

    c (declared on line 20) is needed by:
        mcat, many-cat.swift, line 34
    which produces d (declared on line 21)

From hategan at mcs.anl.gov Sat Jul 14 19:22:35 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 14 Jul 2012 17:22:35 -0700
Subject: [Swift-devel] misc changes
Message-ID: <1342311755.26932.13.camel@blabla>

I committed a few unrelated things:

1. No more synchronization on the handles in data nodes. There is a
one-to-one mapping between a handles map and a data node. The never
quite knowing whether to sync on the node or the handles problem (which
also caused deadlocks) was a bit annoying. So now all the
synchronization happens on the node itself.

2. The and friends were replaced by and . The latter do the logging and
also a bit of data tracking.

3. Previously, the following would fail:

int b;
foreach i in [0:100] {
    if (i == 4) {
        b = 1;
    }
}

The assumption was that an assignment to a non-array in a loop can only
mean multiple-writes. That's not quite true. This is now allowed with a
warning. If multiple writes actually occur, there will be a run-time
error.

4. There were partialCloseDataset invocations after app invocations.
Those seem useless since apps always close their returns before
returning. I removed them. Might add them back if it causes problems.
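
The rule in item 3 can be sketched outside Swift. The following is a minimal Python illustration, not Swift's implementation: the `SingleAssignment` class and `run_loop` helper are hypothetical names, and the loop assumes that the range [0:100] includes both endpoints. A scalar may be assigned inside a loop as long as only one write actually happens at run time; a second write raises an error.

```python
class SingleAssignment:
    """Hypothetical write-once cell; illustrates the rule, not Swift's internals."""

    def __init__(self, name):
        self.name = name
        self.closed = False
        self.value = None

    def set(self, value):
        # A second write is reported as a run-time error, as in item 3.
        if self.closed:
            raise RuntimeError("multiple writes to " + self.name)
        self.value = value
        self.closed = True


def run_loop(write_at):
    """Assign 1 to b for every index in write_at while looping over 0..100."""
    b = SingleAssignment("b")
    for i in range(0, 101):  # assumption: [0:100] includes both endpoints
        if i in write_at:
            b.set(1)
    return b.value
```

Calling `run_loop({4})` corresponds to the script in item 3 and succeeds, while `run_loop({4, 7})` attempts two writes and raises `RuntimeError`.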
Mihael From wilde at mcs.anl.gov Sun Jul 15 10:01:47 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 15 Jul 2012 10:01:47 -0500 (CDT) Subject: [Swift-devel] hang checker updates In-Reply-To: <1342311102.26932.3.camel@blabla> Message-ID: <540044608.20541.1342364507242.JavaMail.root@zimbra.anl.gov> Mihael, this analysis-with-messages and the subsequent changes look excellent - very exciting enhancements and improvements. We'll need to do thorough testing and add tests to exercise the deadlock detection and avoidance. But very nice work. I'll make a new trunk release for PNNL to try. - Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "Michael Wilde" > Cc: "Swift Devel" > Sent: Saturday, July 14, 2012 7:11:42 PM > Subject: Re: [Swift-devel] hang checker updates > On Sat, 2012-07-14 at 12:07 -0700, Mihael Hategan wrote: > > On Sat, 2012-07-14 at 13:55 -0500, Michael Wilde wrote: > > > Wow - great analysis! Is the logic you applied here embedded in > > > the new trace code? (Ie if users and Swift support folks could get > > > this right off the bat, that would be excellent). > > > > It doesn't deal with partial closes and array analysis. I'm working > > on > > that. > > Hmm, that was harder and I'm not quite sure how it will behave for > large > scripts (there are plenty of > O(N) things there). But I committed it > if > you want to try. 
It now detects things like:
> foreach i in [0:4] {
> c[i] = cat(d);
> }
>
> foreach i in [5:9] {
> c[i] = gen();
> }
>
> d = mcat(c);
>
> Dependency loop found:
> d (declared on line 21) is needed by:
> cat, many-cat.swift, line 26
> foreach, many-cat.swift, line 24
>
> the above must complete before the block below can complete:
> foreach, many-cat.swift, line 24
> which produces c (declared on line 20)
>
> c (declared on line 20) is needed by:
> mcat, many-cat.swift, line 34
> which produces d (declared on line 21)

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Mon Jul 16 13:37:49 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 16 Jul 2012 11:37:49 -0700
Subject: [Swift-devel] bugzilla change
Message-ID: <1342463869.7314.3.camel@blabla>

Hi,

I added the "WAITING FOR USER INPUT" state in Bugzilla. It's there so we
can distinguish between a bug the assignee is supposed to be actively
working on and one that is waiting for some external thing to happen.

Mihael

From hategan at mcs.anl.gov Tue Jul 17 12:51:50 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 17 Jul 2012 10:51:50 -0700
Subject: [Swift-devel] gt4 provider
Message-ID: <1342547510.28032.7.camel@blabla>

Is anyone still using GT4?

Mihael

From davidk at ci.uchicago.edu Tue Jul 17 13:30:59 2012
From: davidk at ci.uchicago.edu (David Kelly)
Date: Tue, 17 Jul 2012 13:30:59 -0500 (CDT)
Subject: [Swift-devel] Foreach with floats
In-Reply-To: <451213546.77108.1342549430097.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <470663924.77146.1342549859167.JavaMail.root@zimbra-mb2.anl.gov>

Hello,

Is it possible to use foreach with floating point numbers? Right now I
can do something like this using ints:

foreach i in [0:10] {
    tracef("i is %i\n", i);
}

Is it possible to do something similar with floats using some delta value?
For example, for every value between 0.0 and 10.0 in increments of 0.5?

David

From tim.g.armstrong at gmail.com Tue Jul 17 13:44:01 2012
From: tim.g.armstrong at gmail.com (Tim Armstrong)
Date: Tue, 17 Jul 2012 11:44:01 -0700
Subject: [Swift-devel] Foreach with floats
In-Reply-To: <470663924.77146.1342549859167.JavaMail.root@zimbra-mb2.anl.gov>
References: <451213546.77108.1342549430097.JavaMail.root@zimbra-mb2.anl.gov> <470663924.77146.1342549859167.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID:

Something like the following might be the easiest solution:

foreach i in [0:20] {
    float j = i * 0.5;
}

I think most languages avoid encouraging use of floats as loop variables
since it's fairly easy to run into surprising behaviour with floating
point precision, depending on the way things are implemented.

- Tim

On Tue, Jul 17, 2012 at 11:30 AM, David Kelly wrote:
> Hello,
>
> Is it possible to use foreach with floating point numbers? Right now I can
> do something like this using ints:
>
> foreach i in [0:10] {
> tracef("i is %i\n", i);
> }
>
> Is it possible to do something similar with floats using some delta value?
> For example, for every value between 0.0 and 10.0 in increments of 0.5?
>
> David

From hategan at mcs.anl.gov Tue Jul 17 13:57:14 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 17 Jul 2012 11:57:14 -0700
Subject: [Swift-devel] Foreach with floats
In-Reply-To: <470663924.77146.1342549859167.JavaMail.root@zimbra-mb2.anl.gov>
References: <470663924.77146.1342549859167.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1342551434.30338.1.camel@blabla>

Theoretically this should work:

foreach i in [0.0:10:0.2] {
}

(i.e. the first thing is a float and you specify a step).
But I seem to be getting an error when I try it.

Mihael

On Tue, 2012-07-17 at 13:30 -0500, David Kelly wrote:
> Hello,
>
> Is it possible to use foreach with floating point numbers? Right now I can do something like this using ints:
>
> foreach i in [0:10] {
> tracef("i is %i\n", i);
> }
>
> Is it possible to do something similar with floats using some delta value? For example, for every value between 0.0 and 10.0 in increments of 0.5?
>
> David

From hategan at mcs.anl.gov Tue Jul 17 13:59:02 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 17 Jul 2012 11:59:02 -0700
Subject: [Swift-devel] Foreach with floats
In-Reply-To: <1342551434.30338.1.camel@blabla>
References: <470663924.77146.1342549859167.JavaMail.root@zimbra-mb2.anl.gov> <1342551434.30338.1.camel@blabla>
Message-ID: <1342551542.30338.2.camel@blabla>

On Tue, 2012-07-17 at 11:57 -0700, Mihael Hategan wrote:
> Theoretically this should work:
>
> foreach i in [0.0:10:0.2] {
> }
>
> (i.e. the first thing is a float and you specify a step).
>
> But I seem to be getting an error when I try it.

Never mind. They all need to be floats:

foreach i in [0.0:10.0:0.2] {
}

From davidk at ci.uchicago.edu Tue Jul 17 14:06:18 2012
From: davidk at ci.uchicago.edu (David Kelly)
Date: Tue, 17 Jul 2012 14:06:18 -0500 (CDT)
Subject: [Swift-devel] Foreach with floats
In-Reply-To: <1342551542.30338.2.camel@blabla>
Message-ID: <1506462719.77437.1342551978866.JavaMail.root@zimbra-mb2.anl.gov>

Great, thanks! I'll make a note to add an example like that to the user
guide - that seems pretty useful.
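
The precision concern raised earlier in this thread can be shown with a small sketch. It is written in Python rather than Swift, the helper names `float_range` and `accumulate` are made up for illustration, and including the end point in the range is an assumption about how a stepped range behaves. Deriving each value from an integer index avoids the drift that comes from repeatedly adding the step.

```python
def float_range(start, stop, step):
    """Return start, start+step, ... with each value derived from an integer index."""
    count = int(round((stop - start) / step))
    # Assumption: the end point is included, as with the integer ranges above.
    return [start + k * step for k in range(count + 1)]


def accumulate(step, n):
    """Add step to a running total n times; rounding error builds up this way."""
    total = 0.0
    for _ in range(n):
        total += step
    return total
```

For example, `float_range(0.0, 10.0, 0.5)` yields 21 values ending exactly at 10.0, whereas `accumulate(0.1, 10)` does not compare equal to 1.0 even though `10 * 0.1` does.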
David

----- Original Message -----
> From: "Mihael Hategan"
> To: "David Kelly"
> Cc: swift-devel at ci.uchicago.edu
> Sent: Tuesday, July 17, 2012 1:59:02 PM
> Subject: Re: [Swift-devel] Foreach with floats
> On Tue, 2012-07-17 at 11:57 -0700, Mihael Hategan wrote:
> > Theoretically this should work:
> >
> > foreach i in [0.0:10:0.2] {
> > }
> >
> > (i.e. the first thing is a float and you specify a step).
> >
> > But I seem to be getting an error when I try it.
>
> nevermind. They all need to be floats:
>
> foreach i in [0.0:10.0:0.2] {
> }

From hategan at mcs.anl.gov Wed Jul 18 18:21:37 2012
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Jul 2012 16:21:37 -0700
Subject: [Swift-devel] iterate bug from fast branch
Message-ID: <1342653697.14809.1.camel@blabla>

The fast branch contained a bug that caused all errors happening within
an iterate loop to not be propagated to the user. I committed a fix to
SVN, and I think we should make a patch release of 0.93.

Btw, what happened to 0.93.1?

Mihael

From lpesce at uchicago.edu Wed Jul 25 09:01:43 2012
From: lpesce at uchicago.edu (Lorenzo Pesce)
Date: Wed, 25 Jul 2012 09:01:43 -0500
Subject: [Swift-devel] Vanilla python
Message-ID:

Dear all,

I am going to build a version of Python to be used in a test pipeline,
which will eventually become a Swift pipeline for genomics research (you
will receive a lot of messages about it in the next few weeks).

Python doesn't seem to be capable of running on the compute nodes with
more than one instance on a Cray XE6 (Beagle), as Mike told me multiple
times ;-)

My understanding is that what I need is a vanilla version which has to
be built with the gcc compiler only, without using the scripts. I plan
to strip it of multithreading too if necessary, but right now I don't
see why it should be.

I will first try as:
-install it under vanilla_python (any better name?
I would like to use the same name for all the similarly built environments, from perl to R)
-using gcc directly
-I will reinstall all relevant packages under its own tree and build them in the same way

Any other suggestions? Am I missing something?

Lorenzo

From leggett at ci.uchicago.edu Wed Jul 25 09:08:00 2012
From: leggett at ci.uchicago.edu (Ti Leggett)
Date: Wed, 25 Jul 2012 09:08:00 -0500
Subject: [Swift-devel] Vanilla python
In-Reply-To:
References:
Message-ID: <03147C55-B1FB-4261-9D38-EA7C9CBB8A49@ci.uchicago.edu>

How's that different than the python already built in /soft/python? I
don't recall doing anything out of the ordinary to build that.

On Jul 25, 2012, at 9:01 AM, Lorenzo Pesce wrote:

> Dear all,
>
> I am going to build a version of python to be used in a test pipeline, which will eventually become a swift pipeline for genomics research (you will receive a lot of messages about it in the next few weeks).
>
> Python doesn't seem to be capable of running on the compute nodes with more than one instance on a Cray XE6 (Beagle), as Mike told me multiple times ;-)
>
> My understanding is that what I need is a vanilla version which has to be built with the gcc compiler only, without using the scripts. I plan to strip it of multithreading too if necessary, but right now I don't see why it should be.
>
> I will first try as:
> -install it under vanilla_python (any better name? I would like to use the same name for all the similarly built environments, from perl to R)
> -using gcc directly
> -I will reinstall all relevant packages under its own tree and build them in the same way
>
> Any other suggestions? Am I missing something?
>
> Lorenzo
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 203 bytes Desc: Message signed with OpenPGP using GPGMail URL: From lpesce at uchicago.edu Wed Jul 25 09:25:11 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Wed, 25 Jul 2012 09:25:11 -0500 Subject: [Swift-devel] Vanilla python In-Reply-To: <03147C55-B1FB-4261-9D38-EA7C9CBB8A49@ci.uchicago.edu> References: <03147C55-B1FB-4261-9D38-EA7C9CBB8A49@ci.uchicago.edu> Message-ID: <9CE510F8-C905-4DF5-8252-4ECCA67C7B02@uchicago.edu> cc and CC add a lot of tricks for ALPS to control programs and those add too much control when Python per se has additional problems with interpreter locks and I don't know if those apply here. Mike, Glen and others know a lot more about this than I do. On Jul 25, 2012, at 9:08 AM, Ti Leggett wrote: > How's that different than the python already built in /soft/python? I don't recall doing anything out of the ordinary to build that. > > On Jul 25, 2012, at 9:01 AM, Lorenzo Pesce wrote: > >> Dear all, >> >> I am going to build a version of python to be used in a test pipeline, which will eventually become a swift pipeline for genomics research (you will receive a lot of messages about it in the next few weeks). >> >> Python doesn't seem to be capable of running on the compute notes with more than one instance on a Cray XE6 (Beagle), as Mike told me multiple times ;-) >> >> My understanding is that what I need is a vanilla version which has to be build with the gcc compiler only, without using the scripts. I plan to strip it of multithreading too if necessary, but right now I don't see why it should be. >> >> I will first try as: >> -install it under vanilla_python (any better name? I would like to use the same name for all the similarly build environments, from perl to R) >> -using gcc directly >> -I will reinstall all relevant packages under its won tree and built in the same way >> >> Any other suggestions? Am I missing something? 
>> >> Lorenzo >> > From lpesce at uchicago.edu Fri Jul 27 14:24:38 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Fri, 27 Jul 2012 14:24:38 -0500 Subject: [Swift-devel] Urgent Friday evening problem ; -) :: vanilla R on Beagle Message-ID: Has any one a version of R that works on the compute nodes and can be packed? If not, I will build one. I just convinced a user of beagle to move all her urgent analysis to swift. Please :-) Lorenzo From wilde at mcs.anl.gov Fri Jul 27 17:02:31 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 27 Jul 2012 17:02:31 -0500 (CDT) Subject: [Swift-devel] Urgent Friday evening problem ; -) :: vanilla R on Beagle In-Reply-To: Message-ID: <775022834.39835.1343426551690.JavaMail.root@zimbra.anl.gov> Hi Lorenzo, sorry, just saw your message. Try this R: login1$ pwd /home/wilde/R/beagle/R-2.13.1 login1$ ls COPYING INSTALL Makefile.fw NEWS ONEWS README VERSION config.site configure.ac etc/ m4/ share/ tests/ ChangeLog Makeconf.in Makefile.in NEWS.pdf OONEWS SVN-REVISION bin/ configure* doc/ lib64/ po/ src/ tools/ login1$ bin/R R version 2.13.1 (2011-07-08) Copyright (C) 2011 The R Foundation for Statistical Computing - Mike ----- Original Message ----- > From: "Lorenzo Pesce" > To: "Michael Wilde" > Cc: "swift-devel Devel" > Sent: Friday, July 27, 2012 2:24:38 PM > Subject: Urgent Friday evening problem ;-) :: vanilla R on Beagle > Has any one a version of R that works on the compute nodes and can be > packed? > If not, I will build one. > > I just convinced a user of beagle to move all her urgent analysis to > swift. 
Please :-)
>
>
> Lorenzo

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Fri Jul 27 17:11:18 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 27 Jul 2012 17:11:18 -0500 (CDT)
Subject: [Swift-devel] Urgent Friday evening problem ; -) :: vanilla R on Beagle
In-Reply-To: <775022834.39835.1343426551690.JavaMail.root@zimbra.anl.gov>
Message-ID: <1830544355.39845.1343427078113.JavaMail.root@zimbra.anl.gov>

Lorenzo, I think the R in /home/wilde/R/beagle/R-2.13.1 was indeed built
with module gcc, and thus should run with "node packing" (i.e., multiple
parallel copies per node).

Build notes are in /home/wilde/R/README

The build script used was: /home/wilde/R/releases/build.beagle
which does:

---
mod unload $(mod list | grep -i prg)
mod load gcc

echo Using compiler: $(mod list | grep gcc)
echo 'Checking that PrgEnv has been unloaded (should be blank!):' $(mod list | grep -i prg)

rm -rf $MYRBUILD/beagle/$REL
#mkdir -p $MYRBUILD/beagle/$REL

cd $MYRBUILD/beagle
tar zxf $MYRBUILD/$REL.tar.gz
cd $REL

# make clean
./configure --prefix=$MYRINSTALL/beagle/$REL # --enable-R-shlib
make
make install
---

- Mike

----- Original Message -----
> From: "Michael Wilde"
> To: "Lorenzo Pesce"
> Cc: "swift-devel Devel"
> Sent: Friday, July 27, 2012 5:02:31 PM
> Subject: Re: [Swift-devel] Urgent Friday evening problem ; -) :: vanilla R on Beagle
> Hi Lorenzo, sorry, just saw your message.
Try this R: > > login1$ pwd > /home/wilde/R/beagle/R-2.13.1 > login1$ ls > COPYING INSTALL Makefile.fw NEWS ONEWS README VERSION config.site > configure.ac etc/ m4/ share/ tests/ > ChangeLog Makeconf.in Makefile.in NEWS.pdf OONEWS SVN-REVISION bin/ > configure* doc/ lib64/ po/ src/ tools/ > login1$ bin/R > > R version 2.13.1 (2011-07-08) > Copyright (C) 2011 The R Foundation for Statistical Computing > > - Mike > > > ----- Original Message ----- > > From: "Lorenzo Pesce" > > To: "Michael Wilde" > > Cc: "swift-devel Devel" > > Sent: Friday, July 27, 2012 2:24:38 PM > > Subject: Urgent Friday evening problem ;-) :: vanilla R on Beagle > > Has any one a version of R that works on the compute nodes and can > > be > > packed? > > If not, I will build one. > > > > I just convinced a user of beagle to move all her urgent analysis to > > swift. Please :-) > > > > > > Lorenzo > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From lpesce at uchicago.edu Mon Jul 30 13:27:06 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Mon, 30 Jul 2012 13:27:06 -0500 Subject: [Swift-devel] Urgent Friday evening problem ; -) :: vanilla R on Beagle In-Reply-To: <1830544355.39845.1343427078113.JavaMail.root@zimbra.anl.gov> References: <1830544355.39845.1343427078113.JavaMail.root@zimbra.anl.gov> Message-ID: <1E4ACE91-7D29-444F-A84F-6D23EFDEDA2A@uchicago.edu> The good news is that for python worked as you said for the version they provide us with. We could both run multiple instances at the same time, but the environmental variables were not corrupted anymore. 
Next step: we'll test building a version and see if it does what we want.

R-2.15.1, instead of building, brings down the login nodes. Literally.
This morning I tried to build R-2.15.1, and while running ./configure the login node poofed.
Correlation is not causation, but I killed another node trying again, so I am pretty sure it is.

This is the last line printed before it evaporated:
checking whether integer division by zero raises SIGFPE...

Under investigation.

The temporary directory was set to /lustre/beagle/`whoami`, but it doesn't seem to matter whether it is /tmp or Lustre.

From wilde at mcs.anl.gov Mon Jul 30 16:20:14 2012
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 30 Jul 2012 16:20:14 -0500 (CDT)
Subject: [Swift-devel] Urgent Friday evening problem ; -) :: vanilla R on Beagle
In-Reply-To: <1E4ACE91-7D29-444F-A84F-6D23EFDEDA2A@uchicago.edu>
Message-ID: <1126635936.42363.1343683214545.JavaMail.root@zimbra.anl.gov>

Lorenzo, maybe you should try to use the R 2.13 that I already built? Or did that fail to work for you for some reason?

- Mike

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From lpesce at uchicago.edu Mon Jul 30 19:09:20 2012
From: lpesce at uchicago.edu (Lorenzo Pesce)
Date: Mon, 30 Jul 2012 19:09:20 -0500 (CDT)
Subject: [Swift-devel] Urgent Friday evening problem ; -) :: vanilla R on Beagle
In-Reply-To: <1126635936.42363.1343683214545.JavaMail.root@zimbra.anl.gov>
References: <1E4ACE91-7D29-444F-A84F-6D23EFDEDA2A@uchicago.edu> <1126635936.42363.1343683214545.JavaMail.root@zimbra.anl.gov>
Message-ID: <20120730190920.BMT04291@mstore01.uchicago.edu>

We have not gotten to use R yet; the user is getting all the rest in shape.

My plan was to use your version first, then experiment with building R. While I was waiting, I tried building it and found that unusual way of killing a computer....

I will try switches for the various kinds of exception handling to see if I can make it work without too many side effects.

---- Original message ----
>Date: Mon, 30 Jul 2012 16:20:14 -0500 (CDT)
>From: Michael Wilde
>Subject: Re: [Swift-devel] Urgent Friday evening problem ; -) :: vanilla R on Beagle
>To: Lorenzo Pesce
>Cc: swift-devel Devel
>
>Lorenzo, maybe you should try to use the R 2.13 that I already built? Or did that fail to work for you for some reason?

From iraicu at cs.iit.edu Tue Jul 31 16:46:43 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Tue, 31 Jul 2012 16:46:43 -0500
Subject: [Swift-devel] Call for Participation: IEEE eScience 2012 in Chicago, IL, October 8-12, 2012
Message-ID: <50185243.2010909@cs.iit.edu>

*Call for Participation*
*IEEE eScience 2012*
*http://www.ci.uchicago.edu/escience2012/index.php*
*October 8th-12th, 2012 -- Chicago, IL, USA*

The 8th IEEE International Conference on eScience (eScience 2012) will be held at the Hyatt Regency Chicago, Chicago, Illinois, *8-12 October 2012*, with workshops (including the Microsoft eScience Workshop) on 8-9 October and the main conference events on 10-12 October.

Scientific research is increasingly carried out by communities of researchers that span disciplines, laboratories, organizations and national boundaries. These activities involve geographically distributed and heterogeneous resources such as computational systems, scientific instruments, databases, sensors, software components, networks, and people. Such large-scale and enhanced scientific endeavors are carried out via collaborations on a global scale in which information and computing technology plays a vital role and are thus popularly termed as e-Science.

Keynote Speakers

Professor Gerhard Klimeck
Director of the Network for Computational Nanotechnology and Professor of Electrical and Computer Engineering
Purdue University

Professor Leonard Smith
Director of the Centre for the Analysis of Time Series (CATS)
London School of Economics and Political Science

Dr. Gregory Wilson
Software Carpentry
Mozilla Foundation

Professor Carole Goble
University of Manchester, UK

Workshops

eScience 2012 will include workshops on 8-9 October. In addition to the Microsoft eScience Workshop, other workshops were solicited in the Call for Workshops. Six workshops were accepted:

* 8 October: Extending High-Performance Computing Beyond its Traditional User Communities, Papers due 6 August, Contact Person/email: Sergiu Sanielevici
* 8 October: 2nd International Workshop on Analyzing and Improving Collaborative eScience with Social Networks (eSoN 12), Papers due 17 August, Contact Person/email: Kyle Chard
* 8 October: Advances in eHealth, Abstracts due 4 July, Papers due 11 July, Contact Person/email: Rossen Apostolov
* 9 October: Maintainable Software Practices in e-Science, Papers due 20 July, Contact Person/email: Neil Chue Hong and Jennifer Schopf
* 9 October: eScience Meets the Instrument, Contact Person/email: Richard Farnsworth
* 9 October am: Collaborative research using eScience infrastructure and high speed networks, Contact Person/email: Peter Hinrich

In addition, there will also be one tutorial:

* 9 October pm: Big Data Processing: Lessons from Industry and Applications in Science, Contact Person/email: Roger Barga

-- 
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================
=================================================================