[Swift-commit] r5510 - SwiftApps/SciColSim
wilde at ci.uchicago.edu
Mon Jan 23 12:19:35 CST 2012
Author: wilde
Date: 2012-01-23 12:19:35 -0600 (Mon, 23 Jan 2012)
New Revision: 5510
Added:
SwiftApps/SciColSim/EMAIL
Modified:
SwiftApps/SciColSim/README
Log:
Add notes from email discussions.
Added: SwiftApps/SciColSim/EMAIL
===================================================================
--- SwiftApps/SciColSim/EMAIL (rev 0)
+++ SwiftApps/SciColSim/EMAIL 2012-01-23 18:19:35 UTC (rev 5510)
@@ -0,0 +1,474 @@
+==== Email trail (chronological order):
+
+---------------------------------------------------------------------------------------
+
+On 10/15/11 8:30 PM, "Michael Wilde" <wilde at mcs.anl.gov> wrote:
+
+> Hi Andrey,
+>
+> I've got a basic serial version of optimizer.cpp running on Beagle (on a login
+> node). Now I'm trying to parallelize it there, and have some questions:
+>
+> 1) You set NWorkers to a constant, 24. In multi_loss, you have this code:
+>
+> for(int i=0; i<Nworkers; i++){
+> for(int j=0; j<5; j++){
+> un[i]->set_parameter(params[j],j);
+> }
+> for(int i=0; i<Nworkers; i++){
+> dispatch_group_async(group, CustomQueues[i], ^{
+> un[i]->evolve_to_target_and_save(istart, iend, Results,
+> Counters);
+> });
+> istart += step;
+> iend = min(istart+step,N);
+> }
+> }
+>
+> Can you explain the intention here? I think the innermost loop is clear: run
+> evolve() 24 times in parallel, partitioning the istart..iend range among the
+> 24 workers. But I don't understand the outermost loop, which seems to do the
+> entire inner loop 24 (NWorkers) times. I can understand the idea of doing the
+> entire inner loop some number of times. But from the above, I presume that
+> evolve would be run NWorkers^2 times, or 24*24 times. Was that the intention?
+>
+> 2) If you had many processors available (as you do on Beagle) would you want
+> to set NWorkers higher? I think you mentioned something in our discussion
+> Friday about "a thousand" workers being useful. Ie, NWorkers for the innermost
+> loop could be 1000. Was that really what you meant? What would be a
+> mathematically/scientifically useful value for the NWorkers in the innermost
+> loop?
+>
+> Further: if you do want an NWorkers >> 24, would you still do the evolve
+> NWorkers^2 times? I dont think you'd really want 1000*1000 = 1M evolve calls,
+> given how many outer loops you have at higher levels of the code, including in
+> the Python wrapper.
+>
+> 3) We'll need to discuss what values make sense for the outer loops
+> (multi_annealing and the .py wrapper) once I have this working in parallel.
+>
+> 4) Can you give me a sentence or two about what this code is doing? I think I
+> understand the outer annealing logic, but I have not dug into the code within
+> evolve...() at all. I think you might have explained it once back in January
+> (before you recoded in C) but I don't recall. If you have a paper or a web page
+> on what you're computing here, that would be interesting for me to read, and to
+> help write a slide on this for the Beagle review.
+
+
+-----
+
+
+From : andrey rzhetsky <arzhetsk at medicine.bsd.uchicago.edu>
+Subject : Re: Status and questions on optimizer code
+To : Michael Wilde <wilde at mcs.anl.gov>
+Date : Sun, Oct 16, 2011 08:28 AM
+Hi Mike,
+
+1. I think you just uncovered a stupid bug on my part -- thank you! There
+should be only one loop (the outer one).
+2. Yes, of course -- 1000, or even 10000 (I can increase the number of
+repeats then).
+3. OK
+4. The code simulates exploration of a chemical network by a research
+community. The five major parameters determine the strategy of exploration,
+"target" is the number of new interactions discovered and the loss is the
+number of experiments per one new positive finding. I can provide you with
+figures and slides, if that would help.
+
+Thank you!
+
+With kind regards,
+
+Andrey
+
+
+---------------------------------------------------------------------------------------
+
+
+On 10/16/11 9:13 AM, "Michael Wilde" <wilde at mcs.anl.gov> wrote:
+
+> Hi Andrey,
+>
+> Looking deeper, I think the bug was simply that the first for() statement
+> should have enclosed just the parameter setting loop. In other words, the top
+> of multi_loss should start with:
+>
+> for(int i=0; i<Nworkers; i++){
+> for(int j=0; j<5; j++){
+> un[i]->set_parameter(params[j],j);
+> }
+> }
+>
+> Then the real work is done by the next loop:
+> for(i=0; i<Nworkers; i++){
+>
+> Can you confirm that this looks correct to you?
+>
+> I made that change, and the code now seems to run as I would expect. I will
+> send you some output as soon as I clean up my debugging output.
+>
+> Next, I made the code run with 24-way parallelism on a Beagle login node using
+> "OpenMP", simply by adding one "pragma" statement in front of the main worker
+> loop above. So that part of the code now looks like this:
+>
+> int i;
+> #pragma omp parallel for private (i)
+> for(i=0; i<Nworkers; i++){
+>
+> and each call to evolve...() is now done in parallel (with all the Mac
+> dispatch statements commented out). I will test, but I *think* that the same
+> code will run just as well in parallel on your multicore Macs, perhaps just a
+> *tiny* bit slower than under Grand Central Dispatch (likely not a noticeable
+> difference).
+>
+> Now, we have 2 choices:
+>
+> 1) I can simply replace the Python driver script with a Swift script, to do
+> many runs of the optimizer in parallel. That would give you the ability to
+> run *many* 24-core optimization runs in parallel, each using 24 cores. So for
+> example, in your current Python script you do this:
+>
+> for target in range(58,1009,50):
+> for i in range(15):
+>
+> So that's about 20 x 15 = 300 invocations of optimizer. I *think* that each of
+> these runs is totally independent and can run in parallel, correct?
+>
+> So a simple Swift script not much longer than the Python script, along with a
+> few beagle-specific configuration files, will enable all 300 jobs to run in
+> parallel, giving you 300 x 24 (=7200) cores running in parallel. Of course,
+> you can seldom *get* that many cores because the machine is heavily loaded.
+> But you may be able to get 10-30 nodes on a daily basis. We'll need to
+> experiment with this.
+>
+> As a *next* step after that, we should consider the benefits of changing the value
+> of NWorkers to > 24. 24 is the "easy" limit on Beagle because we can get
+> 24-way parallelism on each node with just that one "pragma" statement. We can
+> get much greater parallelism with Swift in the inner loop, but for that we
+> need to break up the program a bit more, to have Swift run the inner loop as a
+> separate program, and then return the aggregated results in a file. Even for
+> this option, there are two alternative methods:
+>
+> - we make optimizer call Swift once for each round of parallel annealing. This
+> is fairly easy. It is somewhat limiting to overall parallelism, in that only
+> one round at a time can run. But it may be quite adequate.
+>
+> - we break the program up further into parallelizable chunks, in which case
+> you have a lot of flexibility and the work always gets done in a near-optimal
+> manner regardless of the shape of a given optimization run (in terms of the
+> various nested loop sizes and evolve() execution times).
+>
+> I think we'll need to discuss this in person over a whiteboard, but I think I
+> have enough knowledge of the program to at least show you a few alternatives.
+>
+> The main question at the moment, I think, is simply to understand the
+> math/science benefits of extending NWorkers beyond the "low hanging fruit"
+> limit of 24. What is your assessment of that benefit, Andrey?
+>
+
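+[Editor's note: a minimal, self-contained sketch of the corrected multi_loss()
+structure described in the message above: the parameter-setting loop closed
+before the worker loop, and the Grand Central Dispatch calls replaced by a
+single OpenMP pragma. Only the loop shapes and the pragma come from the
+emails; the Universe stub, the parameter values, the Results/Counters types,
+and the per-worker istart/iend arithmetic are placeholders for illustration.
+Compile with: g++ -fopenmp sketch.cpp]
+
+    #include <algorithm>
+    #include <cstdio>
+    using std::min;
+
+    struct Universe {                      // stand-in for the real class
+        double p[5];
+        void set_parameter(double v, int j) { p[j] = v; }
+        void evolve_to_target_and_save(int istart, int iend,
+                                       double* Results, int* Counters) {
+            // placeholder work: the real method runs the evolve() simulation
+            for (int k = istart; k < iend; k++) { Results[k] = p[0]; Counters[k] = 1; }
+        }
+    };
+
+    int main() {
+        const int Nworkers = 24, N = 1000;
+        const int step = (N + Nworkers - 1) / Nworkers;   // repeats per worker
+        double params[5] = {1.0, 2.0, 3.0, 4.0, 5.0};     // placeholder values
+        static double Results[N];
+        static int Counters[N];
+        Universe* un[Nworkers];
+        for (int i = 0; i < Nworkers; i++) un[i] = new Universe();
+
+        // Corrected: the parameter-setting loop is closed before the worker
+        // loop, so evolve runs Nworkers times rather than Nworkers^2 times.
+        for (int i = 0; i < Nworkers; i++)
+            for (int j = 0; j < 5; j++)
+                un[i]->set_parameter(params[j], j);
+
+        // One OpenMP pragma replaces the dispatch_group_async() calls.  The
+        // email declares i outside the loop and uses private(i); declaring it
+        // in the for statement is equivalent.  Deriving istart/iend from i
+        // keeps every iteration independent, as "parallel for" requires.
+        #pragma omp parallel for
+        for (int i = 0; i < Nworkers; i++) {
+            int istart = i * step;
+            int iend   = min(istart + step, N);
+            if (istart < N)
+                un[i]->evolve_to_target_and_save(istart, iend, Results, Counters);
+        }
+
+        for (int i = 0; i < Nworkers; i++) delete un[i];
+        printf("ran %d repeats across %d workers\n", N, Nworkers);
+        return 0;
+    }
+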
+-----
+
+----- Forwarded Message -----
+From: "andrey rzhetsky" <arzhetsk at medicine.bsd.uchicago.edu>
+To: "Michael Wilde" <wilde at mcs.anl.gov>
+Sent: Sunday, October 16, 2011 12:08:25 PM
+Subject: Re: Status and questions on optimizer code
+
+Mike,
+
+It would be fantastic to have 1000 or 10000 workers (with a larger number of
+re-runs -- it would improve the precision of my analysis drastically!).
+
+All the very best,
+
+Andrey
+
+---------------------------------------------------------------------------------------
+
+On 10/17/11 8:29 AM, "Michael Wilde" <wilde at mcs.anl.gov> wrote:
+
+> Hi Andrey,
+>
+> Can we meet today to discuss the optimizer? I'd like to show you what I've done
+> and discuss with you next steps towards getting you running on Beagle. I can
+> meet any time from 10:30 to 3:00.
+>
+> Do you already have a CI and Beagle login and a project set up for Beagle use?
+> If not, we should get that started.
+>
+> On the technical side, I have a question about the typical shape of your
+> optimization runs.
+>
+> With the sample Python script you gave me, I think we have the following
+> nested iterations in the code:
+>
+> 20 targets (parallel)
+> 15 repeats (parallel)
+> 100 Annealing_cycles (serial)
+> 6 repeats (serial)
+> 1000 to 10000 annealing_repeats (parallel)
+> evolve()
+>
+> The main question I have at this point is regarding the strategy for
+> increasing the innermost annealing repeats (currently 1,000 divided among 24
+> workers; desired to increase to 10,000).
+>
+> The outermost loops in my Swift tests are done in parallel. Thus we can have
+> 300 optimizations going in parallel and 24 annealings in parallel within each,
+> for a total of 7,200 parallel tasks.
+>
+> The question is whether you will always have a sizeable number of parallel
+> iterations in the outer loops; if so, we don't need to change anything in the
+> inner loop to get more parallelism. In other words, we already have more
+> parallelism than we have CPUs available.
+>
+> 7200 CPUs is about 42% of the overall Beagle system. It will be very rare
+> that we could get that many cores all at once. But I think we can regularly
+> get, say, 500 to 2000 cores on a daily basis.
+>
+> On the other hand, if you expect to regularly run tests of *single* annealing
+> cycles and want to speed those up, then indeed it may be worth changing the
+> code structure.
+>
+> When we meet I'll try to give you an idea of what's involved. Basically we need
+> to change the structure of the annealing loop to create a function
+> "multi_loss_setup" as a separate executable which defines the annealing
+> parameters and writes them to a file; make multi_loss a separate executable;
+> create another executable "multi_loss_summarize" which reduces the results.
+> We can probably combine multi_loss_summarize into multi_loss_setup.
+>
+> This is not very hard to do, but still sounds to me like a week of programming
+> to get it all restructured and tested. Before investing that effort, we
+> should discuss if it will give you any additional performance gains over just
+> running many optimizations in parallel.
+>
+> I need to run timings on the annealing cycles to see how they change across
+> the parameter space, and whether we can just increase the repeats to 10,000 with
+> no changes to the code. I think the feasibility of doing this the "easy way"
+> is based on how long the longest annealings take at the high end of the
+> parameter space.
+>
+> Regards,
+>
+> - Mike
+
+-----
+
+----- Forwarded Message -----
+From: "Andrey Rzhetsky" <arzhetsk at medicine.bsd.uchicago.edu>
+To: "Michael Wilde" <wilde at mcs.anl.gov>
+Sent: Monday, October 17, 2011 8:40:17 AM
+Subject: Re: Meet today to discuss optimizer?
+
+Hi Mike,
+
+The 6 (or more) annealing repeats can be run in parallel too.
+
+Unfortunately, around 10:15 I have to rush to Evanston to a CBC meeting for
+the rest of the day (we can chat before, if you have a minute, I am in my
+office).
+
+I don't have a Beagle login, unfortunately.
+
+Typically, I will have a sizeable outer loop, so, probably, the current
+24-worker setup is fine.
+
+Thank you very much for helping me out!
+
+All the best,
+
+Andrey
+
+
+---------------------------------------------------------------------------------------
+
+On 10/18/11 1:52 PM, "Michael Wilde" <wilde at mcs.anl.gov> wrote:
+
+> Hi Andrey,
+>
+> Here's a quick update:
+>
+> - I am now running the optimizer on Beagle compute nodes under Swift.
+>
+> I attach a few tar files of sample runs at reduced parameter values (to shrink
+> the run time for debugging and learning the code's behavior);
+>
+> Now I'm trying to run some subset of the full-length parameters you gave me in
+> the Python file. I've got 3 Beagle compute nodes allocated at the moment (72
+> cores total) and I'm seeing these times from multi_loss with N=1000 repeats:
+>
+> sandbox$ grep multi_ ./jobs/*/*/output/*.out
+> ./jobs/0/optimizer-01p7lhhk/output/T408.R1.out:multi_loss(N=1000) elapsed
+> time: 122.742 seconds 2.04571 minutes
+> ./jobs/0/optimizer-01p7lhhk/output/T408.R1.out:multi_loss(N=1000) elapsed
+> time: 123.979 seconds 2.06631 minutes
+> ./jobs/0/optimizer-01p7lhhk/output/T408.R1.out:multi_loss(N=1000) elapsed
+> time: 123.624 seconds 2.0604 minutes
+> ./jobs/t/optimizer-t0p7lhhk/output/T958.R1.out:multi_loss(N=1000) elapsed
+> time: 1431.09 seconds 23.8514 minutes
+> ./jobs/x/optimizer-x0p7lhhk/output/T708.R1.out:multi_loss(N=1000) elapsed
+> time: 627.074 seconds 10.4512 minutes
+> ./jobs/x/optimizer-x0p7lhhk/output/T708.R1.out:multi_loss(N=1000) elapsed
+> time: 790.652 seconds 13.1775 minutes
+>
+>
+> Each run of the optimizer goes to a file named T(target).R(repeat).out
+>
+> So we're seeing 23.8 mins for 1000 repeats at target=958 and 10-13 mins at
+> target=708. The 1000 repeats are spread over 24 cores each.
+>
+> What's your time availability later in the week to discuss this further, and to
+> see if either (a) I can show you how to run this version or (b) we can get a
+> set of production run descriptions from you and you can run them yourself?
+>
+> In the compressed tar file at http://www.ci.uchicago.edu/~wilde/AR.snap.01.tgz
+> you will find:
+>
+> - the swift script that I use instead of the Python driver to run the
+> optimizer in parallel (along with beagle.xml, which specifies scheduler parameters
+> for Beagle such as time, cores, queue name, and project ID)
+>
+> - the slightly modified version of optimizer (changes in multi_loss() to
+> correct the loops, changes to use OpenMP instead of Grand Central Dispatch,
+> and a few changes in output logging).
+>
+> - a few run directories of runs with shortened parameter settings.
+>
+> If we continue working together on this, we should set up a way to share code
+> using a repository like Subversion (svn). That's pretty easy once you master a
+> few basic commands.
+>
+> Regards,
+>
+> - Mike
+>
+>
+
+-----
+
+----- Forwarded Message -----
+From: "Andrey Rzhetsky" <arzhetsk at medicine.bsd.uchicago.edu>
+To: "Michael Wilde" <wilde at mcs.anl.gov>
+Sent: Tuesday, October 18, 2011 3:57:36 PM
+Subject: Re: Meet today to discuss optimizer?
+
+Mike,
+
+Thank you! Are you around now? I would be also happy to carve some time
+tomorrow, if this works for you.
+
+With kind regards,
+
+Andrey
+
+
+---------------------------------------------------------------------------------------
+
+
+On 10/19/11 12:10 PM, "Michael Wilde" <wilde at mcs.anl.gov> wrote:
+
+> Hi Andrey,
+>
+> I'm in meetings today till about 3 PM. Are you available at, say, 3:30 or later?
+>
+> I did a larger run last night. Only one smaller optimizer run *fully*
+> finished, but many others made significant progress. The results are at:
+>
+> http://www.ci.uchicago.edu/~wilde/AR.optimizer.out.2010.1018.tgz
+>
+> If you have time, could you take a look at that run and see if the
+> optimizations look like they have been running as expected? I've made only a
+> few cosmetic changes to your debug output.
+>
+> I submitted the run at 21:20; it started running at about 21:27; by about
+> 23:10 it had acquired 12 nodes (x 24 cores each). It ended about 23:18 when
+> the first job exceeded its time limit of 5 hours. I'm still trying to calibrate
+> how much time each optimizer invocation needs, and whether some of the
+> internal iterations can be further spread out. Also how to organize the run
+> so that optimizations that time out can be re-run with the smallest
+> reasonable failure unit.
+>
+>
+
+---------------------------------------------------------------------------------------
+
+
+----- Forwarded Message -----
+From: "andrey rzhetsky" <arzhetsk at medicine.bsd.uchicago.edu>
+To: "Michael Wilde" <wilde at mcs.anl.gov>
+Sent: Wednesday, October 26, 2011 8:40:21 PM
+Subject: Re: Question on inner annealing loop
+
+Mike,
+
+
+> I'm confused on 3 points here:
+>
+> - the inner loop would always be done between 1 and 5 times, right?
+
+Correct.
+
+> - could each of those times really be done in parallel? (I'll try to determine
+> this by inspection).
+
+Not really -- the acceptance of parameter changes depends on the loss in
+between.
+
+> - when we last met in your office, I *thought* you indicated that this inner
+> loop could be done just *once*. Was that what you meant? And if so, for
+> which of the 5 vars?
+
+Nope, has to be repeated over and over.
+
+All the very best,
+
+Andrey
+
+
+---------------------------------------------------------------------------------------
+
+
+On 10/26/11 10:42 PM, "Michael Wilde" <wilde at mcs.anl.gov> wrote:
+
+> OK, all that makes sense, Andrey. But then do you recall what you suggested
+> when we met?
+>
+> Let's label the loops as follows:
+>
+> a) 20 targets (parallel)
+> b) 15 repeats (parallel)
+> c) 100 Annealing_cycles (serial)
+> d) 6 repeats (serial)
+> e) 1000 to 10000 annealing_repeats (parallel)
+> f) evolve()
+>
+> What I recalled from our last discussion was that I should reduce loop (c)
+> from 100 to 50 or 25, and loop (d) to 1. But since reducing loop (d) doesn't
+> make sense, do you recall suggesting any other reduction?
+>
+> If not, no problem, I think I know how to proceed.
+>
+> Thanks,
+>
+> - Mike
+>
+>
+-----
+
+----- Forwarded Message -----
+From: "andrey rzhetsky" <arzhetsk at medicine.bsd.uchicago.edu>
+To: "Michael Wilde" <wilde at mcs.anl.gov>
+Sent: Thursday, October 27, 2011 2:54:06 AM
+Subject: Re: Question on inner annealing loop
+
+Hi Mike,
+
+I suggested reducing (b) to 1.
+
+With kind regards,
+
+Andrey
+
+
+
Modified: SwiftApps/SciColSim/README
===================================================================
--- SwiftApps/SciColSim/README 2012-01-23 18:18:10 UTC (rev 5509)
+++ SwiftApps/SciColSim/README 2012-01-23 18:19:35 UTC (rev 5510)
@@ -1,13 +1,34 @@
+=== Overview
+
+The code simulates the exploration of a chemical network by a research
+community. The five major parameters determine the strategy of
+exploration; "target" is the number of new interactions discovered, and
+"loss" is the number of experiments per one new positive finding.
+
=== Files
+Code:
+
optirun.swift: replaces top-level py script for the outermost loops
+Data:
+
+
=== How to build
+make
+
=== How to Run
+./test0.sh
+./testO.sh
+./test1.sh
+
+./*.py
+
+
=== C++ app flow logic ===
for target in range(58, 1009 (used 209), 50): // 20 values
@@ -36,7 +57,9 @@
1 initial multi_loss: 1000 to 10000 annealing_repeats
100 Annealing_cycles (groups of 10? : cycle=10 ) (fast:50)
5 repeats (fast: 1)
- multi_loss: 1000 to 10000 annealing_repeats
+ multi_loss: 1000 to 10000 evolve re-runs
evolve() => 2 mins to 10 mins
=== END
+
+