[Swift-commit] r5510 - SwiftApps/SciColSim

wilde at ci.uchicago.edu
Mon Jan 23 12:19:35 CST 2012


Author: wilde
Date: 2012-01-23 12:19:35 -0600 (Mon, 23 Jan 2012)
New Revision: 5510

Added:
   SwiftApps/SciColSim/EMAIL
Modified:
   SwiftApps/SciColSim/README
Log:
Add notes from email discussions.

Added: SwiftApps/SciColSim/EMAIL
===================================================================
--- SwiftApps/SciColSim/EMAIL	                        (rev 0)
+++ SwiftApps/SciColSim/EMAIL	2012-01-23 18:19:35 UTC (rev 5510)
@@ -0,0 +1,474 @@
+==== Email trail (chronological order):
+
+---------------------------------------------------------------------------------------
+
+On 10/15/11 8:30 PM, "Michael Wilde" <wilde at mcs.anl.gov> wrote:
+
+> Hi Andrey,
+> 
+> I've got a basic serial version of optimizer.cpp running on Beagle (on a login
+> node).  Now I'm trying to parallelize it there, and have some questions:
+> 
+> 1) You set NWorkers to a constant, 24. In multi_loss, you have this code:
+> 
+>     for(int i=0; i<Nworkers; i++){
+>         for(int j=0; j<5; j++){
+>             un[i]->set_parameter(params[j],j);
+>         }
+>         for(int i=0; i<Nworkers; i++){
+>             dispatch_group_async(group, CustomQueues[i], ^{
+>                 un[i]->evolve_to_target_and_save(istart, iend, Results,
+> Counters);
+>             });
+>             istart += step;
+>             iend = min(istart+step,N);
+>         }
+>     }
+> 
+> Can you explain the intention here? I think the innermost loop is clear: run
+> evolve() 24 times in parallel, partitioning the istart..iend range among the
+> 24 workers.  But I don't understand the outermost loop, which seems to do the
+> entire inner loop 24 (NWorkers) times. I can understand the idea of doing the
+> entire inner loop some number of times. But from the above, I presume that
+> evolve would be run NWorkers^2 times, or 24*24 times.  Was that the intention?
+> 
+> 2) If you had many processors available (as you do on Beagle) would you want
+> to set NWorkers higher? I think you mentioned something in our discussion
+> Friday about "a thousand" workers being useful. I.e., NWorkers for the innermost
+> loop could be 1000.  Was that really what you meant? What would be a
+> mathematically/scientifically useful value for the NWorkers in the innermost
+> loop?
+> 
+> Further: if you do want an NWorkers >> 24, would you still do the evolve
+> NWorkers^2 times?  I don't think you'd really want 1000*1000 = 1M evolve calls,
+> given how many outer loops you have at higher levels of the code, including in
+> the Python wrapper.
+> 
+> 3) We'll need to discuss what values make sense for the outer loops
+> (multi_annealing and the .py wrapper) once I have this working in parallel.
+> 
+> 4) Can you give me a sentence or two about what this code is doing?  I think I
+> understand the outer annealing logic, but I have not dug into the code within
+> evolve...() at all. I think you might have explained it once back in January
+> (before you recoded in C) but I don't recall. If you have a paper or a web page
+> on what you're computing here, that would be interesting for me to read, and to
+> help write a slide on this for the Beagle review.
+
+
+-----
+
+
+From: andrey rzhetsky <arzhetsk at medicine.bsd.uchicago.edu>
+To: Michael Wilde <wilde at mcs.anl.gov>
+Sent: Sun, Oct 16, 2011 08:28 AM
+Subject: Re: Status and questions on optimizer code
+Hi Mike,
+
+1. I think you just uncovered a stupid bug on my part -- thank you!  There
+should be only one loop (the outer one).
+2. Yes, of course -- 1000, or even 10000 (I can increase the number of
+repeats then).
+3. OK
+4. The code simulates exploration of a chemical network by a research
+community.  The five major parameters determine the strategy of exploration;
+"target" is the number of new interactions discovered, and the loss is the
+number of experiments per one new positive finding.  I can provide you with
+figures and slides, if that would help.
+
+Thank you!
+
+With kind regards,
+
+Andrey
+
+
+---------------------------------------------------------------------------------------
+
+
+On 10/16/11 9:13 AM, "Michael Wilde" <wilde at mcs.anl.gov> wrote:
+
+> Hi Andrey,
+> 
+> Looking deeper, I think the bug was simply that the first for() statement
+> should have enclosed just the parameter setting loop.  In other words, the top
+> of multi_loss should start with:
+> 
+>     for(int i=0; i<Nworkers; i++){
+>         for(int j=0; j<5; j++){
+>             un[i]->set_parameter(params[j],j);
+>         }
+>     }
+> 
+> Then the real work is done by the next loop:
+>     for(i=0; i<Nworkers; i++){
+> 
+> Can you confirm that this looks correct to you?
+> 
+> I made that change, and the code now seems to run as I would expect. I will
+> send you some output as soon as I clean up my debugging output.
+> 
+> Next, I made the code run with 24-way parallelism on a Beagle login node using
+> "OpenMP", simply by adding one "pragma" statement in front of the main worker
+> loop above. So that part of the code now looks like this:
+> 
+>     int i;
+>     #pragma omp parallel for private (i)
+>     for(i=0; i<Nworkers; i++){
+> 
+> and each call to evolve...() is now done in parallel (with all the Mac
+> dispatch statements commented out).  I will test, but I *think* that the same
+> code will run just as well in parallel on your multicore Macs, perhaps just a
+> *tiny* bit slower than under Grand Central Dispatch (likely not a noticeable
+> difference).
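+> 
+> For reference, here is a minimal sketch of the whole worker loop with that
+> pragma in place. The per-worker istart/iend arithmetic below is my
+> assumption of how the range gets partitioned; the other names (Nworkers,
+> un, step, N, Results, Counters) are the existing variables in multi_loss():
+> 
+>     int i;
+>     #pragma omp parallel for private(i)
+>     for (i = 0; i < Nworkers; i++) {
+>         // Each iteration derives its own slice of the 0..N range from i,
+>         // so the iterations are independent and can run concurrently.
+>         int istart = i * step;
+>         int iend   = min(istart + step, N);
+>         un[i]->evolve_to_target_and_save(istart, iend, Results, Counters);
+>     }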
+> 
+> Now, we have 2 choices:
+> 
+> 1) I can simply replace the Python driver script with a Swift script, to do
+> many runs of the optimizer in parallel.  That would give you the ability to
+> run *many* optimization runs in parallel, each using 24 cores.  So for
+> example, in your current Python script you do this:
+> 
+> for target in range(58,1009,50):
+>   for i in range(15):
+> 
+> So that's about 20 x 15 = 300 invocations of optimizer. I *think* that each of
+> these runs is totally independent and can run in parallel, correct?
+> 
+> So a simple Swift script not much longer than the Python script, along with a
+> few Beagle-specific configuration files, will enable all 300 jobs to run in
+> parallel, giving you 300 x 24 (=7200) cores running in parallel. Of course,
+> you can seldom *get* that many cores because the machine is heavily loaded.
+> But you may be able to get 10-30 nodes on a daily basis.  We'll need to
+> experiment with this.
+> 
+> As a *next* step, we should consider the benefits of changing the value
+> of NWorkers to > 24.  24 is the "easy" limit on Beagle because we can get
+> 24-way parallelism on each node with just that one "pragma" statement.  We can
+> get much greater parallelism with Swift in the inner loop, but for that we
+> need to break up the program a bit more, to have Swift run the inner loop as a
+> separate program, and then return the aggregated results in a file. Even for
+> this option, there are two alternative methods:
+> 
+> - we make optimizer call Swift once for each round of parallel annealing. This
+> is fairly easy. It is somewhat limiting to overall parallelism, in that only
+> one round at a time can run. But it may well be adequate.
+> 
+> - we break the program up further into parallelizable chunks, in which case
+> you have a lot of flexibility and the work always gets done in a near-optimal
+> manner regardless of the shape of a given optimization run (in terms of the
+> various nested loop sizes and evolve() execution times).
+> 
+> I think we'll need to discuss this in person over a whiteboard, but I think I
+> have enough knowledge of the program to at least show you a few alternatives.
+> 
+> The main question at the moment, I think, is simply to understand the
+> math/science benefits of extending NWorkers beyond the "low hanging fruit"
+> limit of 24.  What is your assessment of that benefit, Andrey?
+> 
+
+-----
+
+----- Forwarded Message -----
+From: "andrey rzhetsky" <arzhetsk at medicine.bsd.uchicago.edu>
+To: "Michael Wilde" <wilde at mcs.anl.gov>
+Sent: Sunday, October 16, 2011 12:08:25 PM
+Subject: Re: Status and questions on optimizer code
+
+Mike,
+
+It would be fantastic to have 1000 or 10000 workers (with a larger number of
+re-runs -- it would improve the precision of my analysis drastically!).
+
+All the very best,
+
+Andrey
+
+---------------------------------------------------------------------------------------
+
+On 10/17/11 8:29 AM, "Michael Wilde" <wilde at mcs.anl.gov> wrote:
+
+> Hi Andrey,
+> 
+> Can we meet today to discuss the optimizer? I'd like to show you what I've done
+> and discuss with you next steps towards getting you running on Beagle. I can
+> meet any time from 10:30 to 3:00.
+> 
+> Do you already have a CI and Beagle login and a project set up for Beagle use?
+> If not, we should get that started.
+> 
+> On the technical side, I have a question about the typical shape of your
+> optimization runs.
+> 
+> With the sample Python script you gave me, I think we have the following
+> nested iterations in the code:
+> 
+> 20 targets (parallel)
+>   15 repeats (parallel)
+>     100 Annealing_cycles (serial)
+>        6 repeats (serial)
+>          1000 to 10000 annealing_repeats (parallel)
+>            evolve()
+> 
+> The main question I have at this point is regarding the strategy for
+> increasing the innermost annealing repeats (currently 1,000 divided among 24
+> workers; desired to increase to 10,000).
+> 
+> The outermost loops in my Swift tests are done in parallel. Thus we can have
+> 300 optimizations going in parallel and 24 annealings in parallel for a total
+> of 7,200 parallel tasks.
+> 
+> The question is whether you will always have a sizeable number of parallel
+> iterations in the outer loops. If so, we don't need to change anything in the
+> inner loop to get more parallelism; we already have more parallelism than we
+> have CPUs available.
+> 
+> 7200 CPUs is about 42% of the overall Beagle system.  It will be very rare
+> that we could get that many cores all at once.  But I think we can regularly
+> get, say, 500 to 2000 cores on a daily basis.
+> 
+> On the other hand, if you expect to regularly run tests of *single* annealing
+> cycles and want to speed those up, then indeed it may be worth changing the
+> code structure.
+> 
+> When we meet I'll try to give you an idea of what's involved. Basically we need
+> to change the structure of the annealing loop to create a function
+> "multi_loss_setup" as a separate executable which defines the annealing
+> parameters and writes them to a file; make multi_loss a separate executable;
+> create another executable "multi_loss_summarize" which reduces the results.
+> We can probably combine multi_loss_summarize into multi_loss_setup.
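+> 
+> To make that concrete, here is a purely hypothetical sketch of what the
+> standalone multi_loss executable could look like (the file formats, the
+> command-line interface, and run_one_evolve are invented for illustration,
+> not the actual interfaces):
+> 
+>     #include <cstdio>
+>     #include <vector>
+> 
+>     // Stand-in for one evolve_to_target_and_save() run returning its loss.
+>     double run_one_evolve(const double params[5]);
+> 
+>     int main(int argc, char** argv) {
+>         if (argc != 3) {
+>             std::fprintf(stderr, "usage: multi_loss params.in results.out\n");
+>             return 1;
+>         }
+>         // Read the five strategy parameters and the repeat count
+>         // written by multi_loss_setup.
+>         double params[5];
+>         int nrepeats = 0;
+>         std::FILE* in = std::fopen(argv[1], "r");
+>         for (int j = 0; j < 5; j++) std::fscanf(in, "%lf", &params[j]);
+>         std::fscanf(in, "%d", &nrepeats);
+>         std::fclose(in);
+> 
+>         // Run the repeats in parallel on one node.
+>         std::vector<double> loss(nrepeats);
+>         #pragma omp parallel for
+>         for (int i = 0; i < nrepeats; i++)
+>             loss[i] = run_one_evolve(params);
+> 
+>         // Write one loss per repeat for multi_loss_summarize to reduce.
+>         std::FILE* out = std::fopen(argv[2], "w");
+>         for (int i = 0; i < nrepeats; i++)
+>             std::fprintf(out, "%d %g\n", i, loss[i]);
+>         std::fclose(out);
+>         return 0;
+>     }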
+> 
+> This is not very hard to do, but still sounds to me like a week of programming
+> to get it all restructured and tested.  Before investing that effort, we
+> should discuss if it will give you any additional performance gains over just
+> running many optimizations in parallel.
+> 
+> I need to run timings on the annealing cycles to see how they change across
+> the parameter space, to see if we can just increase the repeats to 10,000 with
+> no changes to the code. I think the feasibility of doing this the "easy way"
+> depends on how long the longest annealings take at the high end of the
+> parameter space.
+> 
+> Regards,
+> 
+> - Mike
+
+-----
+
+----- Forwarded Message -----
+From: "Andrey Rzhetsky" <arzhetsk at medicine.bsd.uchicago.edu>
+To: "Michael Wilde" <wilde at mcs.anl.gov>
+Sent: Monday, October 17, 2011 8:40:17 AM
+Subject: Re: Meet today to discuss optimizer?
+
+Hi Mike,
+
+The 6 (or more) annealing repeats can be run in parallel too.
+
+Unfortunately, around 10:15 I have to rush to a CBC meeting in Evanston for
+the rest of the day (we can chat before then if you have a minute; I am in my
+office).
+
+I don't have a Beagle login, unfortunately.
+
+Typically, I will have a sizeable outer loop, so, probably, the current
+24-worker setup is fine.
+
+Thank you very much for helping me out!
+
+All the best,
+
+Andrey
+
+
+---------------------------------------------------------------------------------------
+
+On 10/18/11 1:52 PM, "Michael Wilde" <wilde at mcs.anl.gov> wrote:
+
+> Hi Andrey,
+> 
+> Here's a quick update:
+> 
+> - I am now running the optimizer on Beagle compute nodes under Swift.
+> 
+> I've attached a few tar files of sample runs at reduced parameter values (to
+> shrink the run time for debugging and learning the code's behavior).
+> 
+> Now I'm trying to run some subset of the full-length parameters you gave me in
+> the Python file.  I've got 3 Beagle compute nodes allocated at the moment (72
+> cores total) and I'm seeing these times from multi_loss with N=1000 repeats:
+> 
+> sandbox$ grep multi_  ./jobs/*/*/output/*.out
+> ./jobs/0/optimizer-01p7lhhk/output/T408.R1.out:multi_loss(N=1000) elapsed
+> time: 122.742 seconds 2.04571 minutes
+> ./jobs/0/optimizer-01p7lhhk/output/T408.R1.out:multi_loss(N=1000) elapsed
+> time: 123.979 seconds 2.06631 minutes
+> ./jobs/0/optimizer-01p7lhhk/output/T408.R1.out:multi_loss(N=1000) elapsed
+> time: 123.624 seconds 2.0604 minutes
+> ./jobs/t/optimizer-t0p7lhhk/output/T958.R1.out:multi_loss(N=1000) elapsed
+> time: 1431.09 seconds 23.8514 minutes
+> ./jobs/x/optimizer-x0p7lhhk/output/T708.R1.out:multi_loss(N=1000) elapsed
+> time: 627.074 seconds 10.4512 minutes
+> ./jobs/x/optimizer-x0p7lhhk/output/T708.R1.out:multi_loss(N=1000) elapsed
+> time: 790.652 seconds 13.1775 minutes
+> 
+> 
+> Each run of optimizer is going to a file named T(target).R(repeat).out
+> 
+> So we're seeing 23.8 mins for 1000 repeats at target=958 and 10-13 mins at
+> target=708. The 1000 repeats are spread over 24 cores each.
+> 
+> What's your time availability later in the week to discuss this further, and to
+> see if either (a) I can show you how to run this version or (b) we can get a
+> set of production run descriptions from you and you can run them yourself?
+> 
+> In the compressed tar file at http://www.ci.uchicago.edu/~wilde/AR.snap.01.tgz
+> you will find:
+> 
+> - the Swift script that I use instead of the Python driver to run the
+> optimizer in parallel (along with beagle.xml, which specifies scheduler
+> parameters for Beagle like time, cores, queue name, and project ID)
+> 
+> - the slightly modified version of optimizer (changes in multi_loss() to
+> correct the loops, changes to use OpenMP instead of Grand Central Dispatch,
+> and a few changes in output logging).
+> 
+> - a few run directories of runs with shortened parameter settings.
+> 
+> If we continue working together on this, we should set up a way to share code
+> using a repository like Subversion (svn).  That's pretty easy once you master a
+> few basic commands.
+> 
+> Regards,
+> 
+> - Mike
+> 
+> 
+
+-----
+
+----- Forwarded Message -----
+From: "Andrey Rzhetsky" <arzhetsk at medicine.bsd.uchicago.edu>
+To: "Michael Wilde" <wilde at mcs.anl.gov>
+Sent: Tuesday, October 18, 2011 3:57:36 PM
+Subject: Re: Meet today to discuss optimizer?
+
+Mike,
+
+Thank you!  Are you around now?  I would be also happy to carve some time
+tomorrow, if this works for you.
+
+With kind regards,
+
+Andrey
+
+
+---------------------------------------------------------------------------------------
+
+
+On 10/19/11 12:10 PM, "Michael Wilde" <wilde at mcs.anl.gov> wrote:
+
+> Hi Andrey,
+> 
+> I'm in meetings today till about 3 PM. Are you available at, say, 3:30 or later?
+> 
+> I did a larger run last night. Only one smaller optimizer run *fully*
+> finished, but many others made significant progress.  The results are at:
+> 
+>   http://www.ci.uchicago.edu/~wilde/AR.optimizer.out.2010.1018.tgz
+> 
+> If you have time, could you take a look at that run and see if the
+> optimizations look like they have been running as expected? I've made only a
+> few cosmetic changes to your debug output.
+> 
+> I submitted the run at 21:20; it started running at about 21:27; by about
+> 23:10 it had acquired 12 nodes (x 24 cores each). It ended about 23:18 when
+> the first job exceeded its time limit of 5 hours. I'm still trying to calibrate
+> how much time each optimizer invocation needs, and whether some of the
+> internal iterations can be further spread out.  Also how to organize the run
+> so that optimizations that time out can be re-run with the smallest
+> reasonable failure unit.
+> 
+>
+
+---------------------------------------------------------------------------------------
+
+
+----- Forwarded Message -----
+From: "andrey rzhetsky" <arzhetsk at medicine.bsd.uchicago.edu>
+To: "Michael Wilde" <wilde at mcs.anl.gov>
+Sent: Wednesday, October 26, 2011 8:40:21 PM
+Subject: Re: Question on inner annealing loop
+
+Mike,
+
+
+> I'm confused on 3 points here:
+> 
+> - the inner loop would always be done between 1 and 5 times, right?
+
+Correct.
+
+> - could each of those times really be done in parallel? (I'll try to determine
+> this by inspection).
+
+Not really -- the acceptance of parameter changes depends on the loss in
+between.
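+
+That is, each repeat proposes a change to one parameter and then accepts or
+rejects it based on the loss left behind by the previous repeat, roughly
+like this (an illustrative sketch with hypothetical names and a generic
+annealing-style acceptance test, not the actual optimizer code):
+
+    #include <cmath>
+    #include <cstdlib>
+
+    // Stand-ins for the real pieces:
+    double multi_loss(double params[5]);   // the expensive, internally parallel loss
+    double propose_step();                 // random perturbation of one parameter
+
+    // The serial repeats: pass k+1 needs current_loss as left by pass k,
+    // so these passes cannot run concurrently.
+    void inner_repeats(double params[5], int n_repeats, double T) {
+        double current_loss = multi_loss(params);
+        for (int rep = 0; rep < n_repeats; rep++) {
+            int k = rep % 5;                     // which parameter to perturb
+            double saved = params[k];
+            params[k] += propose_step();         // propose a change
+            double new_loss = multi_loss(params);
+            bool accept = (new_loss <= current_loss) ||
+                std::exp((current_loss - new_loss) / T) >
+                    std::rand() / (double)RAND_MAX;
+            if (accept)
+                current_loss = new_loss;         // accept: carry the loss forward
+            else
+                params[k] = saved;               // reject: restore the old value
+        }
+    }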
+
+> - when we last met in your office, I *thought* you indicated that this inner
+> loop could be done just *once*.  Was that what you meant?  And if so, for
+> which of the 5 vars?
+
+Nope, has to be repeated over and over.
+
+All the very best,
+
+Andrey
+
+
+---------------------------------------------------------------------------------------
+
+
+On 10/26/11 10:42 PM, "Michael Wilde" <wilde at mcs.anl.gov> wrote:
+
+> OK, all that makes sense, Andrey. But then do you recall what you suggested
+> when we met?
+> 
+> Lets label the loops as follows:
+> 
+> a) 20 targets (parallel)
+> b)   15 repeats (parallel)
+> c)     100 Annealing_cycles (serial)
+> d)        6 repeats (serial)
+> e)         1000 to 10000 annealing_repeats (parallel)
+> f)            evolve()
+> 
+> What I recalled from our last discussion was that I should reduce loop (c)
+> from 100 to 50 or 25, and loop (d) to 1.  But since reducing loop (d) doesn't
+> make sense, do you recall suggesting any other reduction?
+> 
+> If not, no problem, I think I know how to proceed.
+> 
+> Thanks,
+> 
+> - Mike
+> 
+> 
+-----
+
+----- Forwarded Message -----
+From: "andrey rzhetsky" <arzhetsk at medicine.bsd.uchicago.edu>
+To: "Michael Wilde" <wilde at mcs.anl.gov>
+Sent: Thursday, October 27, 2011 2:54:06 AM
+Subject: Re: Question on inner annealing loop
+
+Hi Mike,
+
+I suggested reducing (b) to 1.
+
+With kind regards,
+
+Andrey
+
+
+

Modified: SwiftApps/SciColSim/README
===================================================================
--- SwiftApps/SciColSim/README	2012-01-23 18:18:10 UTC (rev 5509)
+++ SwiftApps/SciColSim/README	2012-01-23 18:19:35 UTC (rev 5510)
@@ -1,13 +1,34 @@
+=== Overview
+
+The code simulates the exploration of a chemical network by a research
+community.  The five major parameters determine the strategy of
+exploration, "target" is the number of new interactions discovered and
+"loss" is the number of experiments per one new positive finding.
+
 === Files
 
+Code:
+
 optirun.swift: replaces top-level py script for the outermost loops
 
+Data:
+
+
 === How to build
 
+make
+
 === How to Run
 
+./test0.sh
 
+./testO.sh
 
+./test1.sh
+
+./*.py
+
+
 === C++ app flow logic ===
 
 for target in range(58, 1009 (used 209), 50):  // 20 values
@@ -36,7 +57,9 @@
     1 initial multi_loss: 1000 to 10000 annealing_repeats
     100 Annealing_cycles (groups of 10? : cycle=10 ) (fast:50)
        5 repeats (fast: 1)
-         multi_loss: 1000 to 10000 annealing_repeats
+         multi_loss: 1000 to 10000 evolve re-runs
            evolve()  => 2 mins to 10 mins
 
 === END
+
+



