<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Jun 13, 2018 at 1:42 PM, Smith, Barry F. <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>
<br>
> On Jun 13, 2018, at 1:09 PM, Zhang, Junchao <<a href="mailto:jczhang@mcs.anl.gov">jczhang@mcs.anl.gov</a>> wrote:<br>
> <br>
> Mark,<br>
> Yes, it is 7-point stencil. I tried your options, -pc_gamg_square_graph 0 -pc_gamg_threshold 0.0 -pc_gamg_repartition, and found they increased the time. I did not try hypre since I don't know how to set its options.<br>
> I also tried periodic boundary condition and ran it with -mat_view ::load_balance. It gives fewer KSP iterations and but PETSc still reports load<br>
> imbalance at coarse levels.<br>
<br>
</span> Is the overall scaling better, worse, or the same for periodic boundary conditions?<br></blockquote><div> OK, the queue job finally finished. I would worse with periodic boundary conditions. But they have different KSP iterations and MatSOR call counts. So the answer is not solid. See the attached figure for difference.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div><div class="h5"><br>
> <br>
> <br>
> --Junchao Zhang<br>
> <br>
> On Tue, Jun 12, 2018 at 3:17 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov">mfadams@lbl.gov</a>> wrote:<br>
> This all looks reasonable to me. The VecScatters times are a little high but these are fast little solves (.2 seconds each).<br>
> <br>
> The RAP times are very low, suggesting we could optimize parameters a bit and reduce the iteration count. These are 7 point stencils as I recall. You could try -pc_gamg_square_graph 0 (instead of 1) and you probably want '-pc_gamg_threshold 0.0'. You could also test hypre.<br>
> <br>
> And you should be able to improve coarse grid load imbalance with -pc_gamg_repartition.<br>
> <br>
> Mark<br>
> <br>
> On Tue, Jun 12, 2018 at 12:32 PM, Junchao Zhang <<a href="mailto:jczhang@mcs.anl.gov">jczhang@mcs.anl.gov</a>> wrote:<br>
> Mark,<br>
> I tried "-pc_gamg_type agg ..." options you mentioned, and also -ksp_type cg + PETSc's default PC bjacobi. In the latter case, to reduce execution time I called KSPSolve 100 times instead of 1000, and also used -ksp_max_it 100. In the 36x48=1728 ranks case, I also did a test with -log_sync. From there you can see a lot of time is spent on VecNormBarrier, which implies load imbalance. Note VecScatterBarrie time is misleading, since it barriers ALL ranks, but in reality VecScatter sort of syncs in a small neighborhood.<br>
> Barry suggested trying periodic boundary condition so that the nonzeros are perfectly balanced across processes. I will try that to see what happens.<br>
> <br>
> --Junchao Zhang<br>
> <br>
> On Mon, Jun 11, 2018 at 8:09 AM, Mark Adams <<a href="mailto:mfadams@lbl.gov">mfadams@lbl.gov</a>> wrote:<br>
> <br>
> <br>
> On Mon, Jun 11, 2018 at 12:46 AM, Junchao Zhang <<a href="mailto:jczhang@mcs.anl.gov">jczhang@mcs.anl.gov</a>> wrote:<br>
> I used an LCRC machine named Bebop. I tested on its Intel Broadwell nodes. Each nodes has 2 CPUs and 36 cores in total. I collected data using 36 cores in a node or 18 cores in a node. As you can see, 18 cores/node gave much better performance, which is reasonable as routines like MatSOR, MatMult, MatMultAdd are all bandwidth bound.<br>
> <br>
> The code uses a DMDA 3D grid, 7-point stencil, and defines nodes(vertices) at the surface or second to the surface as boundary nodes. Boundary nodes only have a diagonal one in their row in the matrix. Interior nodes have 7 nonzeros in their row. Boundary processors in the processor grid has less nonzero. This is one source of load-imbalance. Will load-imbalance get severer at coarser grids of an MG level?<br>
> <br>
> Yes. <br>
> <br>
> You can use a simple Jacobi solver to see the basic performance of your operator and machine. Do you see as much time spent in Vec Scatters? VecAXPY? etc.<br>
> <br>
> <br>
> I attach a trace view figure that show activity of each ranks along the time axis in one KSPSove. White color means MPI wait. You can see white takes a large space. <br>
> <br>
> I don't have a good explanation why at large scale (1728 cores), processors wait longer time, as the communication pattern is still 7-point stencil in a cubic processor gird.<br>
> <br>
> --Junchao Zhang<br>
> <br>
> On Sat, Jun 9, 2018 at 11:32 AM, Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>
> <br>
> Junchao,<br>
> <br>
> Thanks, the load balance of matrix entries is remarkably similar for the two runs so it can't be a matter of worse work load imbalance for SOR for the larger case explaining why the SOR takes more time. <br>
> <br>
> Here is my guess (and I know no way to confirm it). In the smaller case the overlap of different processes on the same node running SOR at the same time is lower than the larger case hence the larger case is slower because there are more SOR processes fighting over the same memory bandwidth at the same time than in the smaller case. Ahh, here is something you can try, lets undersubscribe the memory bandwidth needs, run on say 16 processes per node with 8 nodes and 16 processes per node with 64 nodes and send the two -log_view output files. I assume this is an LCRC machine and NOT a KNL system?<br>
> <br>
> Thanks<br>
> <br>
> <br>
> Barry<br>
> <br>
> <br>
> > On Jun 9, 2018, at 8:29 AM, Mark Adams <<a href="mailto:mfadams@lbl.gov">mfadams@lbl.gov</a>> wrote:<br>
> > <br>
> > -pc_gamg_type classical<br>
> > <br>
> > FYI, we only support smoothed aggregation "agg" (the default). (This thread started by saying you were using GAMG.)<br>
> > <br>
> > It is not clear how much this will make a difference for you, but you don't want to use classical because we do not support it. It is meant as a reference implementation for developers.<br>
> > <br>
> > First, how did you get the idea to use classical? If the documentation lead you to believe this was a good thing to do then we need to fix that!<br>
> > <br>
> > Anyway, here is a generic input for GAMG:<br>
> > <br>
> > -pc_type gamg <br>
> > -pc_gamg_type agg <br>
> > -pc_gamg_agg_nsmooths 1 <br>
> > -pc_gamg_coarse_eq_limit 1000 <br>
> > -pc_gamg_reuse_interpolation true <br>
> > -pc_gamg_square_graph 1 <br>
> > -pc_gamg_threshold 0.05 <br>
> > -pc_gamg_threshold_scale .0 <br>
> > <br>
> > <br>
> > <br>
> > <br>
> > On Thu, Jun 7, 2018 at 6:52 PM, Junchao Zhang <<a href="mailto:jczhang@mcs.anl.gov">jczhang@mcs.anl.gov</a>> wrote:<br>
> > OK, I have thought that space was a typo. btw, this option does not show up in -h.<br>
> > I changed number of ranks to use all cores on each node to avoid misleading ratio in -log_view. Since one node has 36 cores, I ran with 6^3=216 ranks, and 12^3=1728 ranks. I also found call counts of MatSOR etc in the two tests were different. So they are not strict weak scaling tests. I tried to add -ksp_max_it 6 -pc_mg_levels 6, but still could not make the two have the same MatSOR count. Anyway, I attached the load balance output.<br>
> > <br>
> > I find PCApply_MG calls PCMGMCycle_Private, which is recursive and indirectly calls MatSOR_MPIAIJ. I believe the following code in MatSOR_MPIAIJ practically syncs {MatSOR, MatMultAdd}_SeqAIJ between processors through VecScatter at each MG level. If SOR and MatMultAdd are imbalanced, the cost is accumulated along MG levels and shows up as large VecScatter cost.<br>
> > 1460: while<br>
> > (its--) {<br>
> > <br>
> > 1461: VecScatterBegin(mat->Mvctx,xx,<wbr>mat->lvec,INSERT_VALUES,<wbr>SCATTER_FORWARD<br>
> > );<br>
> > <br>
> > 1462: VecScatterEnd(mat->Mvctx,xx,<wbr>mat->lvec,INSERT_VALUES,<wbr>SCATTER_FORWARD<br>
> > );<br>
> > <br>
> > <br>
> > 1464: /* update rhs: bb1 = bb - B*x */<br>
> > 1465: VecScale<br>
> > (mat->lvec,-1.0);<br>
> > <br>
> > 1466: (*mat->B->ops->multadd)(mat-><br>
> > B,mat->lvec,bb,bb1);<br>
> > <br>
> > <br>
> > 1468: /* local sweep */<br>
> > 1469: (*mat->A->ops->sor)(mat->A,<wbr>bb1,omega,SOR_SYMMETRIC_SWEEP,<br>
> > fshift,lits,1,xx);<br>
> > <br>
> > 1470: }<br>
> > <br>
> > <br>
> > <br>
> > --Junchao Zhang<br>
> > <br>
> > On Thu, Jun 7, 2018 at 3:11 PM, Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>
> > <br>
> > <br>
> > > On Jun 7, 2018, at 12:27 PM, Zhang, Junchao <<a href="mailto:jczhang@mcs.anl.gov">jczhang@mcs.anl.gov</a>> wrote:<br>
> > > <br>
> > > Searched but could not find this option, -mat_view::load_balance<br>
> > <br>
> > There is a space between the view and the : load_balance is a particular viewer format that causes the printing of load balance information about number of nonzeros in the matrix.<br>
> > <br>
> > Barry<br>
> > <br>
> > > <br>
> > > --Junchao Zhang<br>
> > > <br>
> > > On Thu, Jun 7, 2018 at 10:46 AM, Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>
> > > So the only surprise in the results is the SOR. It is embarrassingly parallel and normally one would not see a jump.<br>
> > > <br>
> > > The load balance for SOR time 1.5 is better at 1000 processes than for 125 processes of 2.1 not worse so this number doesn't easily explain it.<br>
> > > <br>
> > > Could you run the 125 and 1000 with -mat_view ::load_balance and see what you get out?<br>
> > > <br>
> > > Thanks<br>
> > > <br>
> > > Barry<br>
> > > <br>
> > > Notice that the MatSOR time jumps a lot about 5 secs when the -log_sync is on. My only guess is that the MatSOR is sharing memory bandwidth (or some other resource? cores?) with the VecScatter and for some reason this is worse for 1000 cores but I don't know why.<br>
> > > <br>
> > > > On Jun 6, 2018, at 9:13 PM, Junchao Zhang <<a href="mailto:jczhang@mcs.anl.gov">jczhang@mcs.anl.gov</a>> wrote:<br>
> > > > <br>
> > > > Hi, PETSc developers,<br>
> > > > I tested Michael Becker's code. The code calls the same KSPSolve 1000 times in the second stage and needs cubic number of processors to run. I ran with 125 ranks and 1000 ranks, with or without -log_sync option. I attach the log view output files and a scaling loss excel file.<br>
> > > > I profiled the code with 125 processors. It looks {MatSOR, MatMult, MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c took ~50% of the time, The other half time was spent on waiting in MPI. MatSOR_SeqAIJ took 30%, mostly in PetscSparseDenseMinusDot().<br>
> > > > I tested it on a 36 cores/node machine. I found 32 ranks/node gave better performance (about 10%) than 36 ranks/node in the 125 ranks testing. I guess it is because processors in the former had more balanced memory bandwidth. I collected PAPI_DP_OPS (double precision operations) and PAPI_TOT_CYC (total cycles) of the 125 ranks case (see the attached files). It looks ranks at the two ends have less DP_OPS and TOT_CYC. <br>
> > > > Does anyone familiar with the algorithm have quick explanations?<br>
> > > > <br>
> > > > --Junchao Zhang<br>
> > > > <br>
> > > > On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker <<a href="mailto:Michael.Becker@physik.uni-giessen.de">Michael.Becker@physik.uni-<wbr>giessen.de</a>> wrote:<br>
> > > > Hello again,<br>
> > > > <br>
> > > > this took me longer than I anticipated, but here we go.<br>
> > > > I did reruns of the cases where only half the processes per node were used (without -log_sync):<br>
> > > > <br>
> > > > 125 procs,1st 125 procs,2nd 1000 procs,1st 1000 procs,2nd<br>
> > > > Max Ratio Max Ratio Max Ratio Max Ratio<br>
> > > > KSPSolve 1.203E+02 1.0 1.210E+02 1.0 1.399E+02 1.1 1.365E+02 1.0<br>
> > > > VecTDot 6.376E+00 3.7 6.551E+00 4.0 7.885E+00 2.9 7.175E+00 3.4<br>
> > > > VecNorm 4.579E+00 7.1 5.803E+00 10.2 8.534E+00 6.9 6.026E+00 4.9<br>
> > > > VecScale 1.070E-01 2.1 1.129E-01 2.2 1.301E-01 2.5 1.270E-01 2.4<br>
> > > > VecCopy 1.123E-01 1.3 1.149E-01 1.3 1.301E-01 1.6 1.359E-01 1.6<br>
> > > > VecSet 7.063E-01 1.7 6.968E-01 1.7 7.432E-01 1.8 7.425E-01 1.8<br>
> > > > VecAXPY 1.166E+00 1.4 1.167E+00 1.4 1.221E+00 1.5 1.279E+00 1.6<br>
> > > > VecAYPX 1.317E+00 1.6 1.290E+00 1.6 1.536E+00 1.9 1.499E+00 2.0<br>
> > > > VecScatterBegin 6.142E+00 3.2 5.974E+00 2.8 6.448E+00 3.0 6.472E+00 2.9<br>
> > > > VecScatterEnd 3.606E+01 4.2 3.551E+01 4.0 5.244E+01 2.7 4.995E+01 2.7<br>
> > > > MatMult 3.561E+01 1.6 3.403E+01 1.5 3.435E+01 1.4 3.332E+01 1.4<br>
> > > > MatMultAdd 1.124E+01 2.0 1.130E+01 2.1 2.093E+01 2.9 1.995E+01 2.7<br>
> > > > MatMultTranspose 1.372E+01 2.5 1.388E+01 2.6 1.477E+01 2.2 1.381E+01 2.1<br>
> > > > MatSolve 1.949E-02 0.0 1.653E-02 0.0 4.789E-02 0.0 4.466E-02 0.0<br>
> > > > MatSOR 6.610E+01 1.3 6.673E+01 1.3 7.111E+01 1.3 7.105E+01 1.3<br>
> > > > MatResidual 2.647E+01 1.7 2.667E+01 1.7 2.446E+01 1.4 2.467E+01 1.5<br>
> > > > PCSetUpOnBlocks 5.266E-03 1.4 5.295E-03 1.4 5.427E-03 1.5 5.289E-03 1.4<br>
> > > > PCApply 1.031E+02 1.0 1.035E+02 1.0 1.180E+02 1.0 1.164E+02 1.0<br>
> > > > <br>
> > > > I also slimmed down my code and basically wrote a simple weak scaling test (source files attached) so you can profile it yourself. I appreciate the offer Junchao, thank you.<br>
> > > > You can adjust the system size per processor at runtime via "-nodes_per_proc 30" and the number of repeated calls to the function containing KSPsolve() via "-iterations 1000". The physical problem is simply calculating the electric potential from a homogeneous charge distribution, done multiple times to accumulate time in KSPsolve().<br>
> > > > A job would be started using something like<br>
> > > > mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30 -mesh_size 1E-4 -iterations 1000 \\<br>
> > > > -ksp_rtol 1E-6 \<br>
> > > > -log_view -log_sync\<br>
> > > > -pc_type gamg -pc_gamg_type classical\<br>
> > > > -ksp_type cg \<br>
> > > > -ksp_norm_type unpreconditioned \<br>
> > > > -mg_levels_ksp_type richardson \<br>
> > > > -mg_levels_ksp_norm_type none \<br>
> > > > -mg_levels_pc_type sor \<br>
> > > > -mg_levels_ksp_max_it 1 \<br>
> > > > -mg_levels_pc_sor_its 1 \<br>
> > > > -mg_levels_esteig_ksp_type cg \<br>
> > > > -mg_levels_esteig_ksp_max_it 10 \<br>
> > > > -gamg_est_ksp_type cg<br>
> > > > , ideally started on a cube number of processes for a cubical process grid.<br>
> > > > Using 125 processes and 10.000 iterations I get the output in "log_view_125_new.txt", which shows the same imbalance for me.<br>
> > > > Michael<br>
> > > > <br>
> > > > <br>
> > > > Am 02.06.2018 um 13:40 schrieb Mark Adams:<br>
> > > >> <br>
> > > >> <br>
> > > >> On Fri, Jun 1, 2018 at 11:20 PM, Junchao Zhang <<a href="mailto:jczhang@mcs.anl.gov">jczhang@mcs.anl.gov</a>> wrote:<br>
> > > >> Hi,Michael,<br>
> > > >> You can add -log_sync besides -log_view, which adds barriers to certain events but measures barrier time separately from the events. I find this option makes it easier to interpret log_view output.<br>
> > > >> <br>
> > > >> That is great (good to know).<br>
> > > >> <br>
> > > >> This should give us a better idea if your large VecScatter costs are from slow communication or if it catching some sort of load imbalance.<br>
> > > >> <br>
> > > >> <br>
> > > >> --Junchao Zhang<br>
> > > >> <br>
> > > >> On Wed, May 30, 2018 at 3:27 AM, Michael Becker <<a href="mailto:Michael.Becker@physik.uni-giessen.de">Michael.Becker@physik.uni-<wbr>giessen.de</a>> wrote:<br>
> > > >> Barry: On its way. Could take a couple days again.<br>
> > > >> <br>
> > > >> Junchao: I unfortunately don't have access to a cluster with a faster network. This one has a mixed 4X QDR-FDR InfiniBand 2:1 blocking fat-tree network, which I realize causes parallel slowdown if the nodes are not connected to the same switch. Each node has 24 processors (2x12/socket) and four NUMA domains (two for each socket).<br>
> > > >> The ranks are usually not distributed perfectly even, i.e. for 125 processes, of the six required nodes, five would use 21 cores and one 20.<br>
> > > >> Would using another CPU type make a difference communication-wise? I could switch to faster ones (on the same network), but I always assumed this would only improve performance of the stuff that is unrelated to communication.<br>
> > > >> <br>
> > > >> Michael<br>
> > > >> <br>
> > > >> <br>
> > > >> <br>
> > > >>> The log files have something like "Average time for zero size MPI_Send(): 1.84231e-05". It looks you ran on a cluster with a very slow network. A typical machine should give less than 1/10 of the latency you have. An easy way to try is just running the code on a machine with a faster network and see what happens.<br>
> > > >>> <br>
> > > >>> Also, how many cores & numa domains does a compute node have? I could not figure out how you distributed the 125 MPI ranks evenly.<br>
> > > >>> <br>
> > > >>> --Junchao Zhang<br>
> > > >>> <br>
> > > >>> On Tue, May 29, 2018 at 6:18 AM, Michael Becker <<a href="mailto:Michael.Becker@physik.uni-giessen.de">Michael.Becker@physik.uni-<wbr>giessen.de</a>> wrote:<br>
> > > >>> Hello again,<br>
> > > >>> <br>
> > > >>> here are the updated log_view files for 125 and 1000 processors. I ran both problems twice, the first time with all processors per node allocated ("-1.txt"), the second with only half on twice the number of nodes ("-2.txt").<br>
> > > >>> <br>
> > > >>>>> On May 24, 2018, at 12:24 AM, Michael Becker <<a href="mailto:Michael.Becker@physik.uni-giessen.de">Michael.Becker@physik.uni-<wbr>giessen.de</a>><br>
> > > >>>>> wrote:<br>
> > > >>>>> <br>
> > > >>>>> I noticed that for every individual KSP iteration, six vector objects are created and destroyed (with CG, more with e.g. GMRES). <br>
> > > >>>>> <br>
> > > >>>> Hmm, it is certainly not intended at vectors be created and destroyed within each KSPSolve() could you please point us to the code that makes you think they are being created and destroyed? We create all the work vectors at KSPSetUp() and destroy them in KSPReset() not during the solve. Not that this would be a measurable distance.<br>
> > > >>>> <br>
> > > >>> <br>
> > > >>> I mean this, right in the log_view output:<br>
> > > >>> <br>
> > > >>>> Memory usage is given in bytes:<br>
> > > >>>> <br>
> > > >>>> Object Type Creations Destructions Memory Descendants' Mem.<br>
> > > >>>> Reports information only for process 0.<br>
> > > >>>> <br>
> > > >>>> --- Event Stage 0: Main Stage<br>
> > > >>>> <br>
> > > >>>> ...<br>
> > > >>>> <br>
> > > >>>> --- Event Stage 1: First Solve<br>
> > > >>>> <br>
> > > >>>> ...<br>
> > > >>>> <br>
> > > >>>> --- Event Stage 2: Remaining Solves<br>
> > > >>>> <br>
> > > >>>> Vector 23904 23904 1295501184 0.<br>
> > > >>> I logged the exact number of KSP iterations over the 999 timesteps and its exactly 23904/6 = 3984.<br>
> > > >>> Michael<br>
> > > >>> <br>
> > > >>> <br>
> > > >>> Am 24.05.2018 um 19:50 schrieb Smith, Barry F.:<br>
> > > >>>> <br>
> > > >>>> Please send the log file for 1000 with cg as the solver.<br>
> > > >>>> <br>
> > > >>>> You should make a bar chart of each event for the two cases to see which ones are taking more time and which are taking less (we cannot tell with the two logs you sent us since they are for different solvers.)<br>
> > > >>>> <br>
> > > >>>> <br>
> > > >>>> <br>
> > > >>>> <br>
> > > >>>>> On May 24, 2018, at 12:24 AM, Michael Becker <<a href="mailto:Michael.Becker@physik.uni-giessen.de">Michael.Becker@physik.uni-<wbr>giessen.de</a>><br>
> > > >>>>> wrote:<br>
> > > >>>>> <br>
> > > >>>>> I noticed that for every individual KSP iteration, six vector objects are created and destroyed (with CG, more with e.g. GMRES). <br>
> > > >>>>> <br>
> > > >>>> Hmm, it is certainly not intended at vectors be created and destroyed within each KSPSolve() could you please point us to the code that makes you think they are being created and destroyed? We create all the work vectors at KSPSetUp() and destroy them in KSPReset() not during the solve. Not that this would be a measurable distance.<br>
> > > >>>> <br>
> > > >>>> <br>
> > > >>>> <br>
> > > >>>> <br>
> > > >>>>> This seems kind of wasteful, is this supposed to be like this? Is this even the reason for my problems? Apart from that, everything seems quite normal to me (but I'm not the expert here).<br>
> > > >>>>> <br>
> > > >>>>> <br>
> > > >>>>> Thanks in advance.<br>
> > > >>>>> <br>
> > > >>>>> Michael<br>
> > > >>>>> <br>
> > > >>>>> <br>
> > > >>>>> <br>
> > > >>>>> <log_view_125procs.txt><log_vi<br>
> > > >>>>> ew_1000procs.txt><br>
> > > >>>>> <br>
> > > >>> <br>
> > > >>> <br>
> > > >> <br>
> > > >> <br>
> > > >> <br>
> > > > <br>
> > > > <br>
> > > > <o-wstest-125.txt><Scaling-<wbr>loss.png><o-wstest-1000.txt><<wbr>o-wstest-sync-125.txt><o-<wbr>wstest-sync-1000.txt><MatSOR_<wbr>SeqAIJ.png><PAPI_TOT_CYC.png><<wbr>PAPI_DP_OPS.png><br>
> > > <br>
> > > <br>
> > <br>
> > <br>
> > <br>
> <br>
> <br>
> <br>
> <br>
> <br>
> <br>
</div></div>> <o-wstest-periodic-36x6.txt><<wbr>o-wstest-gamg-agg-repartition-<wbr>36x6.txt><br>
<br>
</blockquote></div><br></div></div>