<div dir="ltr"><div>Thank you Richard for your explanation.</div><div><br></div><div>I first changed the way I was running the code. In the machine there's SLURM and I was using "srun -n <number-of-MPI-procs> ./my_program.x". I've seen more than 60% improvement in execution time by just running with "srun -n <number-of-MPI-procs> --ntasks-per-core=1 --cpu-bind=cores> ./my_program.x". All the results I'm sharing now were obtained by running with the second command.<br></div><div><br></div><div>1) I've tested the configuration you suggested me and I've found mixed results. For one system (matrix) size and distribution of memory per node, there's a small difference in execution time (say, it goes from 208 seconds to 200 seconds) between the two cases: I_MPI_PIN_DOMAIN=socket and I_MPI_PIN_DOMAIN=auto:compact. In another case, I obtained that using I_MPI_PIN_DOMAIN=auto:socket gives actually worse results: I go from an execution time of 5 second to 11 seconds, when using "socket" and "auto:compact" configurations respectively. In this case, I see that the log events that differ in execution time are MatMult, VecNorm and VecScatterEnd. I also see that the imbalance for those log events is smaller when using "socket" (consequently the time difference).<br></div><div><br></div><div>2) I also tested using I_MPI_PIN_PROCESSOR_LIST=0-63 and I get the same numbers that I get when using I_MPI_PIN_DOMAIN=socket (I've seen the BIOS numbering and those correspond to the first of the CPUs in each core).<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">El mar., 25 jun. 2019 a las 2:32, Mills, Richard Tran (<<a href="mailto:rtmills@anl.gov" target="_blank">rtmills@anl.gov</a>>) escribió:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
Hi Ale,<br>
<br>
I don't know if this has anything to do with the strange performance you are seeing, but I notice that some of your Intel MPI settings are inconsistent and I'm not sure what you are intending. You have specified a value for I_MPI_PIN_DOMAIN and also a value
for I_MPI_PIN_PROCESSOR_LIST. You can specify the domains that each MPI rank is pinned to by specifying one of these, but not both. According to what I found by a Google search for the Intel MPI documentation, if you specify both, it is I_MPI_PIN_DOMAIN that
gets used.<br>
<br>
A KNL node has only one socket (though it can be made to appear as if it has four virtual sockets if booted into SNC-4 mode, which doesn't seem to be a popular usage model). If you run with "I_MPI_PIN_DOMAIN=socket", then each of your MPI ranks on
a node will be "pinned" to a domain that consists of all of the CPU cores available in a socket, so the MPI processes aren't really pinned at all, and can be migrated around by the system. You may see better performance by having your MPI implementation pin
processes to more restrictive domains. I suggest first trying<br>
<br>
I_MPI_PIN_DOMAIN=auto:compact<br>
<br>
which I believe is the default Intel MPI behavior. This will create domains by dividing the number of logical CPUs (# of hardware threads) by the number of MPI processes being started on the node, and they will be "compact" in the sense that domain members
will be as "close" to each other as possible in terms of sharing cores, caches, etc.<br>
<br>
I think that Intel MPI is ignoring your I_MPI_PIN_PROCESSOR_LIST setting, but I'll go ahead and point out that, if you are running with 64-core KNL nodes, there are 4*64 = 256 logical processors available, because each core supports four hardware threads. Asking
for logical processors 0-63 might not actually be using all of the cores, as, depending on the BIOS numbering (which can be arbitrary), some of these logical processors may actually be hardware threads that share the same core.<br>
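<br>
If it helps, a tiny MPI test along the following lines (just a sketch; it assumes a Linux node, where sched_getcpu() is available) prints the logical CPU each rank is running on, so you can see directly which CPUs a given pinning setting actually uses:<br>
<pre>
/* Sketch (not from PETSc or LSQuantumED): report where each MPI rank runs.
 * Assumes Linux/glibc so that sched_getcpu() is available. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int  rank, len;
  char host[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(host, &len);
  printf("rank %d on %s, logical CPU %d\n", rank, host, sched_getcpu());
  MPI_Finalize();
  return 0;
}
</pre>
Running it under the different I_MPI_PIN_DOMAIN settings makes the resulting placement visible immediately.<br>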
<br>
Best regards,<br>
Richard<br>
<br>
<div class="gmail-m_7145753118563305574gmail-m_2715660990897490211gmail-m_1803641538053962811moz-cite-prefix">On 6/24/19 4:19 AM, Ale Foggia via petsc-users wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Yes, I used KNL nodes. If you can perform the test, that would be great. Could it be that I'm not using the correct configuration of the KNL nodes? These are the environment variables I set:<br>
MKL_NUM_THREADS=1<br>
OMP_NUM_THREADS=1<br>
KMP_HW_SUBSET=1t<br>
KMP_AFFINITY=compact<br>
I_MPI_PIN_DOMAIN=socket<br>
I_MPI_PIN_PROCESSOR_LIST=0-63<br>
MKL_DYNAMIC=0</div>
<div><br>
</div>
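<div>(In case it is useful: the following is a small sketch, not part of LSQuantumED, that makes rank 0 print the pinning-related variables it actually sees at run time, so every log file can be matched to the settings it was run with.)<br>
</div>
<pre>
/* Sketch: have rank 0 print the pinning-related environment it actually sees
 * at run time.  The variable names are the ones listed above. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  const char *vars[] = {"I_MPI_PIN_DOMAIN", "I_MPI_PIN_PROCESSOR_LIST",
                        "KMP_AFFINITY", "KMP_HW_SUBSET", "OMP_NUM_THREADS",
                        "MKL_NUM_THREADS", "MKL_DYNAMIC"};
  int rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    for (size_t i = 0; i < sizeof(vars)/sizeof(vars[0]); i++) {
      const char *v = getenv(vars[i]);
      printf("%s=%s\n", vars[i], v ? v : "(unset)");
    }
  }
  MPI_Finalize();
  return 0;
}
</pre>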
<div>The code is in <a href="https://github.com/amfoggia/LSQuantumED" target="_blank">
https://github.com/amfoggia/LSQuantumED</a> and it has a README explaining how to compile and run it. When I ran the test I used only 32 processes per node and 1024 nodes in total, and it's for nspins=38.</div>
<div>Thank you<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">El vie., 21 jun. 2019 a las 20:03, Zhang, Junchao (<<a href="mailto:jczhang@mcs.anl.gov" target="_blank">jczhang@mcs.anl.gov</a>>) escribió:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">Ale,
<div> Did you use Intel KNL nodes? Mr. Hong (cc'ed) did experiments on KNL nodes one year ago. He used 32768 processors, called MatAssemblyEnd 118 times, and it took only 1.5 seconds in total. So I suspect something is wrong with your test. If you can
share your code, I can run a test on our machine to see how it goes.</div>
<div> Thanks.<br clear="all">
<div>
<div dir="ltr" class="gmail-m_7145753118563305574gmail-m_2715660990897490211gmail-m_1803641538053962811gmail-m_-2608733206774855066m_-3757742888527942969gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang <<a href="mailto:jczhang@mcs.anl.gov" target="_blank">jczhang@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">MatAssembly was called once (in stage 5) and cost 2.5% of the total time. Look at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF, which does global synchronization. The high max/min ratio means load imbalance. What I do not understand
is MatAssemblyEnd. The ratio is 1.0. It means processors are already synchronized. With 32768 processors, there are 1.2e+06 messages with average length 1.9e+06 bytes. So each processor sends 36 (1.2e+06/32768) ~2MB messages and it takes 54 seconds. Another
chance is the reduction at MatAssemblyEnd. I don't know why it needs 8 reductions. In my mind, one is enough. I need to look at the code.
<div><br>
</div>
<div><font face="courier new, monospace">Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- -- Message Lengths -- -- Reductions --<br>
Avg %Total Avg %Total Count %Total Avg %Total Count %Total
<br>
0: Main Stage: 8.5045e+02 13.0% 3.0633e+15 14.0% 8.196e+07 13.1% 7.768e+06 13.1% 2.530e+02 13.0%
<br>
1: Create Basis: 7.9234e-02 0.0% 0.0000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
<br>
2: Create Lattice: 8.3944e-05 0.0% 0.0000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
<br>
3: Create Hamilt: 1.0694e+02 1.6% 0.0000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0% 2.000e+00 0.1%
<br>
5: Offdiag: 1.6525e+02 2.5% 0.0000e+00 0.0% 1.188e+06 0.2% 1.942e+06 0.0% 8.000e+00 0.4%
<br>
6: Phys quantities: 5.4045e+03 82.8% 1.8866e+16 86.0% 5.417e+08 86.7% 7.768e+06 86.8% 1.674e+03 86.1% </font><br>
</div>
<div><br>
</div>
<div>--- Event Stage 5: Offdiag<br>
</div>
<div><span>BuildTwoSidedF 1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 28 0 0 0 0 0</span><br>
</div>
<div><font face="courier new, monospace">MatAssemblyBegin 1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 28 0 0 0 0 0<br>
MatAssemblyEnd 1 1.0 5.3762e+01 1.0 0.00e+00 0.0 1.2e+06 1.9e+06 8.0e+00 1 0 0 0 0 33 0100100100 0<br>
VecSet 1 1.0 7.5533e-02 9.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0</font><br>
</div>
<div><br>
</div>
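<div>As an aside, and repeating the suggestion from my earlier message below: if no rank is supposed to set remote entries, you can declare that before filling the matrix, so any accidental off-process insertion becomes an error during assembly. A minimal sketch (the 1D Laplacian and the size N are placeholders, not your Hamiltonian):<br>
</div>
<pre>
#include <petscmat.h>

/* Sketch: assemble a 1D Laplacian touching only locally owned rows, with
 * MAT_NO_OFF_PROC_ENTRIES set so that any off-process insertion errors out. */
PetscErrorCode BuildLocalLaplacian(MPI_Comm comm, PetscInt N, Mat *A)
{
  PetscErrorCode ierr;
  PetscInt       i, Istart, Iend;

  PetscFunctionBeginUser;
  ierr = MatCreate(comm, A);CHKERRQ(ierr);
  ierr = MatSetSizes(*A, PETSC_DECIDE, PETSC_DECIDE, N, N);CHKERRQ(ierr);
  ierr = MatSetFromOptions(*A);CHKERRQ(ierr);
  ierr = MatSetUp(*A);CHKERRQ(ierr);
  ierr = MatSetOption(*A, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);CHKERRQ(ierr);

  ierr = MatGetOwnershipRange(*A, &Istart, &Iend);CHKERRQ(ierr);
  for (i = Istart; i < Iend; i++) {             /* only locally owned rows */
    if (i > 0)   {ierr = MatSetValue(*A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
    if (i < N-1) {ierr = MatSetValue(*A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
    ierr = MatSetValue(*A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
</pre>
<div>This does not remove the scatter-building communication seen in MatAssemblyEnd above; it only guarantees that no entry-stashing communication is hiding in the assembly.<br>
</div>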
<br clear="all">
<div>
<div dir="ltr" class="gmail-m_7145753118563305574gmail-m_2715660990897490211gmail-m_1803641538053962811gmail-m_-2608733206774855066gmail-m_-3757742888527942969gmail-m_1676599161167762965m_3414052773721223084gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
The load balance is definitely out of whack. <br>
<br>
<br>
<br>
BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
MatMult 138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07 7.8e+06 0.0e+00 2 4 13 13 0 15 25100100 0 2935476<br>
MatAssemblyBegin 1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
VecNorm 2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00 0.0e+00 2.0e+00 1 0 0 0 0 5 0 0 0 1 12780<br>
VecCopy 6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
VecAXPY 2 1.0 1.3793e-02 2.7 1.73e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 41000838<br>
VecScatterBegin 138 1.0 1.1653e+0285.8 0.00e+00 0.0 8.2e+07 7.8e+06 0.0e+00 1 0 13 13 0 4 0100100 0 0<br>
VecScatterEnd 138 1.0 1.3653e+0222.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 4 0 0 0 0 0<br>
VecSetRandom 1 1.0 9.6668e-01 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
<br>
Note that VecCopy/AXPY/SetRandom, which are all embarrassingly parallel, have a balance ratio above 2, which means some processes have more than twice the work of others. Meanwhile, the ratio for anything with communication is extremely imbalanced: some processes
get to the synchronization point well before other processes. <br>
<br>
The first thing I would do is worry about the load imbalance: what is its cause? Is it one process with much less work than the others (not great, but not terrible), one process with much more work than the others (terrible), or something in between? I think
once you get a handle on the load balance the rest may fall into place; otherwise we still have some exploring to do. This is not expected behavior for a good machine with a good network and a well-balanced job. After you understand the load balancing you
may need to use one of the parallel performance visualization tools to see why the synchronization is out of whack.<br>
<br>
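For example, something as small as the following sketch (not taken from your code; it only looks at the row distribution) tells you right away whether the partitioning itself is lopsided or the imbalance comes from somewhere else, such as the nonzeros per row or the communication pattern.<br>
<pre>
#include <petscmat.h>

/* Sketch: report the min/max/average number of locally owned rows across
 * ranks.  A large max/min here means the data distribution itself is
 * unbalanced; a ratio near 1 points the finger elsewhere. */
PetscErrorCode ReportRowBalance(Mat A)
{
  PetscErrorCode ierr;
  MPI_Comm       comm;
  PetscMPIInt    size;
  PetscInt       nlocal, nmin, nmax, nsum;

  PetscFunctionBeginUser;
  ierr = PetscObjectGetComm((PetscObject)A, &comm);CHKERRQ(ierr);
  ierr = MPI_Comm_size(comm, &size);CHKERRQ(ierr);
  ierr = MatGetLocalSize(A, &nlocal, NULL);CHKERRQ(ierr);
  ierr = MPI_Allreduce(&nlocal, &nmin, 1, MPIU_INT, MPI_MIN, comm);CHKERRQ(ierr);
  ierr = MPI_Allreduce(&nlocal, &nmax, 1, MPIU_INT, MPI_MAX, comm);CHKERRQ(ierr);
  ierr = MPI_Allreduce(&nlocal, &nsum, 1, MPIU_INT, MPI_SUM, comm);CHKERRQ(ierr);
  ierr = PetscPrintf(comm, "local rows: min %ld  max %ld  avg %ld\n",
                     (long)nmin, (long)nmax, (long)(nsum/size));CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
</pre>
<br>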
Good luck<br>
<br>
Barry<br>
<br>
<br>
> On Jun 21, 2019, at 9:27 AM, Ale Foggia <<a href="mailto:amfoggia@gmail.com" target="_blank">amfoggia@gmail.com</a>> wrote:<br>
> <br>
> I'm sending one with a bit less time.<br>
> I'm also timing the functions with std::chrono, and for the case of 180 seconds the program runs out of memory (and crashes) before the PETSc log gets printed, so I know that time only from my own timing. Anyway, in every case the times from std::chrono
and the PETSc log match.<br>
> <br>
> (The large times are in the part "4b- Building offdiagonal part", i.e. "Event Stage 5: Offdiag").<br>
> <br>
> On Fri, Jun 21, 2019 at 4:09 PM, Zhang, Junchao (<<a href="mailto:jczhang@mcs.anl.gov" target="_blank">jczhang@mcs.anl.gov</a>>) wrote:<br>
> <br>
> <br>
> On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia <<a href="mailto:amfoggia@gmail.com" target="_blank">amfoggia@gmail.com</a>> wrote:<br>
> Thanks both of you for your answers,<br>
> <br>
> On Thu, Jun 20, 2019 at 10:20 PM, Smith, Barry F. (<<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>>) wrote:<br>
> <br>
> Note that this is a one-time cost if the nonzero structure of the matrix stays the same. It will not happen in future MatAssemblies.<br>
> <br>
> > On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
> > <br>
> > Those messages were used to build the MatMult communication pattern for the matrix. They were not the matrix-entry passing you imagined, but they do indeed happen in MatAssemblyEnd. If you want to make sure processors do not set remote entries, you can
use MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an error when an off-process entry is set.<br>
> <br>
> I started being concerned about this when I saw that the assembly was taking a few hundred seconds in my code, like 180 seconds, which for me is a considerable time. Do you think (or maybe you need more information to answer this) that this time is
"reasonable" for communicating the pattern of the matrix? I already checked that I'm not setting any remote entries.
<br>
> It is not reasonable. Could you send the log view of the test with the 180-second MatAssembly?<br>
> <br>
> Also, I see (in my code) that even though no messages are passed during MatAssemblyBegin, it still takes time and the "ratio" is very large.<br>
> <br>
> > <br>
> > <br>
> > --Junchao Zhang<br>
> > <br>
> > <br>
> > On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
> > Hello all!<br>
> > <br>
> > During the conference I showed you a problem happening during MatAssemblyEnd in a particular code that I have. Now, I have tried the same with a simple code (a symmetric problem corresponding to the Laplacian operator in 1D, from the SLEPc Hands-On exercises).
As I understand it (and please correct me if I'm wrong), in this case the elements of the matrix are computed locally by each process, so there should not be any communication during the assembly. However, the log shows that messages are being passed.
Also, the number of messages changes with the number of processes used and the size of the matrix. Could you please help me understand this?<br>
> > <br>
> > I attach the code I used and the log I get for a small problem.<br>
> > <br>
> > Cheers,<br>
> > Ale<br>
> > <br>
> <br>
> <log.txt><br>
<br>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br>
</div>
</blockquote></div>