[petsc-dev] VecScatter scaling problem on KNL

Choong-Seock Chang cschang at pppl.gov
Thu Mar 9 05:55:43 CST 2017


Barry,
By the way, we have another solver kernel that is much more computationally intensive than the global field-solver kernel: the nonlinear Fokker-Planck operator kernel.
The solver in this kernel is extremely well parallelized, since it is a local operation at the grid-node level.
CS

> On Mar 9, 2017, at 6:20 AM, Choong-Seock Chang <cschang at pppl.gov> wrote:
> 
> Hi Barry,
> Thanks for helping out on this. With your and your team’s help, I trust that the PETSc issue will be resolved soon.  We have not seen this peculiar issue at large scale on leadership-class computers other than Cori, and I do not think it has been a problem at smaller scale on Cori.
> 
> About your BTW statement below, that "any implementation that requires storing the 'entire' vector on each process is, by definition, not scalable," yes, we are fully aware of this issue (the statement applies to our PETSc solver data on the grid).
> However, the PETSc solver consumes only a very small fraction of the total computing time, ~2%, in our particle-in-cell XGC code, which uses a few hundred billion particles and is approaching a trillion particles (XGC is a full-function 5D code, as opposed to the usual perturbative delta-f 5D codes, which do not need as many particles). The grid data is small compared to the particle data.  The good scalability in XGC comes from the particle operations, which are well decomposed into multi-dimensional domains.  
> The trade-off between this technique and grid-data domain decomposition has been the data-movement overhead incurred at every electron subcycling time step when we domain-decompose the grid data (especially on the GPU-CPU machine Titan), since the electrons move very fast between different physical domains.  We have found that the computation becomes much more efficient when we replicate the entire grid data (which is small) on each process during the ~60 electron subcycling steps per ion time step.  [For the slow ion motions, we can use domain-decomposed grid data.]  So far, our physics problem size, even for ITER, has not suffered from this technique.  Also, the shared-memory size on each node is increasing on next-generation machines.  Thus, multi-dimensional grid-data parallelization has been a lower-priority issue so far.
> 
> However, in preparation for future electron physics that may require huge grid memory, we have a plan in place to parallelize the grid memory in multiple dimensions together with the already well-parallelized particle data.  This plan is being executed primarily by Mark Shephard, in collaboration with Mark Adams.  I believe Mark Adams also has his own plan for moving in this direction.
> 
> Your continued good advice will be highly appreciated.
> With Best Regards,
> CS
> 
> 
>> On Mar 8, 2017, at 11:36 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>> 
>> 
>> Mark,
>> 
>>  Ok, in this situation VecScatter cannot detect that it is an all-to-all, so it will generate a message from each process to every other process. Given my past experience with Cray MPI (why do they even have their own MPI when Intel provides one; in fact, why does Cray even exist when they just take other people's products and put their name on them), I am not totally surprised if the Cray MPI chokes on this flood of messages.
>> 
>>  1) Test with Intel MPI, perhaps they handle this case in a scalable way
>> 
>>   2) If Intel MPI also produces poor performance (interesting, how come this wasn't a bottleneck for the code on other systems in the past?), then the easiest solution is to separate the operation into two parts: use VecScatterCreateToAll() to get all the data to all the processes, and then use another (purely sequential) VecScatter to move the data from this intermediate buffer into the final destination vector that has the "extra" locations for the boundary conditions. A rough sketch follows.
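>> 
>>   A minimal, untested sketch of that two-stage approach, written in the style of Mark's Fortran snippet further down (allvec, toall, and localscat are assumed names; nreal, petsc_xgc, vec, and this%xVec are reused from his code):
>> 
>>   Vec            allvec
>>   VecScatter     toall, localscat
>>   IS             is
>>   PetscErrorCode ierr
>> 
>>   ! stage 1: gather the whole parallel vector onto every process
>>   ! (done with a single MPI_Allgatherv instead of point-to-point messages)
>>   call VecScatterCreateToAll(this%xVec,toall,allvec,ierr)
>>   call VecScatterBegin(toall,this%xVec,allvec,INSERT_VALUES,SCATTER_FORWARD,ierr)
>>   call VecScatterEnd(toall,this%xVec,allvec,INSERT_VALUES,SCATTER_FORWARD,ierr)
>> 
>>   ! stage 2: purely local scatter from the gathered buffer into the larger
>>   ! sequential vector that carries the extra boundary-condition slots
>>   call ISCreateGeneral(PETSC_COMM_SELF,nreal,petsc_xgc,PETSC_COPY_VALUES,is,ierr)
>>   call VecScatterCreate(allvec,PETSC_NULL_OBJECT,vec,is,localscat,ierr)
>>   call VecScatterBegin(localscat,allvec,vec,INSERT_VALUES,SCATTER_FORWARD,ierr)
>>   call VecScatterEnd(localscat,allvec,vec,INSERT_VALUES,SCATTER_FORWARD,ierr)
>> 
>>   The second scatter involves no MPI communication, so the flood of small messages disappears; only the single MPI_Allgatherv remains.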
>> 
>> BTW: You know this already, but any implementation that requires storing the "entire" vector on each process is, by definition, not scalable and hence should not even be considered for funding by ECP or SciDAC.
>> 
>> 
>> Barry
>> 
>> 
>>> On Mar 8, 2017, at 8:43 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>> 
>>>> 
>>>>  Is the scatter created with VecScatterCreateToAll()? If so, internally the VecScatterBegin/End will use VecScatterBegin_MPI_ToAll(), which then uses an MPI_Allgatherv() to do the communication.  You can check for this in the debugger (on 2 processes) by just putting a breakpoint in VecScatterBegin_MPI_ToAll() to confirm whether it is called.
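>>>> 
>>>>  For example, a sketch of that check, assuming gdb and PETSc's -start_in_debugger option (the executable name here is a placeholder):
>>>> 
>>>>    mpiexec -n 2 ./app -start_in_debugger
>>>>    (gdb) break VecScatterBegin_MPI_ToAll
>>>>    (gdb) continue
>>>> 
>>>>  If the breakpoint is hit, the MPI_Allgatherv-based ToAll path is in use.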
>>> 
>>> Alas, I did not use VecScatterCreateToAll, and switching to
>>> VecScatterCreateToAll will take some code changes.
>>> 
>>> There are boundary conditions in the destination vector, and so we
>>> scatter into a vector larger than the global size of the PETSc
>>> vector, using a general IS, with code that looks like this:
>>> 
>>> ! index set giving, for each of the nreal PETSc entries, its slot in the larger XGC-side vector
>>> call ISCreateGeneral(PETSC_COMM_SELF,nreal,petsc_xgc,PETSC_COPY_VALUES,is,ierr)
>>> ! scatter every entry of the parallel xVec (source IS is NULL) into vec at the locations in is
>>> call VecScatterCreate(this%xVec,PETSC_NULL_OBJECT,vec,is,this%from_petsc,ierr)
>>> ! reverse scatter object
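>>> 
>>> A sketch of how such a scatter would then be applied (the actual call
>>> site is not shown in this mail, so this usage is an assumption):
>>> 
>>> call VecScatterBegin(this%from_petsc,this%xVec,vec,INSERT_VALUES,SCATTER_FORWARD,ierr)
>>> call VecScatterEnd(this%from_petsc,this%xVec,vec,INSERT_VALUES,SCATTER_FORWARD,ierr)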
>>> 
>>> If we want to make this change, then I could help a developer, or
>>> you can get me set up with a (small) test problem and a branch and
>>> I can do it at NERSC.
>>> 
>>> Thanks,
>> 
> 



