since developing object-oriented software is so cumbersome in C and we are all resistant to doing it in C++
Barry Smith
bsmith at mcs.anl.gov
Mon Dec 7 19:12:51 CST 2009
On Dec 7, 2009, at 3:17 PM, Matthew Knepley wrote:
> On Mon, Dec 7, 2009 at 3:05 PM, Chris Kees <christopher.e.kees at usace.army.mil> wrote:
> This whole thread has been really interesting. I'm interested in
> Barry's suggestion, and I know a handful of people not on the list
> who would also be interested. I've got a couple of questions/
> responses to the MatSetValues, point-wise physics, and "support
> tools" issues that were raised.
>
> Did you guys settle the issue that Jed brought up about the
> Python Global Interpreter Lock (GIL)? There was some discussion of
> it at SciPy this summer, but I didn't come away feeling like anybody
> had an immediate solution or that Guido was in favor of eliminating
> the GIL anytime soon. Are you guys assuming that the "worker
> threads" for the multiple cores/GPUs are going to be set up by C
> code (hand- or machine-generated)?
>
> I think this is going to be solved by 1) rearranging the code
> to segregate reductions or conflicts, 2) using atomics, and then 3)
> intelligent reordering to minimize the
> conflicts. I have never seen this be a big problem, but some code is
> always going to have the bad tradeoff I guess.
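For illustration, the atomics step 2) might look roughly like this with
OpenMP (all names here are made up for the sketch; this is not PETSc code,
and compute_element_matrix/csr_index are hypothetical helpers):

  #include <omp.h>

  #define NDOF 4  /* dofs per element, illustrative */

  /* Hypothetical user kernel and nonzero-index lookup. */
  void compute_element_matrix(int e, double Ke[NDOF][NDOF]);
  int  csr_index(int row, int col);

  /* Assemble into a shared CSR value array; conflicting updates from
     different threads are protected with an atomic add. */
  void assemble_atomic(int nelements, const int (*element_dof)[NDOF],
                       double *values)
  {
    #pragma omp parallel for
    for (int e = 0; e < nelements; e++) {
      double Ke[NDOF][NDOF];
      compute_element_matrix(e, Ke);               /* pointwise physics */
      for (int i = 0; i < NDOF; i++) {
        for (int j = 0; j < NDOF; j++) {
          int idx = csr_index(element_dof[e][i], element_dof[e][j]);
          #pragma omp atomic
          values[idx] += Ke[i][j];                 /* step 2: atomics */
        }
      }
    }
  }

With step 3), the elements would first be reordered (e.g. colored) so that
most of these atomic updates never actually conflict.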
>
> It seems to me that the MatSetValues discussion moved on from the
> threading issue (since you can't call MatSetValues efficiently in
> python anyway) to the more general issue of how PETSc is going to
> explicitly support shared memory parallelism. I interpret the last
> interchange as implying that the solution is another level of domain
> decomposition that would be hidden from the mat assembly loop level
> (i.e. users aren't going to have to write their code as if the
> "thread part" is their parallel subdomain). Is that a correct
> interpretation? Something along the lines of additional args to
> MatCreate that indicate how many threads to use per subdomain and
> some OpenMP-style directives to parallelize the assembly loop?
>
> I am not advocating that. I want a CUDA-like solution, but of course
> you can probably convert any one to the other with a
> tremendous amount of pain. I wish all applications were matrix-free,
> with only vector assemblies, because then I think the problem
> is very simple. I have not seen any compelling argument against this
> approach, and all the best codes I know use it.
Matt lives in a world where 1) people are smart enough to
recognize a better approach and 2) are willing to change the way they
do things to take advantage of a better approach.
I live in a world where people are very conservative and will
rarely change anything in the way they think and do things no matter
how much better it may be. So, I am not "advocating" hierarchical
storage and access of the matrix within a shared-memory level, but I
think it may be unavoidable. I imagine the user will write code that
represents a "task" that generates matrix entries, and then these tasks
are scheduled across the threads somehow. So the user will be aware
they are writing something that will be run in parallel, but they
won't need to worry about the details of scheduling it.
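Very roughly, such a user-visible "task" might look like the following
(entirely hypothetical names, not an existing or proposed PETSc
interface):

  /* Hypothetical task-based assembly: the user supplies a callback that
     generates the matrix entries for one chunk of work; the library
     schedules the tasks across worker threads and merges the buffers
     into its (possibly hierarchically stored) matrix. */
  typedef struct {
    int     nrows, ncols;
    int    *rows, *cols;   /* global indices the task contributes to */
    double *vals;          /* nrows*ncols entries, row-major */
  } MatEntryBuffer;

  typedef int (*MatEntryTask)(int task_id, MatEntryBuffer *buf, void *ctx);

  int MatAssembleFromTasks(void *mat, int ntasks, MatEntryTask task, void *ctx);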
Barry
>
> On the issue of the point-wise physics (e.g. quadrature point
> routines, Riemann solvers, ...). It seems like this is a long-
> standing bottleneck for object-oriented approaches to PDEs, going back
> to the OO FEM research in the 80s and 90s. The SciPy group that met
> to talk about PDEs spent some time on it, partly related to the femhub
> project. I agree that a code generation approach seems to be the
> only way forward. Ondrej Certik from the femhub project has written
> a symbolic math module for Python, and we talked some about using
> it to represent the physics in a way that can be used for code
> generation. I'm not sure what progress he's made. At the end of the
> day you've got to allow people to write code for physics without
> knowledge of the assembly process while still injecting that code
> into the inner loop; otherwise you'll never approach the speed of
> monolithic Fortran codes.
>
> This solution has been "obvious" for some time, but needs a solution
> even a Fortran programmer could love. Or at least a bad C programmer.
>
> One issue that you might also want to consider with multi-language
> approaches is the configuration and build tools. I agree that the
> debugging support is a major issue, but what has taken far more of
> my time is the lack of cross-platform configuration and build tools
> for multi-language software. We have looked at CMake and SCons but
> have had to stick with a pretty ugly mixture of Python scripts
> and make. I feel like other scientific computing packages
> like Sage and EPD are successfully growing because they distribute
> binaries that include all dependencies, but I don't see that being
> feasible for HPC software anytime soon. It seems like to really be
> successful you may have to rely on a full-featured and fully
> documented tool like SCons is aiming to be; otherwise you'll end
> up with an approach that works for you (because you wrote it) but
> not very well for others trying to use the new PETSc.
>
> I think we do a better job than anyone with PETSc right now, but we
> will have to work a little harder to get everything right.
>
> We have a small group at the US Army ERDC that has been developing
> a Python FEM framework for multiscale/multiphysics modeling that
> tries to break the coupling between the physics and the FEM methods
> (i.e. we try to support CG, DG, and non-conforming elements with
> variable order and element topologies running from the same
> description of the physics). The code uses our own wrappers to PETSc
> and MPI, but only because we started before mpi4py and petsc4py
> existed. We have a memory-intensive physics API based on computing
> and storing quadrature point values of PDE coefficients and
> numerical fluxes, and residual and Jacobian assembly processes
> along the lines of Matt's first approach of storing element
> residuals and Jacobians. We also have handwritten "optimized" codes
> that collapse all the loops and eliminate storage for direct
> residual and Jacobian assembly. We did that last step out of
> necessity and as a precursor to working on a code generation
> approach (it's what we want the code generator to write). It also
> has the side effect of providing a low-level API for people who want
> to write traditional monolithic Fortran codes for residual and
> Jacobian assembly.
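For illustration only, the two assembly styles mentioned above differ
roughly like this (hypothetical names and signatures, not the actual API
of that framework):

  /* Style 1: element residuals are computed and stored first (memory
     heavy), then accumulated into the global residual in a second pass. */
  void assemble_stored(int ne, int ndof, const int *edof,
                       const double *elemR /* ne*ndof, precomputed */,
                       double *R)
  {
    for (int e = 0; e < ne; e++)
      for (int i = 0; i < ndof; i++)
        R[edof[e*ndof + i]] += elemR[e*ndof + i];
  }

  /* Style 2: "collapsed" direct assembly; compute and accumulate in one
     pass so the per-element storage never exists. */
  void assemble_direct(int ne, int ndof, const int *edof,
                       void (*element_residual)(int e, double *re),
                       double *R)
  {
    for (int e = 0; e < ne; e++) {
      double re[64];                 /* assumes ndof <= 64 */
      element_residual(e, re);       /* pointwise physics kernel */
      for (int i = 0; i < ndof; i++)
        R[edof[e*ndof + i]] += re[i];
    }
  }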
>
> You should also look at Lisandro's fluids code.
>
> Our approach for the "user", or more correctly the "physics
> developer", is 1) prototype the physics in Python, 2) "vectorize" the
> physics by rewriting it in C/C++/Fortran using swig, f2py, or hand-
> written wrappers, and 3) take the inner loop of 2 and build
> monolithic residual and Jacobian routines. This is obviously not
> satisfactory and not maintainable in the long run, but we are solving
> "real" nonlinear multi-physics problems in 3D with it (e.g.
> incompressible two-phase flow, two-phase flow in porous media,
> shallow water equations, and a few other models). Recent profiles
> have put about 95% of the compute time inside KSPSolve, but we've
> got a lot of work to do on better preconditioning and tolerance
> selection before that number provides reliable guidance. Anyway, if
> you guys go forward with Barry's suggestion I'd be interested in
> seeing if there's some way to work out some inter-agency
> collaboration in addition to the good academic collaboration that
> you already have.
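As a made-up illustration of the "vectorize" step 2 above: the
prototype's per-point Python coefficient evaluation becomes a C routine
that runs over whole arrays of quadrature-point values at once (the
coefficient names and the advection-diffusion example are invented for
the sketch):

  /* Hypothetical "vectorized" coefficient kernel for a scalar
     advection-diffusion term: given solution values and gradients at all
     quadrature points, fill the coefficient arrays that a separate
     assembly loop will consume. */
  void evaluate_coefficients(int nPoints,
                             const double *u,      /* [nPoints] */
                             const double *grad_u, /* [nPoints*3] */
                             double *m,            /* [nPoints] mass term */
                             double *f,            /* [nPoints*3] advective flux */
                             double *a)            /* [nPoints*3] diffusive flux */
  {
    const double velocity[3] = {1.0, 0.0, 0.0};    /* illustrative data */
    const double diffusion   = 0.01;
    for (int k = 0; k < nPoints; k++) {
      m[k] = u[k];
      for (int d = 0; d < 3; d++) {
        f[k*3 + d] = velocity[d] * u[k];
        a[k*3 + d] = -diffusion * grad_u[k*3 + d];
      }
    }
  }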
>
> I would say an improvement to this would be CUDA kernels for your
> inner loops.
>
> Matt
>
> Chris
>
>
> On Dec 6, 2009, at 9:43 AM, Jed Brown wrote:
>
> On Sat, 5 Dec 2009 14:23:45 -0600, Matthew Knepley
> <knepley at gmail.com> wrote:
> We do impose some of our process on users already, such as CHKERRQ and
> PetscMalloc.
>
> CHKERRQ is still optional; it's just recommended in order to get a
> decent trace. It can't be used, for example, for top-level calls
> within my VisIt plugin because those functions necessarily return
> something else (I have an alternative which checks the return code,
> generates debugging messages, possibly tries to deallocate local
> resources, and throws the appropriate exception). In C99 projects, I
> use an alternate macro that uses __func__ instead of __FUNCT__.
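Roughly what such a C99 macro could look like; this is a simplified
stand-in that only prints a traceback line and propagates the error
code, whereas the real CHKERRQ calls into PETSc's error handler:

  #include <stdio.h>

  /* Simplified error-check macro in the spirit of CHKERRQ, using the
     standard C99 __func__ instead of PETSc's __FUNCT__ string. */
  #define CHK(ierr) do {                                        \
      int _e = (ierr);                                          \
      if (_e) {                                                 \
        fprintf(stderr, "Error %d in %s() at %s:%d\n",          \
                _e, __func__, __FILE__, __LINE__);              \
        return _e;                                              \
      }                                                         \
    } while (0)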
>
> There is PetscMallocSet if the user doesn't want to use PetscMalloc,
> but
> users are rarely required to directly interact with
> PetscMalloc/PetscFree anyway since most arrays are copied or managed
> by
> an object.
>
> Jed
>
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener