[petsc-users] Advice on OpenMP/PETSc mix

Barry Smith bsmith at mcs.anl.gov
Fri Apr 20 16:25:16 CDT 2012


   Mohammad,

     Short term, here is what you can do NOW.

     PETSc wants to have one MPI process per core, so start up the program that way (say there are 16 cores per node). In the block of code that does your "non-MPI stuff" do

     if ((rank % 16) == 0) {  /* this code runs on only one MPI process per node */
          /* build your mesh grid data structure and process it; use OpenMP
             pragmas or whatever to parallelize that computation; have one big
             data structure per node */
     }
     have each (rank % 16) == 0 MPI process send the other MPI processes on its node ((rank % 16) == j, for j = 1..15) the part of the grid information they need.
     have each (rank % 16) == 0 MPI process delete the big global data structure it built.

     The rest of the program runs as a regular PETSc MPI program.

     The advantages of this approach are that 1) it will run today and 2) it doesn't depend on any fragile OS features or software. The disadvantage is that you need to figure out which part of the grid data each process needs and ship it to them from the (rank % 16) == 0 MPI processes.


    Barry


    


On Apr 20, 2012, at 1:31 PM, Mohammad Mirzadeh wrote:

> Hi guys,
> 
> I have seen multiple emails regarding this on the mailing list, and I'm afraid you might have already answered this question, but I'm not quite sure!
> 
> I have objects in my code that are hard(er) to parallelize using MPI, and so far my strategy has been to just handle them in serial so that each process has a copy of the whole thing. This object is related to my grid generation/information etc., so it only needs to be built once at the beginning (no moving mesh for NOW). As a result I do not care much about the speed, since it's nothing compared to the overall solution time. However, I do care about the memory this object consumes, since it can limit my problem size.
> 
> So I had the following idea the other day. Is it possible/a good idea to parallelize the grid generation using OpenMP so that each node (as opposed to each core) would share the data structure? This could save me a lot, since memory on a node is shared among its cores (e.g. 32 GB/node vs 2 GB/core on Ranger). What I'm not quite sure about is how the job is scheduled when running the code via mpirun -n Np. Should Np be the total number of cores or of nodes?
> 
> If I use, say, Np = 16 processes on one node, MPI is running 16 copies of the code on a single node (which has 16 cores). How does OpenMP figure out how to fork? Does it fork a total of 16 threads/MPI process = 256 threads, or is it smart enough to fork a total of 16 threads/node = 1 thread/core = 16 threads? I'm a bit confused about how the job is scheduled when MPI and OpenMP are mixed.
> 
> Do I make any sense at all?! 
> 
> Thanks


