[petsc-users] Advice on OpenMP/PETSc mix

Mohammad Mirzadeh mirzadeh at gmail.com
Fri Apr 20 16:39:22 CDT 2012


Barry,

That's quite smart and I like it. Aside from the disadvantage you
mentioned (which, to be honest, is not quite straightforward, but is
doable), I have the following question.

When I do computations on, say, the (rank % 16 == 0) processes, do they have
access to the whole 32 GB of memory, or are they still bound by the 2 GB/core?
... Or, more broadly, am I mistaken to assume that each core only
has access to 2 GB?

Thanks for your support.

On Fri, Apr 20, 2012 at 2:25 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
>   Mohammad,
>
>     Short term for what you can do NOW.
>
>     PETSc wants to have one MPI process per core, so start up the program
> that way (say there are 16 cores per node). In the block of code that does
> your "non-MPI stuff" do
>
>     if ((rank % 16) == 0) {
>          /* this code runs on only one MPI process per node: build your
>             mesh/grid data structure and process it, use OpenMP pragmas or
>             whatever to parallelize that computation, hold the big data
>             structure here */
>     }
>     Have the (rank % 16) == 0 MPI processes send each (rank % 16) == j
> MPI process the part of the grid information it needs.
>     Have the (rank % 16) == 0 MPI processes delete the global big data
> structure they built.
>
>     The rest of the program runs as a regular MPI PETSc program.
>
>     The advantage of this approach is that 1) it will run today and 2) it
> doesn't depend on any fragile OS features or software. The disadvantage is
> that you need to figure out what part of the grid data each process needs
> and ship it from the (rank % 16) == 0 MPI processes.
>
>
>    Barry
>
>
>
>
>
> On Apr 20, 2012, at 1:31 PM, Mohammad Mirzadeh wrote:
>
> > Hi guys,
> >
> > I have seen multiple emails regarding this in the mailing list, and I'm
> > afraid you might have already answered this question, but I'm not quite
> > sure!
> >
> > I have objects in my code that are hard(er) to parallelize using MPI,
> > and so far my strategy has been to just handle them in serial, such that
> > each process has a copy of the whole thing. This object is related to my
> > grid generation/information etc., so it only needs to be done once at the
> > beginning (no moving mesh for NOW). As a result I do not care much about
> > the speed, since it's nothing compared to the overall solution time.
> > However, I do care about the memory that this object consumes, since it
> > can limit my problem size.
> >
> > So I had the following idea the other day. Is it possible/a good idea to
> > parallelize the grid generation using OpenMP so that each node (as
> > opposed to each core) would share the data structure? This could save me
> > a lot, since memory on a node is shared among its cores (e.g. 32 GB/node
> > vs. 2 GB/core on Ranger). What I'm not quite sure about is how the job
> > is scheduled when running the code via mpirun -n Np. Should Np be the
> > total number of cores or the total number of nodes?
> >
> > If I use, say, Np = 16 processes on one node, MPI runs 16 instances of
> > the code on a single node (which has 16 cores). How does OpenMP figure
> > out how to fork? Does it fork a total of 16 threads/MPI process = 256
> > threads, or is it smart enough to just fork a total of 16 threads/node =
> > 1 thread/core = 16 threads? I'm a bit confused about how the job is
> > scheduled when MPI and OpenMP are mixed.
> >
> > Do I make any sense at all?!
> >
> > Thanks
>
>
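On the thread-forking question in the quoted message above: OpenMP knows
nothing about MPI. Each MPI process forks its own team of OMP_NUM_THREADS
threads (often defaulting to the number of cores it sees), so 16 ranks per
node can silently oversubscribe to 256 threads unless the count is set
explicitly. A hedged sketch of the two launch styles; the mpirun options
are illustrative and vary by MPI implementation and site:

```shell
# Flat MPI: one rank per core, so each rank should fork no extra threads.
export OMP_NUM_THREADS=1
# mpirun -np 16 ./app        # 16 ranks x 1 thread = 16 threads/node

# Hybrid: one rank per node, OpenMP threads filling the 16 cores.
export OMP_NUM_THREADS=16
# mpirun -np 1 ./app         # 1 rank x 16 threads = 16 threads/node
#                            # (use your launcher's one-rank-per-node
#                            # placement flag on multi-node runs)

echo "threads per rank: $OMP_NUM_THREADS"
```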

