[petsc-users] Advice on OpenMP/PETSc mix

Mohammad Mirzadeh mirzadeh at gmail.com
Fri Apr 20 17:18:23 CDT 2012


Thanks Jed; I did not know I could access the whole memory.

Mohammad

On Fri, Apr 20, 2012 at 3:05 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:

> You have access to all the memory, though it may help to distribute it
> carefully, for example by interleaving across NUMA regions to avoid
> filling up a single memory bus with the shared data structure that would
> later displace the smaller independent data structures that will be
> accessed in parallel. You can use libnuma on Linux for this; other
> operating systems do not currently provide decent ways to do it.
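>
> A minimal sketch of this kind of interleaved allocation (assuming Linux
> with libnuma available and -lnuma at link time; the size and variable
> names here are just illustrative, not taken from this thread):
>
>     #include <numa.h>    /* numa_available, numa_alloc_interleaved, numa_free */
>     #include <stdlib.h>
>
>     int main(void)
>     {
>       size_t  nbytes = (size_t)1 << 30;         /* e.g. a 1 GB shared grid table */
>       double *grid;
>
>       if (numa_available() < 0) {               /* no NUMA support: fall back to malloc */
>         grid = malloc(nbytes);
>       } else {
>         grid = numa_alloc_interleaved(nbytes);  /* pages spread round-robin across NUMA nodes */
>       }
>       if (!grid) return 1;
>
>       /* ... build and use the shared data structure ... */
>
>       if (numa_available() < 0) free(grid);
>       else                      numa_free(grid, nbytes);
>       return 0;
>     }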
>
> Write the code without worrying too much about this first.
> On Apr 20, 2012 4:39 PM, "Mohammad Mirzadeh" <mirzadeh at gmail.com> wrote:
>
>> Barry,
>>
>> That's quite smart and I like it. Aside from the disadvantage that you
>> mentioned (which, to be honest, is not quite straightforward, but doable),
>> I have the following question.
>>
>> When I do computations on, say, the (rank % 16 == 0) processes, do they
>> have access to the whole 32 GB of memory, or are they still bound by the
>> 2 GB/core? Or, more broadly, am I mistaken to assume that each core only
>> has access to 2 GB?
>>
>> Thanks for your support.
>>
>> On Fri, Apr 20, 2012 at 2:25 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>>>
>>>   Mohammad,
>>>
>>>     Short term for what you can do NOW.
>>>
>>>     PETSc wants to have one MPI process per core, so start up the
>>> program that way (say there are 16 cores per node). In the block of code
>>> that does your "non-MPI stuff" do
>>>
>>>     if ((rank % 16) == 0) {
>>>          /* this code runs on only one MPI process per node: build your
>>>             mesh/grid data structure and process it, using OpenMP pragmas
>>>             or whatever to parallelize that computation; this is the one
>>>             big data structure */
>>>     }
>>>
>>>     Have each (rank % 16) == 0 MPI process send every (rank % 16) == j
>>> MPI process on its node the part of the grid information that process
>>> needs.
>>>     Have the (rank % 16) == 0 MPI processes delete the big global data
>>> structure they built.
>>>
>>>     The rest of the program runs as a regular MPI PETSc program.
>>>
>>>     The advantages of this approach are that 1) it will run today and
>>> 2) it doesn't depend on any fragile OS features or software. The
>>> disadvantage is that you need to figure out what part of the grid data
>>> each process needs and ship it from the (rank % 16) == 0 MPI processes.
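>>>
>>>     A minimal sketch of what that shipping might look like (assuming 16
>>> cores per node and a flat array of doubles standing in for the grid data;
>>> the rank/16 split and MPI_Scatterv are just one way to do it):
>>>
>>>     #include <mpi.h>
>>>     #include <stdlib.h>
>>>
>>>     int main(int argc, char **argv)
>>>     {
>>>       MPI_Comm nodecomm;
>>>       int      rank, noderank, nodesize, i;
>>>       double  *globalgrid = NULL, *localgrid;
>>>       int     *counts = NULL, *displs = NULL;
>>>       int      nlocal = 100000;              /* grid entries each process needs */
>>>
>>>       MPI_Init(&argc, &argv);
>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>       /* one communicator per node: ranks 0..15 on node 0, 16..31 on node 1, ... */
>>>       MPI_Comm_split(MPI_COMM_WORLD, rank/16, rank, &nodecomm);
>>>       MPI_Comm_rank(nodecomm, &noderank);
>>>       MPI_Comm_size(nodecomm, &nodesize);
>>>
>>>       if (noderank == 0) {                   /* same as (rank % 16) == 0 */
>>>         /* build the whole grid for this node (OpenMP-threaded if desired) */
>>>         globalgrid = malloc((size_t)nodesize*nlocal*sizeof(double));
>>>         counts     = malloc(nodesize*sizeof(int));
>>>         displs     = malloc(nodesize*sizeof(int));
>>>         for (i = 0; i < nodesize*nlocal; i++) globalgrid[i] = (double)i;
>>>         for (i = 0; i < nodesize; i++) { counts[i] = nlocal; displs[i] = i*nlocal; }
>>>       }
>>>
>>>       /* ship each process the part of the grid information it needs */
>>>       localgrid = malloc(nlocal*sizeof(double));
>>>       MPI_Scatterv(globalgrid, counts, displs, MPI_DOUBLE,
>>>                    localgrid,  nlocal, MPI_DOUBLE, 0, nodecomm);
>>>
>>>       /* delete the big global structure on the node roots */
>>>       if (noderank == 0) { free(globalgrid); free(counts); free(displs); }
>>>
>>>       /* ... the rest runs as a regular MPI PETSc program ... */
>>>
>>>       free(localgrid);
>>>       MPI_Comm_free(&nodecomm);
>>>       MPI_Finalize();
>>>       return 0;
>>>     }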
>>>
>>>
>>>    Barry
>>>
>>>
>>>
>>>
>>>
>>> On Apr 20, 2012, at 1:31 PM, Mohammad Mirzadeh wrote:
>>>
>>> > Hi guys,
>>> >
>>> > I have seen multiple emails regarding this on the mailing list, and I'm
>>> afraid you might have already answered this question, but I'm not quite sure!
>>> >
>>> > I have objects in my code that are hard(er) to parallelize using MPI,
>>> and so far my strategy has been to just handle them in serial so that each
>>> process has a copy of the whole thing. This object is related to my grid
>>> generation/information, etc., so it only needs to be built once at the
>>> beginning (no moving mesh for NOW). As a result I do not care much about
>>> the speed, since it's nothing compared to the overall solution time.
>>> However, I do care about the memory this object consumes, since it can
>>> limit my problem size.
>>> >
>>> > So I had the following idea the other day. Is it possible/a good idea to
>>> parallelize the grid generation using OpenMP so that each node (as opposed
>>> to each core) would share the data structure? This could save me a lot,
>>> since the memory on a node is shared among its cores (e.g. 32 GB/node vs.
>>> 2 GB/core on Ranger). What I'm not quite sure about is how the job is
>>> scheduled when running the code via mpirun -n Np. Should Np be the total
>>> number of cores or the total number of nodes?
>>> >
>>> > If I use, say, Np = 16 processes on one node, MPI runs 16 copies of the
>>> code on a single node (which has 16 cores). How does OpenMP figure out how
>>> to fork? Does it fork a total of 16 threads/MPI process = 256 threads, or
>>> is it smart enough to fork just 16 threads/node = 1 thread/core = 16
>>> threads? I'm a bit confused about how the job is scheduled when MPI and
>>> OpenMP are mixed.
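>>> >
>>> > A tiny hybrid probe would answer this empirically (a sketch, compiled
>>> > with something like mpicc -fopenmp; each MPI rank sizes its own OpenMP
>>> > team from OMP_NUM_THREADS or the cores it sees, independently of the
>>> > other ranks, so 16 ranks with the default setting would indeed
>>> > oversubscribe the node):
>>> >
>>> >     #include <mpi.h>
>>> >     #include <omp.h>
>>> >     #include <stdio.h>
>>> >
>>> >     int main(int argc, char **argv)
>>> >     {
>>> >       int rank, provided;
>>> >
>>> >       /* FUNNELED: only the master thread of each rank makes MPI calls */
>>> >       MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
>>> >       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>> >
>>> >       /* reported per process; the ranks do not coordinate their teams */
>>> >       printf("rank %d would fork %d OpenMP threads\n",
>>> >              rank, omp_get_max_threads());
>>> >
>>> >       MPI_Finalize();
>>> >       return 0;
>>> >     }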
>>> >
>>> > Do I make any sense at all?!
>>> >
>>> > Thanks
>>>
>>>
>>