[mpich-discuss] MPI_Comm_spawn(), dynamic distribution and hydra

Roberto Fichera kernel at tekno-soft.it
Thu Mar 3 08:11:01 CST 2011


On 03/03/2011 09:47 AM, Roberto Fichera wrote:
> On 03/02/2011 09:28 PM, Pavan Balaji wrote:
>
> Ciao Pavan,
>
>> Hi Roberto,
>>
>> Hydra doesn't currently have the capability to request more resources from a resource
>> manager. It can only query for the already-allocated resources.
> So that means, at this time, I have to use the resource manager's API directly, right?
>
> What about performing the MPI boot process on freshly allocated nodes? Do you have
> any suggestions on how to do that?

I forgot to say that I would actually handle this problem by asking the resource manager,
via its API, for the nodes. Then I would write the node list into a file, just as I would
normally expect from the qsub command, and set PBS_NODEFILE accordingly. Finally, I would
start the slave application via a .sh file that internally uses the usual mpdboot and
mpdallexit to boot and terminate the MPI processes.
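Something along these lines is what I have in mind; a minimal sketch only, assuming the
node list has already been obtained from the resource manager's API (the file path and
the wrapper script name are placeholders, not real names):

    /* Sketch: publish a hand-made node file and bootstrap the slaves.
     * Assumes 'nodes' was filled in from the resource manager's API. */
    #include <stdio.h>
    #include <stdlib.h>

    static int bootstrap_slaves(const char **nodes, int nnodes)
    {
        const char *nodefile = "/tmp/myjob.nodes";    /* placeholder path */
        FILE *fp = fopen(nodefile, "w");
        if (fp == NULL)
            return -1;

        /* One hostname per line, the same layout qsub produces. */
        for (int i = 0; i < nnodes; i++)
            fprintf(fp, "%s\n", nodes[i]);
        fclose(fp);

        /* Point PBS_NODEFILE at our file ... */
        setenv("PBS_NODEFILE", nodefile, 1);

        /* ... and let the wrapper script run mpdboot / the slave / mpdallexit. */
        return system("./start_slave.sh");            /* placeholder script */
    }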

>> But if you ask Hydra to spawn a process on a particular node (using the info argument), it will do that.
> This is the same approach I used for spawning, via the info argument of MPI_Comm_spawn().
>
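For reference, this is roughly how such a spawn call looks; a minimal sketch, with a
placeholder worker binary and no error handling:

    #include <mpi.h>

    /* Spawn 'nprocs' copies of a worker binary on one given host. */
    static MPI_Comm spawn_on_host(const char *host, int nprocs)
    {
        MPI_Info info;
        MPI_Comm intercomm;

        MPI_Info_create(&info);
        /* The "host" key tells the process manager where to place the children. */
        MPI_Info_set(info, "host", (char *)host);

        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, nprocs, info,
                       0 /* root */, MPI_COMM_SELF, &intercomm,
                       MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);
        return intercomm;
    }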
>>  -- Pavan
>>
>> On 03/02/2011 05:55 AM, Roberto Fichera wrote:
>>> Hi All,
>>>
>>> I made a parallelization library on top of the MPICH2 library that basically
>>> performs dynamic job distribution among all the assigned nodes. The library
>>> makes heavy use of MPI_Comm_spawn() on an MPICH2 library configured with
>>> MPI_THREAD_MULTIPLE. This works quite well, and we can easily differentiate
>>> the spawned jobs' algorithms once the data is decomposed. Furthermore, the library
>>> permits the user to assign more than one node to any given job. This ends up
>>> "electing" the given slave node as the master of a node subset (submastering),
>>> which performs more fine-grained parallel distribution as per the user's algorithm,
>>> so the typical distribution scenario looks like a tree whose inner nodes are
>>> submasters. We discovered that for certain algorithms we might need to decide
>>> dynamically the number of nodes to dedicate to any given submastering parallel
>>> computation, so I started to investigate whether it would be possible to build an
>>> interface between the cluster resource manager and the library we use.
>>>
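As an aside, the MPI_THREAD_MULTIPLE requirement above boils down to an initialization
like the following (a minimal sketch, not our actual library code):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* Ask for full thread support and verify the library grants it. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "need MPI_THREAD_MULTIPLE, got level %d\n", provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        /* ... spawn submasters and distribute jobs ... */
        MPI_Finalize();
        return 0;
    }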
>>> So my idea is that, under certain conditions, before calling MPI_Comm_spawn() to
>>> spawn a job on a well-defined hostname, I would like to reserve, via the resource
>>> manager, the number of nodes requested by the given submastering job. My first
>>> thought was to build an interface against PBS (our cluster manager/scheduler) using
>>> its libpbs, basically implementing something like a qsub/mpiexec function call, but
>>> I came to the conclusion that, for implementing the dynamic bootstrap process on the
>>> newly allocated nodes, Hydra would be much more flexible, since it seems to provide
>>> a more abstract interface for doing such an integration.
>>>
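The libpbs direction mentioned above would look roughly like the sketch below. This is
an assumption on my side: it uses the TORQUE/PBS IFL calls pbs_connect()/pbs_submit()
with an empty attribute list, and the job script path is a placeholder:

    #include <stddef.h>
    #include <pbs_ifl.h>

    /* Submit a reservation job and return its id (caller frees), or NULL. */
    static char *reserve_nodes(const char *server, const char *script)
    {
        int conn = pbs_connect((char *)server);   /* NULL means default server */
        if (conn < 0)
            return NULL;

        /* A real call would request the node count through an attropl
         * chain (e.g. Resource_List.nodes); none is attached here. */
        char *jobid = pbs_submit(conn, NULL, (char *)script, NULL, NULL);

        pbs_disconnect(conn);
        return jobid;
    }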
>>> So my question is: does anyone know a better way to do this? Might it create
>>> problems in some way that I don't see?
>>>
>>> Any suggestion will be really appreciated.
>>>
>>> Thanks in advance.
>>> Roberto Fichera.