[mpich-discuss] Hydra: Sorting node list by number of CPUs

Yauheni Zelenko zelenko at cadence.com
Wed Nov 10 12:05:31 CST 2010


Hi, Pavan!

Somehow I got "Submission rejected as potential spam (You must enter a valid email address or username. Please hit back on your browser and try again.)" when trying to add my e-mail to the CC list on the ticket.

Eugene.
________________________________________
From: Pavan Balaji [balaji at mcs.anl.gov]
Sent: Tuesday, November 09, 2010 8:23 PM
To: Yauheni Zelenko
Cc: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Hydra: Sorting node list by number of CPUs

Hmm.. Unfortunately, this is not as straightforward as I initially
thought. The current bootstrap interface semantics dictate that if we
query the bootstrap for the node list, it can assume that the same
list is being passed back to it.

I can restructure the code to not rely on such semantics, but then it's
no longer a simple sort of the node list.

I've created a ticket for it:
https://trac.mcs.anl.gov/projects/mpich2/ticket/1136

Please add yourself to the cc list if you wish to receive notifications
on the progress for the ticket.

Thanks,

  -- Pavan

On 11/04/2010 08:17 PM, Yauheni Zelenko wrote:
> Hi!
>
> I implemented simple node list sorting, and it appears to work as intended for LSF.
>
> The implementation is based on 1.3. I didn't add command-line option processing, because Hydra seems to lack support for command-line options that take no parameter. There are also some debug prints in HYD_sort_node_list().
>
> Please review my code and include it in the trunk version with any necessary modifications.
>
> There is also a minor memory leak:
>
> ==10665== 5 bytes in 1 blocks are definitely lost in loss record 2 of 3
> ==10665==    at 0x48C0C4A: malloc (vg_replace_malloc.c:236)
> ==10665==    by 0x3A48DF: strdup (in /lib/tls/libc-2.3.4.so)
> ==10665==    by 0x80679CF: HYDT_bind_init (bind.c:40)
> ==10665==    by 0x804FBE3: HYD_pmci_launch_procs (pmiserv_pmci.c:293)
> ==10665==    by 0x804AFDB: main (mpiexec.c:344)
>
> Eugene.
>
> static int node_list_compare(const void *_a, const void *_b)
> {
>      const struct HYD_node *a = *((struct HYD_node**) _a);
>      const struct HYD_node *b = *((struct HYD_node**) _b);
>
>      return ((a->core_count > b->core_count) ? 1 : ((a->core_count == b->core_count) ? 0 : -1));
> }
>
> static HYD_status HYD_sort_node_list(void)
> {
>      HYD_status status = HYD_SUCCESS;
>      struct HYD_node *node;
>      int node_count = 0;
>      struct HYD_node **nodes;
>      int i;
>
>      printf("Before sorting\n");
>      for (node = HYD_handle.node_list; node; node = node->next)
>       printf("%s\t%d\n", node->hostname, node->core_count);
>
>      for (node = HYD_handle.node_list; node; node = node->next)
>          ++node_count;
>
>      HYDU_MALLOC(nodes, struct HYD_node**, (node_count * sizeof(struct HYD_node*)), status);
>
>      i = 0;
>      for (node = HYD_handle.node_list; node; node = node->next)
>       {
>           nodes[i] = node;
>           ++i;
>       }
>
>      qsort(nodes, node_count, sizeof(struct HYD_node*), node_list_compare);
>
>      HYD_handle.node_list = NULL;
>      for (i = 0; i < node_count; ++i)
>       {
>           nodes[i]->next = HYD_handle.node_list;
>           HYD_handle.node_list = nodes[i];
>       }
>
>      printf("After sorting\n");
>      for (node = HYD_handle.node_list; node; node = node->next)
>       printf("%s\t%d\n", node->hostname, node->core_count);
>
>      HYDU_FREE(nodes);
>
>    fn_exit:
>      return status;
>
>    fn_fail:
>      goto fn_exit;
> }
>
> int main(int argc, char **argv)
> {
>
>      ...
>
>      if (HYD_handle.node_list == NULL) {
>          /* Node list is not created yet. The user might not have
>           * provided the host file. Query the RMK. */
>          status = HYDT_rmki_query_node_list(&HYD_handle.node_list);
>          HYDU_ERR_POP(status, "unable to query the RMK for a node list\n");
>
>          if (HYD_handle.node_list == NULL) {
>              /* didn't get anything from the RMK; try the bootstrap server */
>              status = HYDT_bsci_query_node_list(&HYD_handle.node_list);
>              HYDU_ERR_POP(status, "bootstrap returned error while querying node list\n");
>          }
>
>          if (HYD_handle.node_list == NULL) {
>              /* The RMK and bootstrap didn't give us anything back; use localhost */
>              status = HYDU_add_to_node_list("localhost", 1, &HYD_handle.node_list);
>              HYDU_ERR_POP(status, "unable to add to node list\n");
>          }
>      }
>
>      /* Reset the host list to use only the number of processes per
>       * node as specified by the ppn option. */
>      if (HYD_handle.ppn != -1)
>          for (node = HYD_handle.node_list; node; node = node->next)
>              node->core_count = HYD_handle.ppn;
>
>      HYD_handle.global_core_count = 0;
>
>      HYD_sort_node_list();
>
> ________________________________________
> From: mpich-discuss-bounces at mcs.anl.gov [mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Yauheni Zelenko [zelenko at cadence.com]
> Sent: Wednesday, October 20, 2010 6:12 PM
> To: Pavan Balaji; mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Hydra: Sorting node list by number of CPUs
>
> From: Pavan Balaji [balaji at mcs.anl.gov]
> Sent: Wednesday, October 20, 2010 6:11 PM
> To: mpich-discuss at mcs.anl.gov
> Cc: Yauheni Zelenko
> Subject: Re: [mpich-discuss] Hydra: Sorting node list by number of CPUs
>
>>> Unfortunately, I don't have time to do this myself in the next few days,
>>> so maybe somebody else will be interested in implementing it, so that it
>>> can be included in the 1.3 release?
>
>> This is too intrusive to go into the 1.3 release (we don't want to break
>> something by rushing a feature in). We are wrapping up stuff to push out
>> 1.3 as soon as possible. However, I can give you a patch as soon as 1.3
>> is released that'll provide this feature.
>
>>   -- Pavan
>
> It'll be great! Thank you!
>
> Eugene.
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji
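
For readers who want to try the sorting approach from the quoted patch outside of Hydra, here is a minimal standalone sketch. It assumes a hypothetical struct node that models only the hostname, core_count and next fields used in the patch, and it replaces HYDU_MALLOC/HYDU_FREE and the HYD_handle global with plain malloc/free and an explicit list-head argument. Like the patch, it sorts an array of node pointers in ascending order and then relinks by prepending, so the node with the most cores ends up first. It does not address the bootstrap interface semantics discussed above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for Hydra's struct HYD_node; only the fields
 * used by the quoted patch are modeled here. */
struct node {
    const char *hostname;
    int core_count;
    struct node *next;
};

/* qsort comparator over an array of node pointers: ascending core_count. */
static int node_compare(const void *pa, const void *pb)
{
    const struct node *a = *(const struct node *const *) pa;
    const struct node *b = *(const struct node *const *) pb;

    return (a->core_count > b->core_count) - (a->core_count < b->core_count);
}

/* Reorder the list at *head so the node with the most cores comes first.
 * Returns 0 on success, or -1 if the temporary array cannot be
 * allocated (the list is left untouched in that case). */
static int sort_node_list(struct node **head)
{
    struct node *n, **nodes;
    size_t count = 0, i;

    assert(head != NULL);

    for (n = *head; n; n = n->next)
        ++count;
    if (count < 2)
        return 0;               /* nothing to reorder */

    nodes = malloc(count * sizeof *nodes);
    if (nodes == NULL)
        return -1;

    /* Collect the node pointers into an array that qsort can work on. */
    i = 0;
    for (n = *head; n; n = n->next)
        nodes[i++] = n;

    qsort(nodes, count, sizeof *nodes, node_compare);

    /* Prepending in ascending order leaves the list in descending order. */
    *head = NULL;
    for (i = 0; i < count; ++i) {
        nodes[i]->next = *head;
        *head = nodes[i];
    }

    free(nodes);
    return 0;
}
```

To use it, pass the address of the list head, e.g. sort_node_list(&list), and check the return value. Note that qsort is not guaranteed to be stable, so nodes with equal core counts may come out in any relative order.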

