[mpich-discuss] Hydra: Sorting node list by number of CPUs

Yauheni Zelenko zelenko at cadence.com
Thu Nov 4 20:17:29 CDT 2010


Hi!

I implemented a simple node list sort, and it appears to work as intended with LSF.

The implementation is based on 1.3. I didn't add command-line option processing, because Hydra appears to lack support for command-line options that take no parameter. There are also some debug prints in HYD_sort_node_list().

Please review my code and include it in the trunk version with any necessary modifications.

There is also a minor memory leak:

==10665== 5 bytes in 1 blocks are definitely lost in loss record 2 of 3
==10665==    at 0x48C0C4A: malloc (vg_replace_malloc.c:236)
==10665==    by 0x3A48DF: strdup (in /lib/tls/libc-2.3.4.so)
==10665==    by 0x80679CF: HYDT_bind_init (bind.c:40)
==10665==    by 0x804FBE3: HYD_pmci_launch_procs (pmiserv_pmci.c:293)
==10665==    by 0x804AFDB: main (mpiexec.c:344)

Eugene.

/* qsort() comparator: orders nodes by ascending core_count. */
static int node_list_compare(const void *_a, const void *_b)
{
    const struct HYD_node *a = *((const struct HYD_node *const *) _a);
    const struct HYD_node *b = *((const struct HYD_node *const *) _b);

    return ((a->core_count > b->core_count) ? 1 : ((a->core_count == b->core_count) ? 0 : -1));
}

static HYD_status HYD_sort_node_list(void)
{
    HYD_status status = HYD_SUCCESS;
    struct HYD_node *node;
    int node_count = 0;
    struct HYD_node **nodes;
    int i;

    /* Debug print */
    printf("Before sorting\n");
    for (node = HYD_handle.node_list; node; node = node->next)
        printf("%s\t%d\n", node->hostname, node->core_count);

    /* Count the nodes so the pointer array can be sized */
    for (node = HYD_handle.node_list; node; node = node->next)
        ++node_count;

    HYDU_MALLOC(nodes, struct HYD_node **, (node_count * sizeof(struct HYD_node *)), status);

    /* Copy the node pointers into the array */
    i = 0;
    for (node = HYD_handle.node_list; node; node = node->next) {
        nodes[i] = node;
        ++i;
    }

    /* Sort the pointers by ascending core_count */
    qsort(nodes, node_count, sizeof(struct HYD_node *), node_list_compare);

    /* Relink by prepending each node in ascending order, so the
     * resulting list starts with the node that has the most cores */
    HYD_handle.node_list = NULL;
    for (i = 0; i < node_count; ++i) {
        nodes[i]->next = HYD_handle.node_list;
        HYD_handle.node_list = nodes[i];
    }

    /* Debug print */
    printf("After sorting\n");
    for (node = HYD_handle.node_list; node; node = node->next)
        printf("%s\t%d\n", node->hostname, node->core_count);

    HYDU_FREE(nodes);

  fn_exit:
    return status;

  fn_fail:
    goto fn_exit;
}

int main(int argc, char **argv)
{

    ...

    if (HYD_handle.node_list == NULL) {
        /* Node list is not created yet. The user might not have
         * provided the host file. Query the RMK. */
        status = HYDT_rmki_query_node_list(&HYD_handle.node_list);
        HYDU_ERR_POP(status, "unable to query the RMK for a node list\n");

        if (HYD_handle.node_list == NULL) {
            /* didn't get anything from the RMK; try the bootstrap server */
            status = HYDT_bsci_query_node_list(&HYD_handle.node_list);
            HYDU_ERR_POP(status, "bootstrap returned error while querying node list\n");
        }

        if (HYD_handle.node_list == NULL) {
            /* The RMK and bootstrap didn't give us anything back; use localhost */
            status = HYDU_add_to_node_list("localhost", 1, &HYD_handle.node_list);
            HYDU_ERR_POP(status, "unable to add to node list\n");
        }
    }

    /* Reset the host list to use only the number of processes per
     * node as specified by the ppn option. */
    if (HYD_handle.ppn != -1)
        for (node = HYD_handle.node_list; node; node = node->next)
            node->core_count = HYD_handle.ppn;

    HYD_handle.global_core_count = 0;

    status = HYD_sort_node_list();
    HYDU_ERR_POP(status, "unable to sort node list\n");
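
For anyone who wants to try the sorting step outside of Hydra, here is a minimal standalone sketch of the same technique: copy the list's node pointers into an array, qsort() them in ascending order, then relink by prepending, which leaves the node with the most cores at the head. The names struct node, compare_nodes and sort_node_list are hypothetical stand-ins for illustration only, not Hydra's types or API, and plain malloc/free replace the HYDU_MALLOC/HYDU_FREE macros.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for Hydra's struct HYD_node; only the
 * fields the sort needs are modelled here. */
struct node {
    const char *hostname;
    int core_count;
    struct node *next;
};

/* qsort() passes pointers to the array elements, and each element is
 * itself a node pointer. Orders by ascending core_count. */
static int compare_nodes(const void *pa, const void *pb)
{
    const struct node *a = *(const struct node *const *) pa;
    const struct node *b = *(const struct node *const *) pb;

    return (a->core_count > b->core_count) - (a->core_count < b->core_count);
}

/* Sorts the list in place so that the node with the largest core_count
 * comes first. Returns 0 on success, -1 if the array cannot be allocated. */
static int sort_node_list(struct node **head)
{
    struct node *n, **nodes;
    size_t count = 0, i;

    assert(head != NULL);

    for (n = *head; n; n = n->next)
        ++count;
    if (count < 2)
        return 0;               /* nothing to reorder */

    nodes = malloc(count * sizeof(*nodes));
    if (nodes == NULL)
        return -1;

    i = 0;
    for (n = *head; n; n = n->next)
        nodes[i++] = n;

    qsort(nodes, count, sizeof(*nodes), compare_nodes);

    /* Prepending in ascending order yields a descending list. */
    *head = NULL;
    for (i = 0; i < count; ++i) {
        nodes[i]->next = *head;
        *head = nodes[i];
    }

    free(nodes);
    return 0;
}
```

Note that qsort() is not guaranteed to be stable, so nodes with equal core counts may end up in a different relative order than in the original host list.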

________________________________________
From: mpich-discuss-bounces at mcs.anl.gov [mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Yauheni Zelenko [zelenko at cadence.com]
Sent: Wednesday, October 20, 2010 6:12 PM
To: Pavan Balaji; mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Hydra: Sorting node list by number of CPUs

From: Pavan Balaji [balaji at mcs.anl.gov]
Sent: Wednesday, October 20, 2010 6:11 PM
To: mpich-discuss at mcs.anl.gov
Cc: Yauheni Zelenko
Subject: Re: [mpich-discuss] Hydra: Sorting node list by number of CPUs

>> Unfortunately, I don't have time to do this myself in the next few days,
>> so maybe somebody else will be interested in implementing it, so that
>> it can be included in the 1.3 release?

> This is too intrusive to go into the 1.3 release (we don't want to break
> something by rushing a feature in). We are wrapping up stuff to push out
> 1.3 as soon as possible. However, I can give you a patch as soon as 1.3
> is released that'll provide this feature.

>  -- Pavan

That would be great! Thank you!

Eugene.