[mpich-discuss] Crash starting job with particular machine list

Dave Goodell goodell at mcs.anl.gov
Wed Aug 11 14:42:21 CDT 2010


On Aug 11, 2010, at 1:28 PM CDT, Steve Krueger wrote:

> I'm using the SOCK driver in a 1.2.1 build of mpich2 and the
> mpd daemon on a Linux X86-64 grid.
> 
> This problem seems to be fixed in 1.3, although I couldn't find
> a fix similar to the one I'm using.

Hmm... Between 1.2.1 and 1.3 the default process manager changed from MPD to Hydra.  I suspect that MPD simply gave up on calculating a mapping for an irregular layout like (A,B,B), triggering the code path you describe below.  Hydra, on the other hand, probably does calculate a valid mapping, which avoids this problem entirely.
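
To make "a valid mapping" for a layout like (A,B,B) concrete, here is a minimal sketch of assigning node ids to ranks by host name. It is an illustration only, not code from MPD, Hydra, or mpid_vc.c; the function name assign_node_ids and its parameters are assumptions made up for this example.

```c
#include <string.h>

/* Hypothetical helper, not part of MPICH: give every rank a node id so
 * that ranks on the same host share an id.  Ids are handed out in order
 * of first appearance.  Returns the number of distinct nodes found. */
static int assign_node_ids(const char *hosts[], int nranks, int node_ids[])
{
    int num_nodes = 0; /* local counter, so it starts at zero on every call */

    for (int i = 0; i < nranks; ++i) {
        node_ids[i] = -1;

        /* Reuse the id of an earlier rank that runs on the same host. */
        for (int j = 0; j < i; ++j) {
            if (strcmp(hosts[i], hosts[j]) == 0) {
                node_ids[i] = node_ids[j];
                break;
            }
        }

        /* First time this host is seen: allocate the next id. */
        if (node_ids[i] < 0)
            node_ids[i] = num_nodes++;
    }
    return num_nodes;
}
```

Called with the hosts {"A", "B", "B"}, this assigns node ids 0, 1, 1 and reports two nodes, so every id stays below the node count.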

> The problem is in mpid_vc.c. This line:
> 
> mpi_errno = populate_ids_from_mapping(value, &num_nodes, pg, &did_map);
[...]
> Setting it back to 0 before the loop seems to solve the problem.
> 
> 
>    g_num_nodes = 0; /* NEW LINE */
> 
>    for (i = 0; i < pg->size; ++i)
>    {
>        if (i == our_pg_rank)
> ...
> 
> Please advise if this is an appropriate fix.

I'll take a closer look at the code in question (it's been a little while since I've looked at it) and let you know if there's a better fix.  Thanks for letting us know about the problem.
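
For readers following along, the general pattern behind the proposed one-line fix can be sketched as below. This is a generic illustration under assumptions, not the actual contents of mpid_vc.c: the code between the two quoted fragments is not shown in this thread, and every name here other than g_num_nodes is hypothetical.

```c
/* File-scope counter shared by the mapping attempt and the fallback. */
static int g_num_nodes = 0;

/* Hypothetical first attempt: it consumes one node id, then gives up.
 * Returns 0 to signal that no complete mapping was produced. */
static int try_partial_mapping(int nranks, int node_ids[])
{
    if (nranks > 0)
        node_ids[0] = g_num_nodes++; /* counter is left at 1 on failure */
    return 0;
}

/* Hypothetical fallback: treat every rank as its own node.  Without the
 * reset, ids start from whatever the failed attempt left behind. */
static void fallback_one_node_per_rank(int nranks, int node_ids[],
                                       int reset_first)
{
    if (reset_first)
        g_num_nodes = 0; /* corresponds to the proposed NEW LINE */

    for (int i = 0; i < nranks; ++i)
        node_ids[i] = g_num_nodes++;
}
```

With three ranks and no reset, the fallback hands out ids 1, 2, 3 and leaves the counter at 4, so later code that sizes arrays by rank count or expects ids to start at 0 can misbehave. With the reset, the ids are 0, 1, 2 and the counter ends at 3.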

-Dave



