[mpich-discuss] Crash starting job with particular machine list
Dave Goodell
goodell at mcs.anl.gov
Wed Aug 11 14:42:21 CDT 2010
On Aug 11, 2010, at 1:28 PM CDT, Steve Krueger wrote:
> I'm using the SOCK driver in a 1.2.1 build of mpich2 and the
> mpd daemon on a Linux X86-64 grid.
>
> This problem seems to be fixed in 1.3, although I couldn't find
> a fix similar to the one I'm using.
Hmm... Between 1.2.1 and 1.3 the default process manager changed from MPD to Hydra. I suspect that MPD simply gave up on calculating a mapping for an irregular layout like (A,B,B), triggering the code path you describe below. Hydra, on the other hand, probably does calculate a valid mapping, so that code path is never reached and the problem does not show up.
> The problem is in mpid_vc.c. This line:
>
> mpi_errno = populate_ids_from_mapping(value, &num_nodes, pg, &did_map);
[...]
> Setting it back to 0 before the loop seems to solve the problem.
>
>
> g_num_nodes = 0; /* NEW LINE */
>
> for (i = 0; i < pg->size; ++i)
> {
>     if (i == our_pg_rank)
>     ...
>
> Please advise if this is an appropriate fix.
I'll take a closer look at the code in question (it's been a little while since I've looked at it) and let you know if there's a better fix. Thanks for letting us know about the problem.
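For reference, the failure mode as I understand it from the description can be sketched with a toy program. This is not the mpid_vc.c code: try_mapping(), assign_node_ids(), and the reset_first flag are hypothetical stand-ins, and the sketch assumes that the failed mapping attempt leaves a nonzero count behind, which the report implies but the quoted excerpt does not show.

```c
#include <assert.h>
#include <string.h>

#define MAX_PROCS 8

/* Stand-in for the global node counter named in the report. */
static int g_num_nodes = 0;

/* Hypothetical: a mapping attempt that gives up part-way.
 * ASSUMPTION: it leaves a partial count in *num_nodes and
 * reports failure through *did_map. */
static void try_mapping(int *num_nodes, int *did_map)
{
    *num_nodes = 1;   /* partial progress before giving up */
    *did_map = 0;     /* mapping was not completed */
}

/* Assign one node id per distinct hostname and return the node count.
 * If reset_first is nonzero, the counter is cleared before the
 * fallback loop, which corresponds to the proposed one-line fix. */
static int assign_node_ids(const char *hosts[], int nprocs,
                           int node_id[], int reset_first)
{
    int did_map;
    int i, j;

    assert(nprocs <= MAX_PROCS);

    try_mapping(&g_num_nodes, &did_map);
    if (did_map)
        return g_num_nodes;

    if (reset_first)
        g_num_nodes = 0;   /* discard the stale partial count */

    /* Fallback: compare hostnames pairwise. */
    for (i = 0; i < nprocs; ++i) {
        for (j = 0; j < i; ++j) {
            if (strcmp(hosts[i], hosts[j]) == 0) {
                node_id[i] = node_id[j];   /* host already seen */
                break;
            }
        }
        if (j == i)
            node_id[i] = g_num_nodes++;    /* first time we see this host */
    }
    return g_num_nodes;
}
```

With hosts (A,B,B) and reset_first set to 0, the function returns 3 nodes and ids 1,2,2, so the count disagrees with the two hosts actually present. With reset_first set to 1 it returns 2 nodes and ids 0,1,1.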
-Dave