[mpich-discuss] Crash starting job with particular machine list
Dave Goodell
goodell at mcs.anl.gov
Thu Aug 12 09:11:49 CDT 2010
Fixed in the trunk: https://trac.mcs.anl.gov/projects/mpich2/changeset/7052
Thanks for the bug report.
-Dave
On Aug 11, 2010, at 2:42 PM CDT, Dave Goodell wrote:
> On Aug 11, 2010, at 1:28 PM CDT, Steve Krueger wrote:
>
>> I'm using the SOCK driver in a 1.2.1 build of mpich2 and the
>> mpd daemon on a Linux X86-64 grid.
>>
>> This problem seems to be fixed in 1.3, although I couldn't find
>> a fix similar to the one I'm using.
>
> Hmm... Between 1.2.1-->1.3 the default process manager changed from MPD to Hydra. I suspect that MPD simply gave up on calculating a mapping for an irregular layout like (A,B,B), triggering the code path you describe below. Hydra, on the other hand, probably actually calculates a valid mapping which avoids this problem entirely.
>
>> The problem is in mpid_vc.c. This line:
>>
>> mpi_errno = populate_ids_from_mapping(value, &num_nodes, pg, &did_map);
> [...]
>> Setting it back to 0 before the loop seems to solve the problem.
>>
>>
>>     g_num_nodes = 0; /* NEW LINE */
>>
>>     for (i = 0; i < pg->size; ++i)
>>     {
>>         if (i == our_pg_rank)
>>         ...
>>
>> Please advise if this is an appropriate fix.
>
> I'll take a closer look at the code in question (it's been a little while since I've looked at it) and let you know if there's a better fix. Thanks for letting us know about the problem.
>
> -Dave
>
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
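
For readers following the thread, the kind of failure discussed above (a node counter left non-zero by an abandoned mapping attempt and then reused by a fallback loop) can be sketched in isolation. This is only an illustration under stated assumptions, not the code in mpid_vc.c: try_mapping, fallback_assign, run_demo, and the hostname-comparison logic are all hypothetical names and behavior, and only g_num_nodes and the idea of resetting it before the loop come from the report.

```c
#include <string.h>

/* Stand-in for the shared counter named in the report. */
static int g_num_nodes = 0;

/* Hypothetical first pass: pretends to start computing a mapping,
 * bumps the counter, then gives up without undoing its partial work. */
static void try_mapping(int *did_map)
{
    g_num_nodes = 2;  /* partial progress left behind */
    *did_map = 0;     /* mapping abandoned */
}

/* Hypothetical fallback: give each rank a node id by comparing hostnames.
 * Ranks on the same host share an id; a new host takes the next counter value.
 * When reset_first is non-zero, the counter is cleared before the loop,
 * which mirrors the one-line fix proposed in the thread. */
static int fallback_assign(const char *hosts[], int size, int node_ids[],
                           int reset_first)
{
    int i, j;

    if (reset_first)
        g_num_nodes = 0;

    for (i = 0; i < size; ++i) {
        node_ids[i] = -1;
        for (j = 0; j < i; ++j) {
            if (strcmp(hosts[i], hosts[j]) == 0) {
                node_ids[i] = node_ids[j];  /* host already seen */
                break;
            }
        }
        if (node_ids[i] < 0)
            node_ids[i] = g_num_nodes++;    /* first rank on this host */
    }
    return g_num_nodes;
}

/* Runs the abandoned mapping followed by the fallback on an (A,B,B) layout.
 * Returns the final node count and fills ids[0..2]. */
static int run_demo(int reset_first, int ids[3])
{
    const char *hosts[] = { "A", "B", "B" };
    int did_map;

    try_mapping(&did_map);
    return fallback_assign(hosts, 3, ids, reset_first);
}
```

In this sketch, run_demo(0, ids) reports 4 nodes with ids 2, 3, 3 because the fallback starts counting from the stale value, while run_demo(1, ids) reports 2 nodes with ids 0, 1, 1. Whether the real code fails in exactly this way is not established by the thread; the changeset linked at the top is the authoritative fix.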