[mpich-discuss] Crash starting job with particular machine list

Dave Goodell goodell at mcs.anl.gov
Thu Aug 12 09:11:49 CDT 2010


Fixed in the trunk: https://trac.mcs.anl.gov/projects/mpich2/changeset/7052

Thanks for the bug report.

-Dave

On Aug 11, 2010, at 2:42 PM CDT, Dave Goodell wrote:

> On Aug 11, 2010, at 1:28 PM CDT, Steve Krueger wrote:
> 
>> I'm using the SOCK driver in a 1.2.1 build of mpich2 and the
>> mpd daemon on a Linux X86-64 grid.
>> 
>> This problem seems to be fixed in 1.3, although I couldn't find
>> a fix similar to the one I'm using.
> 
> Hmm... Between 1.2.1 and 1.3 the default process manager changed from MPD to Hydra.  I suspect that MPD simply gave up on calculating a mapping for an irregular layout like (A,B,B), triggering the code path you describe below.  Hydra, on the other hand, probably does calculate a valid mapping, which avoids this problem entirely.
> 
>> The problem is in mpid_vc.c. This line:
>> 
>> mpi_errno = populate_ids_from_mapping(value, &num_nodes, pg, &did_map);
> [...]
>> Setting it back to 0 before the loop seems to solve the problem.
>> 
>> 
>>   g_num_nodes = 0; /* NEW LINE */
>> 
>>   for (i = 0; i < pg->size; ++i)
>>   {
>>       if (i == our_pg_rank)
>> ...
>> 
>> Please advise if this is an appropriate fix.
> 
> I'll take a closer look at the code in question (it's been a little while since I've looked at it) and let you know if there's a better fix.  Thanks for letting us know about the problem.
> 
> -Dave
> 
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
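
The quoted report elides the details of the failure ("[...]"), so the following is only a minimal sketch of the kind of bug the proposed fix addresses, not the actual mpid_vc.c code. It assumes that a mapping attempt can give up after already modifying the node counter, so that the hostname-comparison fallback starts counting from a stale value. The helper names simulate_failed_mapping and assign_node_ids, and the reset_first flag, are hypothetical; only g_num_nodes and the (A,B,B) layout come from the thread.

```c
#include <string.h>

/* Stand-in for the node counter named in the thread. */
static int g_num_nodes = 0;

/* Hypothetical: models a mapping attempt that gives up part-way and
 * leaves the counter modified. Returns 0 to mean "no mapping made". */
static int simulate_failed_mapping(int stale_count)
{
    g_num_nodes = stale_count;
    return 0;
}

/* Hypothetical fallback: give each rank a node id by comparing its
 * hostname with those of all lower ranks. Ranks on the same host
 * share an id; a new host takes the next value of g_num_nodes. */
static void assign_node_ids(const char *hosts[], int n, int node_ids[],
                            int reset_first)
{
    int i, j;

    if (reset_first)
        g_num_nodes = 0;    /* the "NEW LINE" proposed in the report */

    for (i = 0; i < n; ++i) {
        for (j = 0; j < i; ++j) {
            if (strcmp(hosts[i], hosts[j]) == 0) {
                node_ids[i] = node_ids[j];   /* host already seen */
                break;
            }
        }
        if (j == i)
            node_ids[i] = g_num_nodes++;     /* first rank on this host */
    }
}
```

Under these assumptions, calling simulate_failed_mapping(2) and then assign_node_ids on the hosts {"A", "B", "B"} without the reset gives node ids 2, 3, 3 and a node count of 4 for a two-node job. With the reset, the ids are 0, 1, 1 and the count is 2.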