[mpich-discuss] Crash starting job with particular machine list
Steve.Krueger at sas.com
Wed Aug 11 13:28:44 CDT 2010
I'm using the SOCK driver in a 1.2.1 build of mpich2 and the
mpd daemon on a Linux X86-64 grid.
This problem seems to be fixed in 1.3, although I couldn't find
a fix similar to the one I'm using.
Given the simplest MPI program:
This fails when run with a machine list in which the last machine also appears earlier in the list, at any position other than the first.
The smallest case is with 3 machines:
/usr/local/mpich2/bin/mpirun -machinefile mpi.hosts -np 3 ./a.out
where mpi.hosts contains
rank 2 in job 2499 rdgrd0001.unx.sas.com_48578 caused collective abort of all ranks
exit status of rank 2: killed by signal 11
rank 1 in job 2499 rdgrd0001.unx.sas.com_48578 caused collective abort of all ranks
exit status of rank 1: killed by signal 11
The crash occurs under MPI_Init().
SIGSEGV - Segmentation violation.
1* In routine libmpich:MPIU_Strncpy (+0x3B)
2 Called from libmpich:PMIU_getval (+0x86)
3 Called from libmpich:PMI_KVS_Get (+0xA8)
4 Called from libmpich:MPIDI_Populate_vc_node_ids (+0x4DF)
5 Called from libmpich:MPID_Init (+0x1EF)
6 Called from libmpich:MPIR_Init_thread (+0x28F)
7 Called from libmpich:PMPI_Init (+0x91)
Changing the machine list avoids the problem. This runs:
The problem is in mpid_vc.c. This line:
mpi_errno = populate_ids_from_mapping(value, &num_nodes, pg, &did_map);
fails to find a map, so it sets did_map to 0. It does set num_nodes to 1,
though (which may or may not be a separate issue). num_nodes is then assigned
to g_num_nodes, which is used in a loop below.
The loop below fills in node_names[g_num_nodes], but since g_num_nodes starts
at 1 instead of 0, it accesses a bad pointer on the last iteration.
Setting it back to 0 before the loop seems to solve the problem.
g_num_nodes = 0; /* NEW LINE */
for (i = 0; i < pg->size; ++i)
if (i == our_pg_rank)
Please advise if this is an appropriate fix.