[mpich-discuss] Crash starting job with particular machine list

Wed Aug 11 13:28:44 CDT 2010

I'm using the SOCK driver in a 1.2.1 build of mpich2 and the
mpd daemon on a Linux X86-64 grid.

This problem seems to be fixed in 1.3, although I couldn't find
a fix similar to the one I'm using.

Given the simplest MPI program: 

#include "mpi.h"

int main(void)
{
    MPI_Init(0, 0);   /* the crash happens inside this call */
    MPI_Finalize();
    return 0;
}

This fails when run with a machine list in which the last machine also appears
earlier in the list, in any position except the first.

The smallest case is with 3 machines:

/usr/local/mpich2/bin/mpirun -machinefile mpi.hosts -np 3 ./a.out

where mpi.hosts contains

rdgrd0001
rdgrd0002
rdgrd0002

The job aborts with:

rank 2 in job 2499  rdgrd0001.unx.sas.com_48578   caused collective abort of all ranks
  exit status of rank 2: killed by signal 11 
rank 1 in job 2499  rdgrd0001.unx.sas.com_48578   caused collective abort of all ranks
  exit status of rank 1: killed by signal 11

The crash occurs inside MPI_Init(). The stack trace is:

SIGSEGV - Segmentation violation.
  1* In routine libmpich:MPIU_Strncpy (+0x3B)
  2  Called from libmpich:PMIU_getval (+0x86)
  3  Called from libmpich:PMI_KVS_Get (+0xA8)
  4  Called from libmpich:MPIDI_Populate_vc_node_ids (+0x4DF)
  5  Called from libmpich:MPID_Init (+0x1EF)
  6  Called from libmpich:MPIR_Init_thread (+0x28F)
  7  Called from libmpich:PMPI_Init (+0x91)

Reordering the machine list avoids the problem. This list runs without error:

rdgrd0002
rdgrd0001
rdgrd0002

The problem is in mpid_vc.c. This line:

mpi_errno = populate_ids_from_mapping(value, &num_nodes, pg, &did_map);

fails to find a mapping, so it sets did_map to 0. It still sets num_nodes to 1,
though (which may or may not be a separate issue). num_nodes is then assigned
to g_num_nodes, which the loop that follows uses as an index.

That loop fills in node_names[g_num_nodes]. Because g_num_nodes starts
at 1 instead of 0, the loop accesses a bad pointer on its last iteration.

Resetting g_num_nodes to 0 before the loop seems to solve the problem.

    g_num_nodes = 0; /* NEW LINE */

    for (i = 0; i < pg->size; ++i)
    {
        if (i == our_pg_rank)
...

Please advise whether this is an appropriate fix.

Thanks,
sk