[mpich-discuss] runtime segfault: mpich2-1.3.2 with pgi v11.5 on rhel5.6 system

Dave Goodell goodell at mcs.anl.gov
Thu May 26 13:15:03 CDT 2011


On May 26, 2011, at 12:46 PM CDT, Limin Gu wrote:

> When I tried to run "strace mpiexec", it shows mmap tried to allocate
> huge memory, and it failed.
> 
> open("/sys/devices/system/node", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
> fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
> brk(0)                                  = 0x19125000
> brk(0x1914e000)                         = 0x1914e000
> getdents(3, /* 4 entries */, 32768)     = 112
> getdents(3, /* 0 entries */, 32768)     = 0
> close(3)                                = 0
> mmap(NULL, 18446744073223036928, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> mmap(NULL, 18446744073223168000, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> mmap(NULL, 134217728, PROT_NONE,
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x2b4e88607000
> munmap(0x2b4e88607000, 60788736)        = 0
> munmap(0x2b4e90000000, 6320128)         = 0
> mprotect(0x2b4e8c000000, 135168, PROT_READ|PROT_WRITE) = 0
> mmap(NULL, 18446744073223036928, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> +++ killed by SIGSEGV +++

Can you run "gdb mpiexec" and get us a stack trace for the failing mmap?  You may need to reconfigure and rebuild with "CFLAGS=-g" in order to get meaningful information from the debugger.  That value (18446744073223036928) is suspicious: it's 0xFFFFFFFFE3006000 in hex, or -486,514,688 decimal if interpreted as a signed 64-bit value instead.  It may be that the compiler or the code is doing some math incorrectly on size_t types.
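
For anyone who wants to check the arithmetic on that mmap length, here is a minimal Python sketch.  The helper names (as_signed64, wrap_size_t) are made up for illustration and are not part of MPICH2 or hwloc.  It only shows that a small negative result stored in a 64-bit size_t produces exactly the length seen in the strace output; it does not say where in the code that happens.

```python
def as_signed64(value):
    """Reinterpret an unsigned 64-bit integer as a two's-complement signed one."""
    return int.from_bytes(value.to_bytes(8, "little"), "little", signed=True)


def wrap_size_t(value):
    """Reduce a Python int modulo 2**64, the way 64-bit size_t arithmetic wraps."""
    return value % (1 << 64)


requested = 18446744073223036928  # length argument of the failing mmap

print(hex(requested))                        # 0xffffffffe3006000
print(as_signed64(requested))                # -486514688
print(wrap_size_t(-486514688) == requested)  # True
```

So a computation that should have produced a small or zero size, but went negative by roughly 464 MiB, would show up as this request once it is treated as unsigned.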

AFAIK hydra does not call mprotect at all, so if that mmap is coming from the same place as the mprotect, then this error may be happening in a non-MPICH2 library.

We do mmap in hydra indirectly in the hwloc package, in a fashion consistent with your strace output, and we have definitely had problems with PGI+hwloc in the past.  You might try running "HYDRA_BINDLIB=bogus mpiexec" to see if disabling hwloc will avoid the segfault.  If it does, you should be able to reconfigure and rebuild MPICH2 using "--without-hydra-bindlib" to get a working MPICH2, but lacking built-in process binding functionality.
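
If you want a quick way to compare a trace taken with and without hwloc, something like the following Python sketch could help.  It assumes the strace output has been saved as text (for example with strace's -o option); the function name suspicious_mmaps and the 2**63 threshold are my own choices for illustration, not anything hydra or strace provides.

```python
import re

# Matches the length argument of an anonymous mmap call as strace prints it.
MMAP_RE = re.compile(r"mmap\(NULL,\s*(\d+),")


def suspicious_mmaps(trace_text, limit=1 << 63):
    """Return mmap lengths in the trace that are at least `limit` bytes."""
    lengths = (int(m.group(1)) for m in MMAP_RE.finditer(trace_text))
    return [n for n in lengths if n >= limit]


sample = """\
mmap(NULL, 18446744073223036928, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x2b4e88607000
"""

print(suspicious_mmaps(sample))  # [18446744073223036928]
```

An empty list from the trace of the "HYDRA_BINDLIB=bogus mpiexec" run would be consistent with hwloc being the source of the bad requests.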

-Dave



