[mpich-discuss] runtime segfault: mpich2-1.3.2 with pgi v11.5 on rhel5.6 system
Dave Goodell
goodell at mcs.anl.gov
Thu May 26 13:15:03 CDT 2011
On May 26, 2011, at 12:46 PM CDT, Limin Gu wrote:
> When I tried to run "strace mpiexec", it shows mmap tried to allocate
> huge memory, and it failed.
>
> open("/sys/devices/system/node", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
> fcntl(3, F_SETFD, FD_CLOEXEC) = 0
> brk(0) = 0x19125000
> brk(0x1914e000) = 0x1914e000
> getdents(3, /* 4 entries */, 32768) = 112
> getdents(3, /* 0 entries */, 32768) = 0
> close(3) = 0
> mmap(NULL, 18446744073223036928, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> mmap(NULL, 18446744073223168000, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> mmap(NULL, 134217728, PROT_NONE,
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x2b4e88607000
> munmap(0x2b4e88607000, 60788736) = 0
> munmap(0x2b4e90000000, 6320128) = 0
> mprotect(0x2b4e8c000000, 135168, PROT_READ|PROT_WRITE) = 0
> mmap(NULL, 18446744073223036928, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> +++ killed by SIGSEGV +++
Can you run "gdb mpiexec" and get us a stack trace for the failing mmap? You may need to reconfigure and rebuild with "CFLAGS=-g" in order to get meaningful information from the debugger. That value (18446744073223036928) is suspicious: it's 0xFFFFFFFFE3006000 in hex, or -486,514,688 decimal if interpreted as a signed 64-bit value instead. It may be that the compiler or the code is doing some math incorrectly on size_t types.
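For anyone who wants to check that arithmetic, here is a minimal Python sketch. The helper name as_signed64 is made up for illustration, and it assumes a 64-bit size_t, which matches the x86_64 addresses in the strace output. It reinterprets the mmap length as a signed two's-complement value and shows that a negative result stored in an unsigned 64-bit type wraps to exactly the number strace reported.

```python
def as_signed64(value):
    """Reinterpret an unsigned 64-bit integer as signed two's complement."""
    if value >= 2**63:
        return value - 2**64
    return value


# Length argument of the first failing mmap call in the strace output.
requested = 18446744073223036928

print(hex(requested))          # unsigned 64-bit view, in hex
print(as_signed64(requested))  # the same bits read as a signed value

# A negative size stored in an unsigned 64-bit type wraps modulo 2**64.
wrapped = (-486514688) % 2**64
print(wrapped == requested)
```

A wrapped value like this does not say where the bad size was computed, only that a negative or underflowed quantity probably reached mmap as an unsigned length.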
AFAIK hydra does not mprotect at all, so if that mmap is coming from the same place then this error may be happening in a non-MPICH2 library.
We do mmap in hydra indirectly, through the hwloc package, in a fashion consistent with your strace output, and we have definitely had problems with PGI+hwloc in the past. You might try running "HYDRA_BINDLIB=bogus mpiexec" to see if disabling hwloc will avoid the segfault. If it does, you should be able to reconfigure and rebuild MPICH2 using "--without-hydra-bindlib" to get a working MPICH2, though one that lacks built-in process binding functionality.
-Dave