[mpich-discuss] runtime segfault: mpich2-1.3.2 with pgi v11.5 on rhel5.6 system

Dave Goodell goodell at mcs.anl.gov
Thu May 26 16:05:31 CDT 2011


If you are having a similar problem with Open MPI with the same OS+compiler combination then that strongly suggests a problem with your compiler.  I don't know if it's a bug in the PGI compiler itself or just a bad installation.

If you do eventually figure out what's wrong, please let us know.

Thanks,
-Dave

On May 26, 2011, at 3:51 PM CDT, Limin Gu wrote:

> mvapich2 with pgi v11.5 works fine on rhel5.6.
> I have a similar problem with openmpi-1.5.3 with pgi v11.5 on rhel5.6.
> I am not passing any unusual configure options, so this is really weird.
> 
> But all of them (mpich2, mvapich2, openmpi) work fine with pgi v11.5 on rhel4.9.
> 
> Thanks again!
> 
> Limin
> 
> On Thu, May 26, 2011 at 4:38 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>> Hmm... I have no idea what's going on then.  Do other programs compiled with the newer compiler work for you?
>> 
>> -Dave
>> 
>> On May 26, 2011, at 3:21 PM CDT, Limin Gu wrote:
>> 
>>> Thanks Dave!
>>> 
>>> I tried "HYDRA_BINDLIB=bogus mpiexec", it still segfaults :(
>>> 
>>> I reconfigured and rebuilt with "CFLAGS=-g"; here is the "bt" output from "gdb mpiexec":
>>> 
>>> (gdb) run
>>> Starting program: /home/lgu/mpich2_install/bin/mpiexec
>>> warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000
>>> [Thread debugging using libthread_db enabled]
>>> 
>>> Program received signal SIGSEGV, Segmentation fault.
>>> 0x0000003ef10e72bf in __vsnprintf_chk () from /lib64/libc.so.6
>>> (gdb) bt
>>> #0  0x0000003ef10e72bf in __vsnprintf_chk () from /lib64/libc.so.6
>>> #1  0x0000003ef10e722b in __snprintf_chk () from /lib64/libc.so.6
>>> #2  0x0000003ef0c0d1bb in call_init () from /lib64/ld-linux-x86-64.so.2
>>> #3  0x0000003ef0c0d2c5 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
>>> #4  0x0000003ef0c00aaa in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
>>> #5  0x0000000000000001 in ?? ()
>>> #6  0x00007fffffffeae0 in ?? ()
>>> #7  0x0000000000000000 in ?? ()
>>> (gdb)
>>> 
>>> 
>>> Thank you!
>>> 
>>> Limin
>>> 
>>>> Can you "gdb mpiexec" and find us a stack trace for the failing mmap?  You may need to reconfigure and rebuild with "CFLAGS=-g" in order to get meaningful information from the debugger.  That value (18446744073223036928) is suspicious: it's 0xFFFFFFFFE3006000 in hex, or -486,514,688 decimal if interpreted as a signed 64-bit value instead.  It may be that the compiler or the code is doing some math incorrectly on size_t types.
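
The arithmetic in the paragraph above can be checked with a short Python sketch. The helper name as_signed64 is hypothetical. The last step shows only one way such a value could arise (a negative 32-bit quantity sign-extended to 64 bits); that is an assumption for illustration, not a diagnosis of this segfault.

```python
# Value reported for the failing mmap, as an unsigned 64-bit integer.
value = 18446744073223036928


def as_signed64(u):
    """Interpret an unsigned 64-bit integer as two's-complement signed."""
    u &= (1 << 64) - 1
    return u - (1 << 64) if u >= (1 << 63) else u


hex_form = hex(value)
signed = as_signed64(value)
print(hex_form)  # 0xffffffffe3006000
print(signed)    # -486514688

# Illustration only (assumed mechanism): the low 32 bits, read as a signed
# 32-bit integer and then sign-extended to 64 bits, give the same pattern.
low32 = value & 0xFFFFFFFF
low32_signed = low32 - (1 << 32) if low32 >= (1 << 31) else low32
sign_extended = low32_signed & ((1 << 64) - 1)
print(sign_extended == value)  # True
```

The all-ones upper 32 bits are what make the value look like a small negative number rather than a genuine multi-exabyte length.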
>>>> 
>>>> AFAIK hydra does not mprotect at all, so if that mmap is coming from the same place then this error may be happening in a non-MPICH2 library.
>>>> 
>>>> We do mmap in hydra indirectly in the hwloc package, in a fashion consistent with your strace output, and we have definitely had problems with PGI+hwloc in the past.  You might try running "HYDRA_BINDLIB=bogus mpiexec" to see if disabling hwloc will avoid the segfault.  If it does, you should be able to reconfigure and rebuild MPICH2 using "--without-hydra-bindlib" to get a working MPICH2, but lacking built-in process binding functionality.
>>>> 
>>>> -Dave
>>>> 
>>>> 
>>>> _______________________________________________
>>>> mpich-discuss mailing list
>>>> mpich-discuss at mcs.anl.gov
>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>> 