[mpich-discuss] runtime segfault: mpich2-1.3.2 with pgi v11.5 on rhel5.6 system

Limin Gu lgu at penguincomputing.com
Thu May 26 15:51:16 CDT 2011


mvapich2 with pgi v11.5 works fine on rhel5.6.
I have a similar problem with openmpi-1.5.3 with pgi v11.5 on rhel5.6.
I am not passing any unusual configure options, so this is really weird.

But all of them (mpich2, mvapich2, openmpi) work fine with pgi v11.5 on rhel4.9.

Thanks again!

Limin

On Thu, May 26, 2011 at 4:38 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> Hmm.. I have no idea what's going on then.  Do other programs compiled with the newer compiler work for you?
>
> -Dave
>
> On May 26, 2011, at 3:21 PM CDT, Limin Gu wrote:
>
>> Thanks Dave!
>>
>> I tried "HYDRA_BINDLIB=bogus mpiexec", it still segfaults :(
>>
>> I reconfigured and rebuilt with "CFLAGS=-g"; here is the "gdb mpiexec" bt output:
>>
>> (gdb) run
>> Starting program: /home/lgu/mpich2_install/bin/mpiexec
>> warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000
>> [Thread debugging using libthread_db enabled]
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x0000003ef10e72bf in __vsnprintf_chk () from /lib64/libc.so.6
>> (gdb) bt
>> #0  0x0000003ef10e72bf in __vsnprintf_chk () from /lib64/libc.so.6
>> #1  0x0000003ef10e722b in __snprintf_chk () from /lib64/libc.so.6
>> #2  0x0000003ef0c0d1bb in call_init () from /lib64/ld-linux-x86-64.so.2
>> #3  0x0000003ef0c0d2c5 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
>> #4  0x0000003ef0c00aaa in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
>> #5  0x0000000000000001 in ?? ()
>> #6  0x00007fffffffeae0 in ?? ()
>> #7  0x0000000000000000 in ?? ()
>> (gdb)
>>
>>
>> Thank you!
>>
>> Limin
>>
>> > Can you run "gdb mpiexec" and get us a stack trace for the failing mmap?  You may need to reconfigure and rebuild with "CFLAGS=-g" in order to get meaningful information from the debugger.  That value (18446744073223036928) is suspicious: it's 0xFFFFFFFFE3006000 in hex, or -486,514,688 in decimal if interpreted as a signed 64-bit value instead.  It may be that the compiler or the code is doing some math incorrectly on size_t types.
>> >
>> > AFAIK hydra does not mprotect at all, so if that mmap is coming from the same place then this error may be happening in a non-MPICH2 library.
>> >
>> > We do mmap in hydra indirectly in the hwloc package, in a fashion consistent with your strace output, and we have definitely had problems with PGI+hwloc in the past.  You might try running "HYDRA_BINDLIB=bogus mpiexec" to see if disabling hwloc will avoid the segfault.  If it does, you should be able to reconfigure and rebuild MPICH2 using "--without-hydra-bindlib" to get a working MPICH2, but lacking built-in process binding functionality.
>> >
>> > -Dave
>> >
>> >
>> > _______________________________________________
>> > mpich-discuss mailing list
>> > mpich-discuss at mcs.anl.gov
>> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>> >
>
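
For anyone following the arithmetic in the quoted message, here is a minimal sketch that reinterprets the reported mmap length as a signed 64-bit integer. It only checks the numbers quoted above. The last step shows one way an unsigned subtraction could wrap to such a value; the operands there are hypothetical and are not taken from MPICH2, hwloc, or the strace output.

```python
import ctypes

# Length quoted in the thread for the failing mmap call.
reported = 18446744073223036928

# Hexadecimal form of the unsigned 64-bit value.
as_hex = hex(reported)

# Reinterpret the same 64 bits as a signed (two's-complement) integer.
as_signed = ctypes.c_int64(reported).value

# Same conversion by hand: subtract 2**64 when the top bit is set.
by_hand = reported - (1 << 64) if reported >= (1 << 63) else reported

# Hypothetical illustration only: a size_t subtraction "small - large"
# wraps modulo 2**64 and yields a huge unsigned length like the one seen.
small, large = 0, 486514688  # made-up operands, not from any real code
wrapped = (small - large) % (1 << 64)

print(as_hex)
print(as_signed)
print(by_hand == as_signed)
print(wrapped == reported)
```

A length that turns out to be a small negative number when read as signed usually points at an underflow or a bad sign extension somewhere, which fits the suspicion about size_t math above, but it does not say which library is responsible.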
