[mpich-discuss] runtime segfault: mpich2-1.3.2 with pgi v11.5 on rhel5.6 system
Dave Goodell
goodell at mcs.anl.gov
Thu May 26 16:05:31 CDT 2011
If you are having a similar problem with Open MPI on the same OS+compiler combination, then that strongly suggests a problem with your compiler. I don't know whether it's a bug in the PGI compiler itself or just a bad installation.
If you do eventually figure out what's wrong, please let us know.
Thanks,
-Dave
On May 26, 2011, at 3:51 PM CDT, Limin Gu wrote:
> mvapich2 with pgi v11.5 works fine on rhel5.6.
> I have a similar problem with openmpi-1.5.3 with pgi v11.5 on rhel5.6.
> I am not doing any fancy configure, so this is really weird.
>
> But all of them (mpich2, mvapich2, openmpi) work fine with pgi v11.5 on rhel4.9.
>
> Thanks again!
>
> Limin
>
> On Thu, May 26, 2011 at 4:38 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>> Hmm.. I have no idea what's going on then. Do other programs compiled with the newer compiler work for you?
>>
>> -Dave
>>
>> On May 26, 2011, at 3:21 PM CDT, Limin Gu wrote:
>>
>>> Thanks Dave!
>>>
>>> I tried "HYDRA_BINDLIB=bogus mpiexec", it still segfaults :(
>>>
>>> I reconfigured and rebuilt with "CFLAGS=-g"; here is the "gdb mpiexec" bt output:
>>>
>>> (gdb) run
>>> Starting program: /home/lgu/mpich2_install/bin/mpiexec
>>> warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000
>>> [Thread debugging using libthread_db enabled]
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> 0x0000003ef10e72bf in __vsnprintf_chk () from /lib64/libc.so.6
>>> (gdb) bt
>>> #0 0x0000003ef10e72bf in __vsnprintf_chk () from /lib64/libc.so.6
>>> #1 0x0000003ef10e722b in __snprintf_chk () from /lib64/libc.so.6
>>> #2 0x0000003ef0c0d1bb in call_init () from /lib64/ld-linux-x86-64.so.2
>>> #3 0x0000003ef0c0d2c5 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
>>> #4 0x0000003ef0c00aaa in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
>>> #5 0x0000000000000001 in ?? ()
>>> #6 0x00007fffffffeae0 in ?? ()
>>> #7 0x0000000000000000 in ?? ()
>>> (gdb)
>>>
>>>
>>> Thank you!
>>>
>>> Limin
>>>
>>>> Can you "gdb mpiexec" and find us a stack trace for the failing mmap? You may need to reconfigure and rebuild with "CFLAGS=-g" in order to get meaningful information from the debugger. That value (18446744073223036928) is suspicious: it's 0xFFFFFFFFE3006000 in hex, or -486,514,688 in decimal if interpreted as a signed 64-bit value instead. It may be that the compiler or the code is doing some math incorrectly on size_t types.
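
The hex and signed readings of that value can be checked with a few lines of Python. This is only a sketch of the arithmetic, not anything taken from MPICH2 or hwloc; the helper name as_signed_64 is made up for illustration, and it assumes the value is a 64-bit quantity, as size_t is on x86-64.

```python
import struct


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as two's-complement signed.

    Hypothetical helper for illustration; not part of any MPI library.
    """
    # Pack as unsigned 64-bit ("Q"), then unpack the same bytes as signed ("q").
    return struct.unpack("<q", struct.pack("<Q", value))[0]


# The value quoted in the message above for the failing mmap.
suspicious = 18446744073223036928

print(hex(suspicious))           # hexadecimal form of the unsigned value
print(as_signed_64(suspicious))  # same 64 bits read as a signed integer
```

A small negative number that shows up as a huge unsigned one is the usual sign of a subtraction underflowing, or of a signed value being widened into an unsigned type, which fits the suspicion about size_t math.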
>>>>
>>>> AFAIK hydra does not mprotect at all, so if that mmap is coming from the same place then this error may be happening in a non-MPICH2 library.
>>>>
>>>> We do mmap in hydra indirectly in the hwloc package, in a fashion consistent with your strace output, and we have definitely had problems with PGI+hwloc in the past. You might try running "HYDRA_BINDLIB=bogus mpiexec" to see if disabling hwloc will avoid the segfault. If it does, you should be able to reconfigure and rebuild MPICH2 using "--without-hydra-bindlib" to get a working MPICH2, but lacking built-in process binding functionality.
>>>>
>>>> -Dave
>>>>
>>>>
>>>> _______________________________________________
>>>> mpich-discuss mailing list
>>>> mpich-discuss at mcs.anl.gov
>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>>
>>>
>>
>>