[MPICH] How do I dump a core under MPICH-2?

Tue Sep 25 13:24:04 CDT 2007

Hello Jean-Marc Saffroy (and mpich-discuss list)

Thank you very much for your very clear explanation!

I tried the same experiments on my machine, and they do work as you said.
I use tcsh rather than bash, so I had to adapt them to tcsh.

After I added 'limit coredumpsize unlimited' to my .tcshrc,
the limit is in effect in the parallel job execution environment.
However, this only happens if I run the job as a tcsh command
(presumably because the tcsh started by mpiexec reads .tcshrc).
For example, the output below was obtained after I changed my .tcshrc:

31-pokey% mpiexec -n 1 tcsh -c 'limit'
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    unlimited
coredumpsize unlimited
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  1024
memorylocked 32 kbytes
maxproc      40960

By contrast, if I don't explicitly invoke tcsh, I get:

32-pokey% mpiexec -n 1 limit
problem with execution of limit  on  pokey:  [Errno 2] No such file or directory
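
This failure is consistent with 'limit' being a tcsh builtin rather than
a program on disk: mpiexec looks for an executable file, and there is none
by that name. A minimal sketch of such a lookup in POSIX sh (the helper
name on_path is made up for illustration, and this is not mpiexec's
actual search code):

```shell
#!/bin/sh
# on_path: hypothetical helper. Succeeds only if NAME is an executable
# regular file in one of the directories listed in $PATH.
on_path() {
    name=$1
    old_ifs=$IFS
    IFS=:                       # split $PATH on colons
    for dir in $PATH; do
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            IFS=$old_ifs
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

With this, 'on_path sh' should succeed, while 'on_path limit' is expected
to fail on a system where 'limit' exists only as a shell builtin.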

The other example you sent
(mpiexec -n 1 sh -c 'ulimit -c unlimited; sleep 1 & kill -11 %%')
works exactly as you said, and does indeed produce a core dump.

Since the program I am testing is too big, I wrote a short, deliberately
faulty MPI program that forces a segmentation fault (wrong_hellow.c):

#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
    int rank;
    int size;
    float *wrong;   /* deliberately left uninitialized */

    MPI_Init( 0, 0 );
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf( "Hello world from process %d of %d\n", rank, size );
    /* Write through the invalid pointer to provoke a segmentation fault. */
    wrong[1000]=0.0;
    MPI_Finalize();
    return 0;
}

When I launch it directly from mpiexec, no core dump is produced:

66-pokey% mpiexec -n 1 wrong_hellow
Hello world from process 0 of 1
rank 0 in job 97  pokey_55576   caused collective abort of all ranks
  exit status of rank 0: killed by signal 11

67-pokey% file core*
file: No match.

However, as you clarified, if I launch it as a (tc)shell command, with
"limit coredumpsize unlimited" set, I do get a core dump:

68-pokey% grep coredumpsize ~/.tcshrc
limit coredumpsize unlimited

69-pokey% mpiexec -n 1 tcsh -c 'wrong_hellow'
Hello world from process 0 of 1
Segmentation fault (core dumped)
rank 0 in job 98  pokey_55576   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9

70-pokey% file core*
core.19840: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style

Many thanks again, Jean-Marc, your examples were very enlightening,
and showed me how to solve the problem!

Gus Correa

Jean-Marc Saffroy wrote:

> On Tue, 25 Sep 2007, Robert Latham wrote:
>
>> You may be enabling core dumps on one process but not all of them?
>> Also there might be a difference between the environment given to an
>> interactive shell, a login shell,  and that of a non-interactive shell.
>>
>> What does 'ulimit -a' show you?  You might have to stick a 'ulimit -c
>> unlimited' in your .zshenv or .bashrc
>
>
> The MPI launcher may or may not propagate user limits from the current 
> shell to the parallel job, and for example mpd/mpiexec does not:
>
> $ ulimit -c
> 0
>
> $ mpd&
>
> $ ulimit -c unlimited
>
> $ mpiexec -n 1 sh -c 'ulimit -c'
> 0
>
> But in this case, it's easy to circumvent:
>
> $ mpiexec -n 1 sh -c 'ulimit -c unlimited; sleep 1 & kill -11 %%'
>
> $ file core.5412
> core.5412: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'sleep'
>
>