[MPICH] How do I dump a core under MPICH-2?
Gus Correa
gus at ldeo.columbia.edu
Tue Sep 25 13:24:04 CDT 2007
Hello Jean-Marc Saffroy (and mpich-discuss list),
Thank you very much for your very clear explanation!
I tried the same experiments on my machine, and they do work as you said.
Sorry, I don't use bash, hence I had to adapt them to tcsh.
After I added 'limit coredumpsize unlimited' to my .tcshrc,
the limit takes effect in the parallel job execution environment.
However, this only happens when I run the command through tcsh,
since tcsh reads .tcshrc at startup.
For example, the output below was obtained after I changed my .tcshrc:
31-pokey% mpiexec -n 1 tcsh -c 'limit'
cputime unlimited
filesize unlimited
datasize unlimited
stacksize unlimited
coredumpsize unlimited
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked 32 kbytes
maxproc 40960
By contrast, if I don't explicitly invoke tcsh, I get an error
('limit' is a tcsh builtin, not a standalone executable):
32-pokey% mpiexec -n 1 limit
problem with execution of limit on pokey: [Errno 2] No such file or directory
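Because 'limit' and 'ulimit' are shell builtins, any check of the limits has to run inside a shell on each rank. The following is a minimal sketch of such a check, not something taken from MPICH. The function name show_core_limit is hypothetical, and the sketch assumes a shell whose ulimit builtin accepts -S, -H and -c (bash and dash do).

```sh
# Hypothetical helper (not part of MPICH): report the core-dump limits
# that the current process has inherited.
# Assumption: the shell's ulimit builtin accepts -S, -H and -c.
show_core_limit() {
    # uname -n prints the node name, so output from several ranks
    # can be told apart.
    printf '%s: soft=%s hard=%s\n' \
        "$(uname -n)" "$(ulimit -S -c)" "$(ulimit -H -c)"
}
```

One possible use is to save the function in a script whose last line calls show_core_limit, and launch that script with mpiexec -n N sh followed by the script's path. A soft limit of 0 on any rank means that rank will not write a core file.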
The other example you sent
(mpiexec -n 1 sh -c 'ulimit -c unlimited; sleep 1 & kill -11 %%')
works exactly as you said, and does indeed produce a core dump.
Since the program I am testing is too big, I wrote a short, deliberately
broken MPI program that forces a segmentation fault (wrong_hellow.c):
#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
    int rank;
    int size;
    float *wrong;   /* deliberately left uninitialized */

    MPI_Init( 0, 0 );
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf( "Hello world from process %d of %d\n", rank, size );
    wrong[1000]=0.0;   /* invalid write, intended to cause a segmentation fault */
    MPI_Finalize();
    return 0;
}
When I launch it directly from mpiexec, no core dump is produced:
66-pokey% mpiexec -n 1 wrong_hellow
Hello world from process 0 of 1
rank 0 in job 97 pokey_55576 caused collective abort of all ranks
exit status of rank 0: killed by signal 11
67-pokey% file core*
file: No match.
However, as you clarified, if I launch it as a (tc)shell command, with
"limit coredumpsize unlimited" set in my .tcshrc,
I do get a core dump:
68-pokey% grep coredumpsize ~/.tcshrc
limit coredumpsize unlimited
69-pokey% mpiexec -n 1 tcsh -c 'wrong_hellow'
Hello world from process 0 of 1
Segmentation fault (core dumped)
rank 0 in job 98 pokey_55576 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
70-pokey% file core*
core.19840: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style
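The same effect can probably be had without editing a shell startup file, by raising the limit in a small wrapper just before the program starts. What follows is only a sketch under stated assumptions: run_with_core is a hypothetical name, not an MPICH feature, and the shell's ulimit builtin is assumed to accept -S, -H and -c. The wrapper raises the soft limit only as far as the hard limit, so it cannot help when the hard limit inherited from the launcher is already 0.

```sh
# Hypothetical wrapper (not an MPICH feature): raise the soft core-dump
# limit to the hard limit, then run the given command.
# Assumption: the shell's ulimit builtin accepts -S, -H and -c.
run_with_core() {
    (
        # The subshell keeps the caller's own limits unchanged.
        ulimit -S -c "$(ulimit -H -c)" || exit 1
        # Replace the subshell with the command, so its exit status
        # (or fatal signal) is what the caller sees.
        exec "$@"
    )
}
```

One way to use it is to save the function in a script whose last line is run_with_core "$@", and then launch that script with mpiexec -n 1 sh, passing the script's path followed by the path to wrong_hellow. Whether and where a core file then appears still depends on the hard limit and on the system's core file settings.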
Many thanks again, Jean-Marc, your examples were very enlightening,
and showed me how to solve the problem!
Gus Correa
Jean-Marc Saffroy wrote:
> On Tue, 25 Sep 2007, Robert Latham wrote:
>
>> You may be enabling core dumps on one process but not all of them?
>> Also there might be a difference between the environment given to an
>> interactive shell, a login shell, and that of a non-interactive shell.
>>
>> What does 'ulimit -a' show you? You might have to stick a 'ulimit -c
>> unlimited' in your .zshenv or .bashrc
>
>
> The MPI launcher may or may not propagate user limits from the current
> shell to the parallel job, and for example mpd/mpiexec does not:
>
> $ ulimit -c
> 0
>
> $ mpd&
>
> $ ulimit -c unlimited
>
> $ mpiexec -n 1 sh -c 'ulimit -c'
> 0
>
> But in this case, it's easy to circumvent:
>
> $ mpiexec -n 1 sh -c 'ulimit -c unlimited; sleep 1 & kill -11 %%'
>
> $ file core.5412
> core.5412: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
> SVR4-style, from 'sleep'
>
>