[MPICH] core dumps MPICH & Linux

Martin Siegert siegert at sfu.ca
Thu Oct 26 14:40:20 CDT 2006


Hi Wolfram,

If you are using mpd daemons, the MPI programs are actually children
of the mpd process. Hence, the limits set for mpd are inherited by
the MPI processes. These can differ from the limits set for your
login shell, so a "ulimit -c unlimited" in .profile or .bashrc does
not necessarily reach them.
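
A minimal sketch of this inheritance, using only the shell builtin
ulimit and no MPI at all. The function names are hypothetical and
chosen for illustration; the restricted subshell merely stands in for
a daemon that was started with tighter limits than the login shell.

```shell
# Sketch: a child process inherits resource limits from its parent,
# not from whatever the user's login shell has configured.

# Print the core-file size limit seen by a freshly spawned child process.
child_core_limit() {
    sh -c 'ulimit -c'
}

# Spawn the same child from a subshell whose soft core limit was lowered
# to 0, mimicking a parent daemon that runs with restrictive limits.
restricted_child_core_limit() {
    ( ulimit -S -c 0; child_core_limit )
}

echo "child of this shell:        $(child_core_limit)"
echo "child of restricted parent: $(restricted_child_core_limit)"
```

The second line reports 0 regardless of what the calling shell allows,
which is the same effect the MPI processes see under mpd.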

I got bitten by this when the mpd daemons were started by root at
boot time. As a consequence, all MPI processes had the limits set for
the root account. Since SuSE sets a very small stack size for root,
many MPI programs were crashing because they ran into that stack size
limit. The solution was to change the limits in the startup
script of the mpd daemons.
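
As a rough sketch of such a startup-script change (not the actual
script used here): the hypothetical helper below raises the soft core
and stack limits to their hard limits in a subshell and then starts
the given command, so that the command and all of its children
inherit the raised values. The "mpd --daemon" invocation in the
comment is an assumption about how the daemon is launched on a given
site; a script run by root could also set explicit values instead of
copying the hard limits.

```shell
# Sketch: start a command with raised soft limits so that the command and
# everything it spawns inherit them. The helper name is hypothetical.
run_with_raised_limits() {
    (
        # Raise the soft limits up to the current hard limits. This never
        # requires extra privileges, unlike raising a hard limit.
        ulimit -S -c "$(ulimit -H -c)"   # core file size
        ulimit -S -s "$(ulimit -H -s)"   # stack size
        exec "$@"
    )
}

# In an mpd startup script this might look like (assumed invocation):
#   run_with_raised_limits mpd --daemon

# Demonstration without MPI: show the limits a child process ends up with.
run_with_raised_limits sh -c 'echo "core: $(ulimit -c)  stack: $(ulimit -s)"'
```

Because the limits are changed inside a subshell, the calling script's
own limits stay untouched; only the launched daemon is affected.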

I do not know whether something like this is the reason for your
problems.

Cheers,
Martin

-- 
Martin Siegert
Head, HPC at SFU
WestGrid Site Lead
Academic Computing Services                phone: (604) 291-4691
Simon Fraser University                    fax:   (604) 291-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6

On Thu, Oct 26, 2006 at 06:11:07PM +0200, Wolfram Brenig wrote:
> Let me be more precise.
> 
> I have no problem in running code on the
> heterogeneous system. (I can also reduce
> the MPI-ring to just a homogeneous section
> of the cluster ... to be sure.)
> 
> What I want to do is get core dumps
> from the slaves of a master/slave type of
> MPI code for debugging purposes in a particular
> case.
> 
> Now, when I run the slaves as standalone
> processes I can get core dumps from them.
> But when I run them as MPI processes they
> do not produce any core dump files.
> 
> I have set:
> 
> $> ulimit -c unlimited
> 
> in the .profile and .bashrc and when I do:
> 
> $> ssh node-whatever ulimit -c
> 
> I get:
> 
> unlimited
> 
> for any node-whatever of the cluster.
> I checked for the core files in the directory which I
> get when I do:
> 
> $> ssh node-whatever pwd
> 
> but I also searched the whole home
> file system ... there is no core file.
> 
> Any suggestions as to what I might be missing?
> 
> 
> Wolfram
> 
> 
> Darius Buntinas wrote:
> > Note that MPICH2 does not (yet) run on heterogeneous clusters.  If you're
> > getting crashes, this may be why.
> > 
> > Try running
> >   ulimit -c
> > using mpiexec (as if it were an mpi program).  That will show you what the
> > limit is actually set at on each node.
> > 
> > -d
> > 
> > 
> > On Thu, 26 Oct 2006, Wolfram Brenig wrote:
> > 
> >> I'm trying to force core dumps on
> >> a heterogeneous linux cluster running
> >> mpich2version: 1.0.2 and SuSE linux
> >> versions 9.2 and 10.0.
> >>
> >> I have set "ulimit -c unlimited" on all
> >> nodes.
> >>
> >> When I run non-parallel code I can get
> >> core dumps. But no parallel program will
> >> core dump.
> >>
> >> Any help, or hint where to get info
> >> would be most appreciated.
> >>
> >> From searching the WWW I got the
> >> impression that linux may not be able
> >> to do core dumps with MPI. Is this so?
> >>
> >> Wolfram
> >>
> >>
> >>



