[mpich-discuss] Problems Running WRF on Ubuntu 11.10, MPICH2

Anthony Chan chan at mcs.anl.gov
Wed Feb 8 12:19:03 CST 2012


Hmm..  Not sure what is happening.  I don't see anything
obviously wrong in your mpiexec verbose output (though
I am no hydra expert).  Your code is now being killed by a
segmentation fault.  Naively, I would recompile WRF with -g
and use a debugger to see where the segfault occurs.  If you
don't want to mess around with the WRF source code, you may
want to contact the WRF developers to see if they have
encountered a similar problem before.
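
For example, one rough way to get a backtrace out of an MPI run
(just a sketch; it assumes gdb is available on the nodes and that
your run directory holds the wrf.exe executable):

  # rebuild with debugging symbols first: add -g to the compiler
  # flags (in configure.wrf for WRF builds), recompile, then
  ulimit -c unlimited        # allow core files to be written
  mpiexec -n 32 ./wrf.exe    # the crashing rank should leave a core
  gdb ./wrf.exe core         # load the executable plus the core dump
  (gdb) bt                   # print the backtrace at the segfault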

----- Original Message -----
> Dear Anthony,
> 
> Thanks for your response. Yes, I did try MP_STACK_SIZE and
> OMP_STACKSIZE. The error is still there. I have attached a log file
> (I ran mpiexec with the -verbose option); maybe this will help.
> 
> Best regards,
> Sukanta
> 
> On Tue, Feb 7, 2012 at 3:28 PM, Anthony Chan <chan at mcs.anl.gov> wrote:
> >
> > I am not familiar with WRF, and I am not sure whether WRF uses any
> > threads in dmpar mode. Did you try setting MP_STACK_SIZE or
> > OMP_STACKSIZE?
> >
> > see: http://forum.wrfforum.com/viewtopic.php?f=6&t=255
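> >
> > For example, something along these lines before launching the run
> > (only a sketch; 512 MB is a starting guess, and the exact units
> > MP_STACK_SIZE expects depend on your runtime):
> >
> >   export OMP_STACKSIZE=512M        # per-thread stack for OpenMP
> >   export MP_STACK_SIZE=536870912   # same idea, given in bytes
> >   mpiexec -n 32 ./wrf.exe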
> >
> > A.Chan
> >
> > ----- Original Message -----
> >> Hi,
> >>
> >> I am using a small cluster of 4 nodes (each with 8 cores + 24 GB
> >> RAM).
> >> OS: Ubuntu 11.10. The cluster uses an NFS file system and GigE
> >> connections.
> >>
> >> I installed mpich2 and ran the cpi.c example program successfully.
> >>
> >> I installed WRF (http://www.wrf-model.org/index.php) using the
> >> Intel compilers (dmpar option)
> >> I set ulimit -l and -s to unlimited in .bashrc (all nodes)
> >> I set memlock to unlimited in limits.conf (all nodes); see the
> >> example lines below
> >> I have password-less ssh (public key sharing) on all the nodes
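> >>
> >> For reference, the relevant lines look roughly like this (a sketch
> >> from memory):
> >>
> >>   # in ~/.bashrc on every node
> >>   ulimit -s unlimited    # stack size
> >>   ulimit -l unlimited    # max locked memory
> >>
> >>   # in /etc/security/limits.conf on every node
> >>   *   soft   memlock   unlimited
> >>   *   hard   memlock   unlimited
> >>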
> >> I ran parallel jobs with 40x40x40, 40x40x50, and 40x40x60 grid
> >> points successfully. However, when I use 40x40x80 grid points, I
> >> get the following MPI error:
> >>
> >> **********************************************************
> >> Fatal error in PMPI_Wait: Other MPI error, error stack:
> >> PMPI_Wait(183)............: MPI_Wait(request=0x34e83a4,
> >> status=0x7fff7b24c400) failed
> >> MPIR_Wait_impl(77)........:
> >> dequeue_and_set_error(596): Communication error with rank 8
> >> **********************************************************
> >> Given that I can run the exact same simulation with slightly fewer
> >> grid points without any problem, I suspect this error is related to
> >> the stack size. What could be the problem?
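> >>
> >> (One thing I have not fully ruled out: whether the unlimited stack
> >> actually reaches the remote ranks, since Ubuntu's default .bashrc
> >> returns early for non-interactive shells, so lines added at the
> >> bottom may never run over ssh. A quick check, assuming bash on all
> >> nodes:
> >>
> >>   mpiexec -n 4 bash -c 'ulimit -s; ulimit -l'
> >>
> >> should print "unlimited" twice per rank.)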
> >>
> >> Thanks,
> >> Sukanta
> >>
> >> --
> >> Sukanta Basu
> >> Associate Professor
> >> North Carolina State University
> >> http://www4.ncsu.edu/~sbasu5/
> 
> 
> 
> --
> Sukanta Basu
> Associate Professor
> North Carolina State University
> http://www4.ncsu.edu/~sbasu5/

