[mpich-discuss] Problems Running WRF on Ubuntu 11.10, MPICH2

Anthony Chan chan at mcs.anl.gov
Wed Feb 8 15:23:09 CST 2012


There is fpi, the Fortran counterpart of cpi; you can try that.
Also, there is the MPICH2 test suite, located in
mpich2-xxx/test/mpi, which can be invoked with "make testing".
It is unlikely those tests will reveal anything, though:
the test suite is meant to test the MPI implementation,
not your application.
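
For example, something along these lines (I am guessing at the exact
location of fpi.f in the source tree, so adjust the paths to your
MPICH2 installation):

    cd mpich2-xxx/examples
    mpif77 -o fpi fpi.f          # or mpif90, whichever wrapper you use
    mpiexec -f mpd.hosts -n 32 ./fpi

    cd ../test/mpi
    make testing                 # runs the MPICH2 test suite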

As you said earlier, your difficulty in running WRF
with the larger dataset is memory related.  You should contact the WRF
mailing list for more pointers.
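
If you want to rule out the limit settings themselves, one quick
sanity check (just a sketch; adjust the host file and process count)
is to let mpiexec report the limits each remote process actually sees:

    mpiexec -f mpd.hosts -n 4 bash -c 'hostname; ulimit -s; ulimit -l'

If those do not print "unlimited" on every node, the settings in
.bashrc/limits.conf are not reaching the MPI-launched processes.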

----- Original Message -----
> Hi Anthony,
> 
> Is there any other MPI example code (other than cpi.c) that I could
> test which would give me more information about my MPICH2 setup?
> 
> Here is the output from cpi (using 32 cores on 4 nodes):
> 
> mpiuser at crayN1-5150jo:~/Misc$ mpiexec -f mpd.hosts -n 32 ./cpi
> Process 1 on crayN1-5150jo
> Process 18 on crayN2-5150jo
> Process 2 on crayN2-5150jo
> Process 26 on crayN2-5150jo
> Process 5 on crayN1-5150jo
> Process 14 on crayN2-5150jo
> Process 21 on crayN1-5150jo
> Process 22 on crayN2-5150jo
> Process 25 on crayN1-5150jo
> Process 6 on crayN2-5150jo
> Process 9 on crayN1-5150jo
> Process 17 on crayN1-5150jo
> Process 30 on crayN2-5150jo
> Process 10 on crayN2-5150jo
> Process 29 on crayN1-5150jo
> Process 13 on crayN1-5150jo
> Process 8 on crayN3-5150jo
> Process 20 on crayN3-5150jo
> Process 4 on crayN3-5150jo
> Process 12 on crayN3-5150jo
> Process 0 on crayN3-5150jo
> Process 24 on crayN3-5150jo
> Process 16 on crayN3-5150jo
> Process 28 on crayN3-5150jo
> Process 3 on crayN4-5150jo
> Process 7 on crayN4-5150jo
> Process 11 on crayN4-5150jo
> Process 23 on crayN4-5150jo
> Process 27 on crayN4-5150jo
> Process 31 on crayN4-5150jo
> Process 19 on crayN4-5150jo
> Process 15 on crayN4-5150jo
> pi is approximately 3.1416009869231249, Error is 0.0000083333333318
> wall clock time = 0.009401
> 
> Best regards,
> Sukanta
> 
> On Wed, Feb 8, 2012 at 1:19 PM, Anthony Chan <chan at mcs.anl.gov> wrote:
> >
> > Hmm.. Not sure what is happening. I don't see anything
> > obviously wrong in your mpiexec verbose output (though
> > I am not a hydra expert). Your code is now being killed by a
> > segmentation fault. Naively, I would recompile WRF with -g
> > and use a debugger to see where the segfault is. If you don't
> > want to mess around with the WRF source code, you may want to
> > contact the WRF developers to see if they have encountered a
> > similar problem before.
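> >
> > Something along these lines, assuming the standard WRF build
> > scripts and gdb (treat it as a sketch, not a recipe):
> >
> >   # add -g (and e.g. -traceback for ifort) to the Fortran flags
> >   # in configure.wrf, then rebuild
> >   ./clean -a && ./configure && ./compile em_real
> >
> >   # run the ranks under gdb in batch mode to get a backtrace at
> >   # the point of the segfault
> >   mpiexec -f mpd.hosts -n 32 gdb -batch -ex run -ex bt ./main/wrf.exe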
> >
> > ----- Original Message -----
> >> Dear Anthony,
> >>
> >> Thanks for your response. Yes, I did try MP_STACK_SIZE and
> >> OMP_STACKSIZE. The error is still there. I have attached a log file
> >> (I ran mpiexec with the -verbose option). Maybe this will help.
> >>
> >> Best regards,
> >> Sukanta
> >>
> >> On Tue, Feb 7, 2012 at 3:28 PM, Anthony Chan <chan at mcs.anl.gov>
> >> wrote:
> >> >
> >> > I am not familiar with WRF, and I am not sure whether WRF uses
> >> > any threads in dmpar mode. Did you try setting MP_STACK_SIZE or
> >> > OMP_STACKSIZE?
> >> >
> >> > see: http://forum.wrfforum.com/viewtopic.php?f=6&t=255
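> >> >
> >> > For instance, something like this (the sizes are just a guess;
> >> > -genv is the hydra mpiexec option for exporting an environment
> >> > variable to every rank):
> >> >
> >> >   mpiexec -f mpd.hosts -n 32 \
> >> >           -genv OMP_STACKSIZE 512M \
> >> >           -genv MP_STACK_SIZE 512000000 ./wrf.exe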
> >> >
> >> > A.Chan
> >> >
> >> > ----- Original Message -----
> >> >> Hi,
> >> >>
> >> >> I am using a small cluster of 4 nodes (each with 8 cores + 24 GB
> >> >> RAM).
> >> >> OS: Ubuntu 11.10. The cluster uses an NFS file system and GigE
> >> >> connections.
> >> >>
> >> >> I installed MPICH2 and ran the cpi.c program successfully.
> >> >>
> >> >> I installed WRF (http://www.wrf-model.org/index.php) using the
> >> >> Intel compilers (dmpar option)
> >> >> I set ulimit -l and -s to be unlimited in .bashrc (all nodes)
> >> >> I set memlock to be unlimited in limits.conf (all nodes)
> >> >> I have password-less ssh (public key sharing) on all the nodes
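> >> >>
> >> >> For reference, those settings look roughly like this (syntax
> >> >> from memory, so approximate):
> >> >>
> >> >>   # in ~/.bashrc on every node
> >> >>   ulimit -s unlimited
> >> >>   ulimit -l unlimited
> >> >>
> >> >>   # in /etc/security/limits.conf on every node
> >> >>   *   soft   memlock   unlimited
> >> >>   *   hard   memlock   unlimited
> >> >>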
> >> >> I ran parallel jobs with 40x40x40, 40x40x50, and 40x40x60 grid
> >> >> points successfully. However, when I use 40x40x80 grid points, I
> >> >> get the following MPI error:
> >> >>
> >> >> **********************************************************
> >> >> Fatal error in PMPI_Wait: Other MPI error, error stack:
> >> >> PMPI_Wait(183)............: MPI_Wait(request=0x34e83a4,
> >> >> status=0x7fff7b24c400) failed
> >> >> MPIR_Wait_impl(77)........:
> >> >> dequeue_and_set_error(596): Communication error with rank 8
> >> >> **********************************************************
> >> >> Given that I can run the exact same simulation with a slightly
> >> >> smaller number of grid points without any problem, I suspect
> >> >> this error is related to stack size. What could be the problem?
> >> >>
> >> >> Thanks,
> >> >> Sukanta
> >> >>
> >> >> --
> >> >> Sukanta Basu
> >> >> Associate Professor
> >> >> North Carolina State University
> >> >> http://www4.ncsu.edu/~sbasu5/
> >> >> _______________________________________________
> >> >> mpich-discuss mailing list mpich-discuss at mcs.anl.gov
> >> >> To manage subscription options or unsubscribe:
> >> >> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> >> > _______________________________________________
> >> > mpich-discuss mailing list mpich-discuss at mcs.anl.gov
> >> > To manage subscription options or unsubscribe:
> >> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> >>
> >>
> >>
> >> --
> >> Sukanta Basu
> >> Associate Professor
> >> North Carolina State University
> >> http://www4.ncsu.edu/~sbasu5/
> 
> 
> 
> --
> Sukanta Basu
> Associate Professor
> North Carolina State University
> http://www4.ncsu.edu/~sbasu5/

