[mpich-discuss] Problems Running WRF on Ubuntu 11.10, MPICH2

Gustavo Correa gus at ldeo.columbia.edu
Wed Feb 8 17:20:48 CST 2012


Hi Sukanta,

Did you set the stacksize [not only memlock] to unlimited in 
/etc/security/limits.conf on all nodes?
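
The entries would look something like this
[format is "domain type item value"; '*' means all users]:

*    soft    stack      unlimited
*    hard    stack      unlimited
*    soft    memlock    unlimited
*    hard    memlock    unlimited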

Not sure this will work, but you could try running 'ulimit -s' and 'ulimit -l' via mpiexec, just to check.
Since ulimit is a shell builtin [not an executable], it has to go through a shell:

mpiexec -prepend-rank -f hostfile -np 32 bash -c 'ulimit -s'
mpiexec -prepend-rank -f hostfile -np 32 bash -c 'ulimit -l'

Or just login to each node and check.
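
For instance [assuming hostfile lists one node name per line]:

for host in $(cat hostfile); do
    ssh $host 'hostname; ulimit -s; ulimit -l'
done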

Also, if your WRF is compiled with OpenMP,
I think the Intel-specific equivalent of OMP_STACKSIZE is
KMP_STACKSIZE [not MP_STACKSIZE], although the Intel compilers should
also accept the portable/standard OMP_STACKSIZE [but I don't know if they do].
For some models here I had to make it as big as 512m [I don't run WRF, though].
'man ifort' should tell you more about it [at the end of the man page].
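
For example, in .bashrc on every node [512m is just the value that
worked for me; your model may need more or less]:

export KMP_STACKSIZE=512m
export OMP_STACKSIZE=512m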

I hope this helps,
Gus Correa

On Feb 8, 2012, at 4:23 PM, Anthony Chan wrote:

> 
> There is fpi, the Fortran counterpart of cpi; you can try that.
> Also, there is the MPICH2 test suite, located in
> mpich2-xxx/test/mpi, which can be invoked with "make testing".
> It is unlikely those tests will reveal anything, though:
> the test suite is meant to test the MPI implementation,
> not your app.
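> 
> For example:
> 
> cd mpich2-xxx/test/mpi
> make testing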
> 
> As you said earlier, your difficulty in running WRF
> with a larger dataset is memory related.  You should contact the WRF
> mailing list for more pointers.
> 
> ----- Original Message -----
>> Hi Anthony,
>> 
>> Is there any other MPI example code (other than cpi.c) that I could
>> test which would give me more information about my MPICH setup?
>> 
>> Here is the output from cpi (using 32 cores on 4 nodes):
>> 
>> mpiuser@crayN1-5150jo:~/Misc$ mpiexec -f mpd.hosts -n 32 ./cpi
>> Process 1 on crayN1-5150jo
>> Process 18 on crayN2-5150jo
>> Process 2 on crayN2-5150jo
>> Process 26 on crayN2-5150jo
>> Process 5 on crayN1-5150jo
>> Process 14 on crayN2-5150jo
>> Process 21 on crayN1-5150jo
>> Process 22 on crayN2-5150jo
>> Process 25 on crayN1-5150jo
>> Process 6 on crayN2-5150jo
>> Process 9 on crayN1-5150jo
>> Process 17 on crayN1-5150jo
>> Process 30 on crayN2-5150jo
>> Process 10 on crayN2-5150jo
>> Process 29 on crayN1-5150jo
>> Process 13 on crayN1-5150jo
>> Process 8 on crayN3-5150jo
>> Process 20 on crayN3-5150jo
>> Process 4 on crayN3-5150jo
>> Process 12 on crayN3-5150jo
>> Process 0 on crayN3-5150jo
>> Process 24 on crayN3-5150jo
>> Process 16 on crayN3-5150jo
>> Process 28 on crayN3-5150jo
>> Process 3 on crayN4-5150jo
>> Process 7 on crayN4-5150jo
>> Process 11 on crayN4-5150jo
>> Process 23 on crayN4-5150jo
>> Process 27 on crayN4-5150jo
>> Process 31 on crayN4-5150jo
>> Process 19 on crayN4-5150jo
>> Process 15 on crayN4-5150jo
>> pi is approximately 3.1416009869231249, Error is 0.0000083333333318
>> wall clock time = 0.009401
>> 
>> Best regards,
>> Sukanta
>> 
>> On Wed, Feb 8, 2012 at 1:19 PM, Anthony Chan <chan at mcs.anl.gov> wrote:
>>> 
>>> Hmm.. Not sure what is happening.. I don't see anything
>>> obviously wrong in your mpiexec verbose output (though
>>> I am not a hydra expert). Your code now gets killed by a
>>> segmentation fault. Naively, I would recompile WRF with -g
>>> and use a debugger to see where the segfault is. If you don't want
>>> to mess around with the WRF source code, you may want to contact the
>>> WRF developers to see if they have encountered a similar problem
>>> before.
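>>> 
>>> Roughly like this [just a sketch; WRF keeps its compiler flags in
>>> configure.wrf, so add -g there and rebuild]:
>>> 
>>> ulimit -c unlimited                  # allow core dumps
>>> mpiexec -f hostfile -np 32 ./wrf.exe
>>> gdb ./wrf.exe core                   # then 'bt' shows where it died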
>>> 
>>> ----- Original Message -----
>>>> Dear Anthony,
>>>> 
>>>> Thanks for your response. Yes, I did try MP_STACK_SIZE and
>>>> OMP_STACKSIZE. The error is still there. I have attached a log file
>>>> (I ran mpiexec with the -verbose option). Maybe this will help.
>>>> 
>>>> Best regards,
>>>> Sukanta
>>>> 
>>>> On Tue, Feb 7, 2012 at 3:28 PM, Anthony Chan <chan at mcs.anl.gov>
>>>> wrote:
>>>>> 
>>>>> I am not familiar with WRF, and not sure if WRF uses any threads
>>>>> in dmpar mode. Did you try setting MP_STACK_SIZE or OMP_STACKSIZE?
>>>>> 
>>>>> see: http://forum.wrfforum.com/viewtopic.php?f=6&t=255
>>>>> 
>>>>> A.Chan
>>>>> 
>>>>> ----- Original Message -----
>>>>>> Hi,
>>>>>> 
>>>>>> I am using a small cluster of 4 nodes (each with 8 cores + 24 GB
>>>>>> RAM). OS: Ubuntu 11.10. The cluster uses an NFS file system and
>>>>>> GigE connections.
>>>>>> 
>>>>>> I installed MPICH2 and ran the cpi.c example program successfully.
>>>>>> 
>>>>>> I installed WRF (http://www.wrf-model.org/index.php) using the
>>>>>> Intel compilers (dmpar option).
>>>>>> I set ulimit -l and -s to unlimited in .bashrc on all nodes [see
>>>>>> the snippet below].
>>>>>> I set memlock to unlimited in limits.conf on all nodes.
>>>>>> I have password-less ssh (public-key sharing) on all the nodes.
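>>>>>> 
>>>>>> (That is, each node's .bashrc contains these lines:)
>>>>>> 
>>>>>> ulimit -s unlimited
>>>>>> ulimit -l unlimited
>>>>>> 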
>>>>>> I ran parallel jobs with 40x40x40, 40x40x50, and 40x40x60 grid
>>>>>> points successfully. However, when I use 40x40x80 grid points, I
>>>>>> get the following MPI error:
>>>>>> 
>>>>>> **********************************************************
>>>>>> Fatal error in PMPI_Wait: Other MPI error, error stack:
>>>>>> PMPI_Wait(183)............: MPI_Wait(request=0x34e83a4,
>>>>>> status=0x7fff7b24c400) failed
>>>>>> MPIR_Wait_impl(77)........:
>>>>>> dequeue_and_set_error(596): Communication error with rank 8
>>>>>> **********************************************************
>>>>>> Given that I can run the exact same simulation with a slightly
>>>>>> smaller number of grid points without any problem, I suspect this
>>>>>> error is related to stack size. What could be the problem?
>>>>>> 
>>>>>> Thanks,
>>>>>> Sukanta
>>>>>> 
>>>>>> --
>>>>>> Sukanta Basu
>>>>>> Associate Professor
>>>>>> North Carolina State University
>>>>>> http://www4.ncsu.edu/~sbasu5/
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sukanta Basu
>>>> Associate Professor
>>>> North Carolina State University
>>>> http://www4.ncsu.edu/~sbasu5/
>> 
>> 
>> 
>> --
>> Sukanta Basu
>> Associate Professor
>> North Carolina State University
>> http://www4.ncsu.edu/~sbasu5/