[mpich-discuss] MPI Fatal error, but only with more cluster nodes!

Fabio F.Gervasi fabio.gervasi70 at gmail.com
Fri Sep 3 02:57:28 CDT 2010


Thank you Rajeev and Gus!

I have no problem with the network connection, because I tried the
WRF build compiled with GNU gcc/gfortran and everything is OK
(I can also run on 5 PCs together).

So, all other things being equal, I have problems (on more than 1 PC) only
when I use the Intel-compiled WRF version (wrf stops with MPI errors).
Besides, before starting wrf.exe I need to run another executable
(real_nmm.exe), which runs correctly with both WRF builds.

About ulimit, I did set "ulimit -s unlimited" on each machine:
in fact I put that command in the .bash_profile of every node, and
I also verified it with "ulimit -a" on each PC.
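
For reference, this is roughly what I mean (just a sketch of my setup):

    # appended to ~/.bash_profile on every node
    ulimit -s unlimited

and then, to check it on each PC:

    ulimit -a    # "stack size" should now report "unlimited"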

So, *it seems to be purely a "WRF Intel version <-> MPI" problem.*
It's driving me crazy! :-(

Thank you!
Fabio.


2010/9/2 Gus Correa <gus at ldeo.columbia.edu>

> Hi Fabio
>
> In addition to Rajeev's suggestion:
>
> You mentioned some "error stack" and that to
> run on one node successfully you did "ulimit -s unlimited".
> This may be needed on *all nodes* running WRF.
> It doesn't propagate to the other nodes if you only do it on the command line
> or put the ulimit command in your job script, for instance.
> It can be done in the resource manager (Torque, SGE, SLURM) startup
> script, or in the Linux limit configuration files.
> Maybe your system administrator can help you with this.
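>
> For example (just a sketch; the exact file depends on your distribution),
> with pam_limits a pair of lines like these in /etc/security/limits.conf
> on every compute node raises the stack limit for all users:
>
>     *   soft   stack   unlimited
>     *   hard   stack   unlimited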
>
> FYI, a number of large atmosphere/ocean/climate models we
> run produce a large program stack, oftentimes larger than the
> default limit set by Linux, and do require the change above on
> all nodes.
> (I haven't run WRF, though.)
>
> Also, to sort out if your network has a problem, you may want to try
> something simpler than WRF.
> The cpi.c program in the MPICH2 'examples' directory is good for this.
> Compile with mpicc, run with mpirun on all nodes.
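>
> Something along these lines, for instance (the exact launch options depend
> on your process manager; with MPICH2-1.2 the mpd ring must already be
> running across the nodes, and mpirun/mpiexec are equivalent here):
>
>     cd /path/to/mpich2-1.2/examples   # the 'examples' dir of the MPICH2 tree
>     mpicc -o cpi cpi.c                # compile with the MPICH2 wrapper
>     mpiexec -n 8 ./cpi                # 8 processes spread over the mpd ring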
>
> I hope it helps.
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
>
> Rajeev Thakur wrote:
>
>> Try running the cpi example from the MPICH2 examples directory across two
>> machines. There could be a connection issue between the two machines.
>>
>> Rajeev
>>
>> On Sep 2, 2010, at 8:20 AM, Fabio F.Gervasi wrote:
>>
>>> Hi,
>>>
>>> I have a "strange" MPI problem when I run the WRF-NMM model compiled with
>>> Intel v11.1.072 (the GNU-compiled build runs OK!).
>>> MPICH2-1.2 is also compiled with Intel.
>>>
>>> If I run on a single quad-core machine everything is OK, but when I try
>>> on two or more quad-core machines,
>>> the wrf.exe processes initially seem to start on every PC, but after a few
>>> seconds wrf stops and I get the error:
>>> Fatal error in MPI_Allreduce: Other MPI error, error stack... and so on.
>>>
>>> I already set "ulimit -s unlimited", otherwise wrf crashes even on a
>>> single machine...
>>>
>>> This is probably an MPI problem, but how can I fix it?
>>>
>>> Thank you very much
>>> Fabio.
>>
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>