[mpich-discuss] MPI Fatal error, but only with more cluster nodes!

Gus Correa gus at ldeo.columbia.edu
Thu Sep 2 13:22:50 CDT 2010


Hi Fabio

In addition to Rajeev's suggestion, a couple of other things to check:

You mentioned an "error stack" and that you had to run
"ulimit -s unlimited" to get a single-node run to succeed.
That setting may be needed on *all nodes* running WRF.
It doesn't propagate to the other nodes if you set it on the
command line or in your job script, for instance.
It can be done in the resource manager (Torque, SGE, SLURM) startup
script, or in the Linux limit configuration files.
Maybe your system administrator can help you with this.
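For instance, a persistent per-node setting can go in the standard PAM
limits file (this is a sketch; the exact file and whether pam_limits is
enabled depend on your distribution, so check with your administrator):

```shell
# /etc/security/limits.conf -- lift the stack size limit for all users.
# Requires the pam_limits module; takes effect at the next login/ssh session.
*    soft    stack    unlimited
*    hard    stack    unlimited
```

You can verify it took effect on each node with "ssh <node> ulimit -s",
which should print "unlimited".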

FYI, a number of large atmosphere/ocean/climate models we run
produce a large program stack, often larger than the default
limit set by Linux, and do require the change above on all nodes.
(I haven't run WRF myself, though.)

Also, to sort out whether your network has a problem, you may want to try
something simpler than WRF.
The cpi.c program in the MPICH2 'examples' directory is good for this.
Compile it with mpicc and run it with mpirun across all nodes.
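Something along these lines (paths are examples; this assumes the MPD
process manager that mpich2-1.2 ships with, with one hostname per line
in mpd.hosts):

```shell
# Build the cpi test from the MPICH2 source tree:
cd mpich2-1.2/examples
mpicc -o cpi cpi.c

# Start the MPD daemons on both nodes, then run 8 processes
# across the two quad-core machines:
mpdboot -n 2 -f mpd.hosts
mpiexec -n 8 ./cpi
mpdallexit
```

If cpi hangs or fails across two nodes the same way WRF does, the
problem is in the MPICH2/network setup, not in WRF.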

I hope it helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Rajeev Thakur wrote:
> Try running the cpi example from the MPICH2 examples directory across two machines. There could be a connection issue between the two machines.
> 
> Rajeev
> 
> On Sep 2, 2010, at 8:20 AM, Fabio F.Gervasi wrote:
> 
>> Hi,
>>
>> I have a "strange" MPI problem when I run a WRF-NMM model compiled with Intel v11.1.072 (the GNU build runs ok!).
>> Mpich2-1.2 is also compiled with Intel.
>>
>> If I run on a single quad-core machine everything is ok, but when I try on two or more quad-core machines,
>> the wrf.exe processes initially seem to start on every pc, but after a few seconds wrf stops and I get the error:
>> Fatal error in MPI_Allreduce: other mpi error, error stack... and so on...
>>
>> I had to set "ulimit -s unlimited", otherwise wrf crashes even on a single machine...
>>
>> This is probably an MPI problem, but how can I fix it?
>>
>> Thank you very much
>> Fabio.
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


