[mpich-discuss] MPI Fatal error, but only with more cluster nodes!

Fabio F.Gervasi fabio.gervasi70 at gmail.com
Fri Sep 10 04:01:17 CDT 2010


Hi all!
Any ideas about this issue?

Thanks,
Fabio.


2010/9/3 Fabio F.Gervasi <fabio.gervasi70 at gmail.com>

> Thank you Rajeev and Gus!
>
> I don't think there is a network connection problem, because
> I tried running the WRF version built with GNU gcc/gfortran
> and everything is fine (I can also run across 5 PCs together).
>
> So, all else being equal, I only have problems (on more than 1 PC) when I
> use the Intel-compiled WRF version (wrf stops with MPI errors).
> Besides, before starting wrf.exe I need to run another executable
> (real_nmm.exe), which runs correctly with both WRF builds.
>
> About ulimit, I did set "ulimit -s unlimited" on each machine:
> in fact I put that setting in their .bash_profile, and I also
> verified it with "ulimit -a" on each PC.
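>
> For completeness, this is roughly how I checked it (the node names below
> are just placeholders for my machines); what matters is the limit a remote,
> non-interactive shell reports, since that is closer to what the remotely
> started MPI processes will see than an interactive login:
>
>   for h in node01 node02; do
>       echo -n "$h: "
>       ssh $h 'ulimit -s'     # should print "unlimited" on every node
>   done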
>
> So, *it seems to be purely a "WRF Intel version <-> MPI" problem...*
> That's why it's driving me crazy! :-(
>
> Thank you!
> Fabio.
>
>
> 2010/9/2 Gus Correa <gus at ldeo.columbia.edu>
>
> Hi Fabio
>>
>> In addition to Rajeev's suggestion:
>>
>> You mentioned some "error stack" and that to
>> run on one node successfully you did "ulimit -s unlimited".
>> This may be needed on *all nodes* running WRF.
>> It doesn't propagate to the other nodes if you set it on the command line
>> or put the ulimit command in your job script, for instance.
>> It can be done in the resource manager (Torque, SGE, SLURM) startup
>> script, or in the Linux limit configuration files.
>> Maybe your system administrator can help you with this.
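>>
>> For example (just a sketch; whether it takes effect for remotely started
>> MPI processes depends on your distribution and on sshd going through PAM),
>> a persistent, cluster-wide setting would be a couple of lines like these
>> in /etc/security/limits.conf on every node:
>>
>>   # lift the stack size limit for all users
>>   *    soft    stack    unlimited
>>   *    hard    stack    unlimited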
>>
>> FYI, a number of large atmosphere/ocean/climate models we
>> run produce a large program stack, oftentimes larger than the
>> default limit set by Linux, and do require the change above on
>> all nodes.
>> (I haven't run WRF, though.)
>>
>> Also, to sort out if your network has a problem, you may want to try
>> something simpler than WRF.
>> The cpi.c program in the MPICH2 'examples' directory is good for this.
>> Compile with mpicc, run with mpirun on all nodes.
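>>
>> For example, something along these lines (a sketch only: adjust the paths,
>> the hosts file, and the process count to your cluster; with MPICH2-1.2 the
>> mpd ring has to be up first):
>>
>>   cd /path/to/mpich2-1.2/examples   # wherever the MPICH2 examples live
>>   mpicc -o cpi cpi.c                # same Intel-built MPICH2 used for WRF
>>   mpdboot -n 2 -f ~/mpd.hosts       # start mpd on both machines
>>   mpiexec -n 8 ./cpi                # 8 processes spread over the two nodes
>>   mpdallexit                        # shut the mpd ring down afterwards
>>
>> If cpi also fails across machines, the problem is in the MPICH2 setup or
>> the network; if it runs cleanly, the problem is more likely in the Intel
>> build of WRF itself.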
>>
>> I hope it helps.
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>>
>>
>> Rajeev Thakur wrote:
>>
>>> Try running the cpi example from the MPICH2 examples directory across two
>>> machines. There could be a connection issue between the two machines.
>>>
>>> Rajeev
>>>
>>> On Sep 2, 2010, at 8:20 AM, Fabio F.Gervasi wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a "strange" MPI problem when I run a WRF-NMM model compiled with
>>>> Intel v11.1.072 (the GNU build runs fine!).
>>>> Mpich2-1.2 is also compiled with Intel.
>>>>
>>>> If I run on a single quad-core machine everything is ok, but when I try
>>>> on two or more quad-core machines,
>>>> the wrf.exe processes initially seem to start on every PC, but after a few
>>>> seconds wrf stops and I get the error:
>>>> "Fatal error in MPI_Allreduce: Other MPI error, error stack:" ... and so on...
>>>>
>>>> I already set "ulimit -s unlimited"; otherwise wrf crashes even on a
>>>> single machine...
>>>>
>>>> This is probably an MPI problem, but how can I fix it?
>>>>
>>>> Thank you very much
>>>> Fabio.
>
>