[mpich-discuss] Re: MPI Fatal error, but only with more cluster nodes!

Dr. Qian-Lin Tang qltang at xidian.edu.cn
Fri Sep 3 04:52:26 CDT 2010


Hi, all,
 
I receive many emails from mpich-discuss at mcs.anl.gov every day,
which is quite disruptive. Can anyone tell me how to remove my user ID
from the MPI forum, i.e. unsubscribe from this list?
 
Thanks,
 
Qian-Lin Tang 

-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Fabio F.Gervasi
Sent: September 3, 2010 15:57
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] MPI Fatal error, but only with more cluster
nodes!


Thank you Rajeev and Gus!

I have no network connection problems: I tried running the WRF build
compiled with GNU gcc/gfortran and everything is fine
(I can also run 5 PCs together).

So, all other things being equal, I have problems (on more than 1 PC) only
when I use the Intel-compiled WRF version (wrf stops with MPI errors).
Also, before starting wrf.exe, I need to run another executable
(real_nmm.exe), which runs correctly with both WRF builds.

About ulimit: I set "ulimit -s unlimited" on each machine.
In fact, I put this setting in the .bash_profile of each of them, and I
also verified it with "ulimit -a" on each PC.
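
One caveat: .bash_profile is not necessarily sourced by the non-interactive
shells the MPI launcher uses to start processes on the remote nodes, so
"ulimit -a" in a login shell may not show what the launched ranks actually
get. A minimal check (just a sketch, assuming a POSIX system; check_stack.c
is not part of WRF or MPICH) that prints the effective stack limit seen by
each rank:

/* check_stack.c - print the stack limit each MPI rank actually sees */
#include <mpi.h>
#include <stdio.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    struct rlimit rl;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    if (getrlimit(RLIMIT_STACK, &rl) == 0) {
        if (rl.rlim_cur == RLIM_INFINITY)
            printf("rank %d on %s: stack limit = unlimited\n", rank, host);
        else
            printf("rank %d on %s: stack limit = %lu bytes\n",
                   rank, host, (unsigned long) rl.rlim_cur);
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run with mpirun across all the machines, every rank
should report "unlimited" if the setting really reached the remote processes.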

So, it seems to be purely a "WRF Intel version <-> MPI" problem...
For this reason it's driving me crazy! :-(

Thank you!
Fabio.



2010/9/2 Gus Correa <gus at ldeo.columbia.edu>


Hi Fabio

In addition to Rajeev's suggestion:

You mentioned some "error stack" and that to
run on one node successfully you did "ulimit -s unlimited".
This may be needed on *all nodes* running WRF.
It doesn't propagate to the other nodes if you do it on the command
line or put the ulimit command in your job script, for instance.
It can be done in the resource manager (Torque, SGE, SLURM) startup
script, or in the Linux limit configuration files.
Maybe your system administrator can help you with this.
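
If touching the system configuration on every node is not an option, one
workaround sometimes used (just a sketch, assuming Linux/POSIX; the helper
name raise_stack_limit is made up here, and since WRF is Fortran it would
have to be called from a small C routine linked into the code) is to raise
the soft stack limit from inside the process at startup, before the model
allocates its large automatic arrays:

/* raise_stack.c - sketch: raise the soft stack limit to the hard limit.
 * A non-root process can only go as high as the hard limit, so the hard
 * limit itself must already be large enough on every node. */
#include <stdio.h>
#include <sys/resource.h>

int raise_stack_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;

    rl.rlim_cur = rl.rlim_max;   /* soft limit up to the hard limit */
    if (setrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("setrlimit(RLIMIT_STACK)");
        return -1;
    }
    return 0;
}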

FYI, a number of large atmosphere/ocean/climate models we
run produce a large program stack, often larger than the
default limit set by Linux, and do require the change above on
all nodes.
(I haven't run WRF, though.)

Also, to sort out if your network has a problem, you may want to try
something simpler than WRF.
The cpi.c program in the MPICH2 'examples' directory is good for this.
Compile with mpicc, run with mpirun on all nodes.
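
If cpi runs fine across the nodes but WRF still fails, a slightly more
targeted test is a minimal MPI_Allreduce call, since that is the routine
named in your error stack. A sketch (not one of the MPICH examples),
compiled with mpicc and run with mpirun across both machines just like cpi:

/* allreduce_test.c - minimal MPI_Allreduce smoke test */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank contributes its rank number; every rank should end up
       with the sum 0 + 1 + ... + (size-1) */
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d of %d: allreduce sum = %d (expected %d)\n",
           rank, size, sum, size * (size - 1) / 2);

    MPI_Finalize();
    return 0;
}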

I hope it helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
--------------------------------------------------------------------- 



Rajeev Thakur wrote:


Try running the cpi example from the MPICH2 examples directory across
two machines. There could be a connection issue between the two
machines.

Rajeev

On Sep 2, 2010, at 8:20 AM, Fabio F.Gervasi wrote:



Hi,

I have a "strange" MPI problem when I run a WRF-NMM model, compiled with
Intel v11.1.072 (by GNU run ok!).
Mpich2-1.2 also is compiled by Intel.

If I run on a single quad-core machine everything is OK, but when I try
on two or more quad-core machines, the wrf.exe processes initially seem
to start on every PC; after a few seconds, however, wrf stops and I get
the error: "Fatal error in MPI_Allreduce ... other MPI error ... error
stack ..." and so on.

I already set "ulimit -s unlimited"; otherwise wrf crashes even on a
single machine.

This is probably an MPI problem, but how can I fix it?

Thank you very much
Fabio.



_______________________________________________
mpich-discuss mailing list
mpich-discuss at mcs.anl.gov
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


