[mpich-discuss] Re: MPI Fatal error, but only with more cluster nodes!

Jayesh Krishna jayesh at mcs.anl.gov
Fri Sep 3 05:43:50 CDT 2010


Hi,
 You can unsubscribe from the list at https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

Regards,
Jayesh
----- Original Message -----
From: "Dr. Qian-Lin Tang" <qltang at xidian.edu.cn>
To: mpich-discuss at mcs.anl.gov
Sent: Friday, September 3, 2010 4:52:26 AM GMT -06:00 US/Canada Central
Subject: [mpich-discuss] Re: MPI Fatal error, but only with more cluster nodes!


Hi, all, 

I always receive many emails from mpich-discuss at mcs.anl.gov every day, and I find this disruptive. Can anyone tell me how to delete my user ID from the MPI forum?

Thanks, 

Qian-Lin Tang 



-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Fabio F.Gervasi
Sent: September 3, 2010 15:57
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] MPI Fatal error, but only with more cluster nodes!

Thank you Rajeev and Gus! 

I have no problems with the network connection, because
I tried running the WRF GNU gcc/gfortran build
and everything is ok (I can also run across 5 PCs together).

So, all else being equal, I only have problems (on more than 1 PC) when I use the Intel-compiled WRF version (wrf stops with MPI errors).
Besides, before starting wrf.exe I need to run another executable (real_nmm.exe), which runs correctly with both WRF builds.

About ulimit, I set "ulimit -s unlimited" on each machine:
in fact I put this setting in their .bash_profile, and
I also verified it with ulimit -a on each PC.
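
One way to double-check this (just a sketch: "node2" is a placeholder hostname, and the mpiexec line assumes the MPD ring is already running across the machines): bash reads ~/.bash_profile only for login shells, so the limit seen by processes that MPICH2 launches remotely can differ from what an interactive ulimit -a reports.

  # "node2" is a placeholder; repeat for each remote machine.
  # Stack limit in a non-interactive ssh session (no .bash_profile sourced):
  ssh node2 'ulimit -s'

  # Stack limit actually inherited by MPI-launched processes
  # (assumes the MPD ring already spans both machines):
  mpiexec -n 2 bash -c 'hostname; ulimit -s'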

So, it seems to be purely a "WRF Intel version <-> MPI" problem...
For this reason I'm going crazy! :-(

Thank you! 
Fabio. 



2010/9/2 Gus Correa < gus at ldeo.columbia.edu > 


Hi Fabio 

In addition to Rajeev's suggestion:

You mentioned some "error stack" and that to 
run on one node successfully you did "ulimit -s unlimited". 
This may be needed on *all nodes* running WRF. 
It doesn't propagate to the other nodes if you set it on the command line or put the ulimit command in your job script, for instance.
It can be done in the resource manager (Torque, SGE, SLURM) startup 
script, or in the Linux limit configuration files. 
Maybe your system administrator can help you with this. 
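
For example (just a sketch, assuming a stock Linux node where sshd goes through PAM with pam_limits enabled; the exact policy and file are site-specific), the stack limit can be raised for all users in /etc/security/limits.conf on every node:

  # /etc/security/limits.conf on every compute node
  # (applied by pam_limits; "*" = all users -- narrow to a user/group if preferred)
  *    soft    stack    unlimited
  *    hard    stack    unlimited

New ssh logins, and the MPI processes they spawn, should then start with the higher limit without relying on .bash_profile.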

FYI, a number of large atmosphere/ocean/climate models we 
run produce a large program stack, often larger than the
default limit set by Linux, and do require the change above on 
all nodes. 
(I haven't run WRF, though.) 

Also, to sort out if your network has a problem, you may want to try 
something simpler than WRF. 
The cpi.c program in the MPICH2 'examples' directory is good for this. 
Compile with mpicc, run with mpirun on all nodes. 
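
Something along these lines should do it (a sketch, assuming MPICH2's default MPD process manager, a shared filesystem so ./cpi is visible on both machines, and an mpd.hosts file listing their hostnames; adjust names and process counts to your cluster):

  # from the MPICH2 examples directory on the head node
  mpicc -o cpi cpi.c

  # bring up the MPD ring across both machines (mpd.hosts lists their hostnames)
  mpdboot -n 2 -f mpd.hosts
  mpdtrace          # should print both hostnames

  # run 8 processes spread over the two quad-core machines
  mpiexec -n 8 ./cpi

If cpi fails across machines too, the problem is in the MPICH2/network setup rather than WRF; if it runs cleanly, the Intel-built WRF is the more likely suspect.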

I hope it helps. 
Gus Correa 
--------------------------------------------------------------------- 
Gustavo Correa 
Lamont-Doherty Earth Observatory - Columbia University 
Palisades, NY, 10964-8000 - USA 
--------------------------------------------------------------------- 





Rajeev Thakur wrote: 


Try running the cpi example from the MPICH2 examples directory across two machines. There could be a connection issue between the two machines. 

Rajeev 

On Sep 2, 2010, at 8:20 AM, Fabio F.Gervasi wrote: 



Hi, 

I have a "strange" MPI problem when I run a WRF-NMM model, compiled with Intel v11.1.072 (by GNU run ok!). 
Mpich2-1.2 also is compiled by Intel. 

If I run on a single quad-core machine everything is ok, but when I try on two or more quad-core machines,
the wrf.exe processes initially seem to start on every PC, but after a few seconds wrf stops and I get the error:
"Fatal error in MPI_Allreduce: Other MPI error, error stack...", and so on...

I just set: "ulimit -s unlimited", otherwise wrf crash also with a single machine... 

This is probably an MPI problem, but how can I fix it?

Thank you very much 
Fabio. 

_______________________________________________
mpich-discuss mailing list
mpich-discuss at mcs.anl.gov
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

