[mpich-discuss] how to deal with these errors?

LSB sslitbb at hotmail.com
Tue Dec 22 03:38:14 CST 2009


Hi everyone,
 
I want to run the Community Climate System Model (CCSM) on our machine under MPICH2. I compiled it successfully. However, I get some MPI error messages while running it.
 
1) In the run script, I asked for 32 cpus (using the PBS batch system). After starting up the mpd daemons, I ran:

    /mnt/storage-space/disk1/mpich/bin/mpiexec -l -n 2 $EXEROOT/all/cpl : -n 2 $EXEROOT/all/csim : -n 8 $EXEROOT/all/clm : -n 4 $EXEROOT/all/pop : -n 16 $EXEROOT/all/cam

The job aborts quite quickly after I qsub it, with error messages like:
rank 5 in job 1  compute-0-10.local_46741   caused collective abort of all ranks
  exit status of rank 5: return code 1 
AND
14: Fatal error in MPI_Cart_shift: Invalid communicator, error stack:
14: MPI_Cart_shift(172): MPI_Cart_shift(MPI_COMM_NULL, direction=1, displ=1, source=0x2582aa0, dest=0x2582aa4) failed
14: MPI_Cart_shift(80).: Null communicator
15: Fatal error in MPI_Cart_shift: Invalid communicator, error stack:
15: MPI_Cart_shift(172): MPI_Cart_shift(MPI_COMM_NULL, direction=1, displ=1, source=0x2582aa0, dest=0x2582aa4) failed
5: Assertion failed in file helper_fns.c at line 337: 0
15: MPI_Cart_shift(80).: Null communicator
5: memcpy argument memory ranges overlap, dst_=0xf2c37f4 src_=0xf2c37f4 len_=4
9: Assertion failed in file helper_fns.c at line 337: 0
5: 
9: memcpy argument memory ranges overlap, dst_=0x1880ce64 src_=0x1880ce64 len_=4
5: internal ABORT - process 5
9: 
9: internal ABORT - process 9
4: Assertion failed in file helper_fns.c at line 337: 0
4: memcpy argument memory ranges overlap, dst_=0x1c9615d0 src_=0x1c9615d0 len_=4
4: 
4: internal ABORT - process 4
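If I read the messages correctly, the "Null communicator" errors mean MPI_Cart_shift is being called on MPI_COMM_NULL, which MPI_Cart_create hands back to any rank it leaves out of the new grid, and the "memcpy argument memory ranges overlap" assertion (note that dst_ == src_ in the log) is MPICH2 complaining that the same buffer was passed as both the send and receive argument of a collective, where MPI_IN_PLACE should be used instead. Here is a minimal C sketch of both patterns; this is only my illustration, not CCSM code, and the grid shape is a placeholder:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Build a 2D grid over one fewer rank than we have, so the
           highest rank is deliberately left out and gets MPI_COMM_NULL. */
        int grid_size = (size > 1) ? size - 1 : 1;
        int dims[2] = {0, 0};
        int periods[2] = {0, 0};
        MPI_Dims_create(grid_size, 2, dims);

        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

        if (cart == MPI_COMM_NULL) {
            /* Calling MPI_Cart_shift(cart, ...) here would fail with
               exactly the "Null communicator" error in the log above. */
            printf("rank %d: not in the Cartesian grid, skipping shifts\n", rank);
        } else {
            int src, dest;
            MPI_Cart_shift(cart, 1, 1, &src, &dest);  /* safe: cart is valid */
            MPI_Comm_free(&cart);
        }

        /* Aliased send/receive buffers trigger the overlap assertion;
           MPI_IN_PLACE is the legal way to reduce into the same buffer. */
        int val = rank;
        MPI_Allreduce(MPI_IN_PLACE, &val, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and run on, say, 4 ranks, the last rank should report that it is outside the grid while the others shift safely.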

2) What quite puzzles me is that if I delete any one of the five components (cpl, csim, clm, pop, cam), the model runs successfully. For example, dropping "cpl" and running

    /mnt/storage-space/disk1/mpich/bin/mpiexec -l -n 2 $EXEROOT/all/csim : -n 8 $EXEROOT/all/clm : -n 4 $EXEROOT/all/pop : -n 16 $EXEROOT/all/cam

works fine. But if I run all five at the same time, the error messages above appear.
 
3) I guessed that asking for a few more cpus might help, so I tried requesting 34 cpus while still using only 2+2+8+4+16=32 of them, but the MPI error messages remain. A sketch of my batch setup follows.
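For reference, the batch script looks roughly like this; the ppn value, job name, and file names are placeholders, not my exact script:

    #!/bin/sh
    #PBS -N ccsm_run
    #PBS -l nodes=17:ppn=2   # 17*2 = 34 cpus requested (placeholder layout)

    cd $PBS_O_WORKDIR

    # one mpd per unique host in the PBS node file
    sort -u $PBS_NODEFILE > mpd.hosts
    NNODES=$(wc -l < mpd.hosts)
    /mnt/storage-space/disk1/mpich/bin/mpdboot -n $NNODES -f mpd.hosts

    # 2+2+8+4+16 = 32 ranks, i.e. fewer than the 34 cpus requested
    /mnt/storage-space/disk1/mpich/bin/mpiexec -l \
        -n 2  $EXEROOT/all/cpl  : \
        -n 2  $EXEROOT/all/csim : \
        -n 8  $EXEROOT/all/clm  : \
        -n 4  $EXEROOT/all/pop  : \
        -n 16 $EXEROOT/all/cam

    /mnt/storage-space/disk1/mpich/bin/mpdallexit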
 
How should I solve this problem? Can anyone give some suggestions?
 
Thanks in advance!
 
 
L. S