[mpich-discuss] how to deal with these errors?

Gus Correa gus at ldeo.columbia.edu
Tue Dec 22 15:30:40 CST 2009


Hi Liu

As I mentioned (probably to you) in the CCSM3 forum:

**

Regarding 1),
the "Invalid communicator" error is often produced by the use
of a wrong mpi.h or mpif.h include files, i.e.,
include files from another MPI that may be in your system.

If you search the archives of this mailing list, or the OpenMPI
mailing list archives, you will find other postings reporting this error.

For instance, in one of our computers here, the MPICH-1
mpi.h has this:

#define MPI_COMM_WORLD 91

whereas the MPICH2 mpi.h has something else:

#define MPI_COMM_WORLD ((MPI_Comm)0x44000000)

As you can see, even MPI_COMM_WORLD is different on MPICH-1 and MPICH2.
You cannot patch this by hand.
You must use the correct mpi.h/mpif.h, associated with your
mpicc and mpif90.
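
If you want a quick sanity check that the mpi.h your wrapper picks
up is consistent with the library it links, you can build and run a
trivial MPI program with the very same wrapper you use for CCSM3.
This is just a minimal sketch (the file name hello_mpi.c is only an
example):

#include <stdio.h>
#include <mpi.h>

/* With a matching header and library this prints one line per rank.
 * With a mismatched mpi.h it typically aborts with an
 * "Invalid communicator" error much like the one you reported. */
int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d is alive\n", rank, size);
    MPI_Finalize();
    return 0;
}

Compile it with, say, "mpicc hello_mpi.c -o hello_mpi" and run it
with "mpiexec -n 2 ./hello_mpi", using the full path to the MPICH2
mpicc and mpiexec if there is any chance another MPI comes first in
your PATH.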

You may want to recompile everything from scratch.
Object files and modules that were built with the wrong mpi.h
will only cause you headaches, and the "Invalid communicator"
error will never go away.
Get rid of them before you restart.
Do make clean/cleanall, or make cleandist.
Even better: simply start from a fresh tarball.

To compile, you should preferably use the MPICH2
compiler wrappers mpif90 and mpicc.
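
Once everything is rebuilt with the MPICH2 wrappers, one more way to
spot a header/library mismatch is to compare the version macros from
mpi.h with what the linked library reports at run time. Again, just
a sketch, nothing CCSM3-specific:

#include <stdio.h>
#include <mpi.h>

int main(void)
{
    int version, subversion;

    /* MPI_Get_version may be called before MPI_Init. */
    MPI_Get_version(&version, &subversion);
    /* MPI_VERSION/MPI_SUBVERSION come from the mpi.h seen at compile
     * time; the call above reports what the linked library supports.
     * If the two disagree, you are mixing MPI installations. */
    printf("mpi.h says MPI %d.%d, library says MPI %d.%d\n",
           MPI_VERSION, MPI_SUBVERSION, version, subversion);
    return 0;
}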

Wherever the CCSM3 Makefiles point to MPI include files,
make sure the directories are those of MPICH2, not any other MPI.

Likewise for the MPI library directories:
they must be those associated with MPICH2.

To save yourself headaches, you can use full path names to
the MPICH2 mpicc and mpif90.

You may need to compile the ESMF library separately,
as its makefiles seem to be hardwired not to use the MPI compiler
wrappers.

**

As for 2), CCSM3 is an MPMD program with 5 executables.
It cannot work correctly if you delete one of them.
You actually eliminated the flux coupler, which coordinates
the work of the other four components.
The other components only talk to the coupler.
Therefore, what probably happens
is that the other four executables are waiting
forever for the flux coupler to answer.

**

As for 3), besides requiring a substantial number of CPUs,
CCSM3 also needs a significant amount of memory.
On how many nodes, and with how much memory on each,
are you trying to run the job?
Which resolution (T42, T31, T85)?

In any case, increasing the number of processors
will not make the MPI error of 1) go away;
that requires using the correct mpi.h.

**

Only question 1) is a general MPI/MPICH question.
Questions 2) and 3) are specific CCSM3 issues.
It may be more productive to discuss them in the CCSM3 forum.

In any case, let's hope you can get additional help here also.

**

I hope this helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

LSB wrote:
> Hi everyone,
>  
> I want to run Community Climate System Model on our machine under 
> MPICH2. I compiled it successfully. However, I got some error message 
> about MPI while running it.
>  
> 1) In the run script, I asked for 32 CPUs (using the PBS batch system). After 
> starting up mpd daemons, I wrote " 
> /mnt/storage-space/disk1/mpich/bin/mpiexec -l -n 2 $EXEROOT/all/cpl : -n 
> 2 $EXEROOT/all/csim : -n 8 $EXEROOT/all/clm : -n 4 $EXEROOT/all/pop : -n 
> 16 $EXEROOT/all/cam" . 
> The process exits quite quickly after I qsub it, with error
> messages like:
> rank 5 in job 1  compute-0-10.local_46741   caused collective abort of 
> all ranks
>   exit status of rank 5: return code 1
> AND
> 14: Fatal error in MPI_Cart_shift: Invalid communicator, error stack:
> 14: MPI_Cart_shift(172): MPI_Cart_shift(MPI_COMM_NULL, direction=1, 
> displ=1, source=0x2582aa0, dest=0x2582aa4) failed
> 14: MPI_Cart_shift(80).: Null communicator
> 15: Fatal error in MPI_Cart_shift: Invalid communicator, error stack:
> 15: MPI_Cart_shift(172): MPI_Cart_shift(MPI_COMM_NULL, direction=1, 
> displ=1, source=0x2582aa0, dest=0x2582aa4) failed
> 5: Assertion failed in file helper_fns.c at line 337: 0
> 15: MPI_Cart_shift(80).: Null communicator
> 5: memcpy argument memory ranges overlap, dst_=0xf2c37f4 src_=0xf2c37f4 
> len_=4
> 9: Assertion failed in file helper_fns.c at line 337: 0
> 5:
> 9: memcpy argument memory ranges overlap, dst_=0x1880ce64 
> src_=0x1880ce64 len_=4
> 5: internal ABORT - process 5
> 9:
> 9: internal ABORT - process 9
> 4: Assertion failed in file helper_fns.c at line 337: 0
> 4: memcpy argument memory ranges overlap, dst_=0x1c9615d0 
> src_=0x1c9615d0 len_=4
> 4:
> 4: internal ABORT - process 4
> 
> 2) What quite puzzled me is that if I delete any one of the five (cpl, 
> csim, clm, pop, cam), the model runs successfully. For example, if I 
> delete "cpl", then "
> /mnt/storage-space/disk1/mpich/bin/mpiexec -l -n 2 $EXEROOT/all/csim : 
> -n 8 $EXEROOT/all/clm : -n 4 $EXEROOT/all/pop : -n 16 $EXEROOT/all/cam" 
> will be ok.
> But if I run all five at the same time, the error messages 
> mentioned above appear.
>  
> 3) I guessed that asking for a few more CPUs might make things better, so 
> I gave it a try: I asked for 34 CPUs but still used 2+2+8+4+16=32 of them. 
> The MPI error message still appears.
>  
> How should I solve this problem?
> Can anyone give some suggestions?
>  
> Thanks in advance!
>  
>  
> L. S
> 

