[mpich-discuss] how to deal with these errors?

Gus Correa gus at ldeo.columbia.edu
Wed Dec 23 20:12:44 CST 2009


Hi Liu

Suggestions:

1) Make sure mpicc and mpif90 are actually the MPICH2 versions.
You can check this out with "which mpicc"/"which mpif90".
I suppose they should be in /mnt/storage-space/disk1/mpich/bin,
right?
If "which" doesn't show them right,
then just use the full path name to them in the "Macros.Linux" file,
something like /mnt/storage-space/disk1/mpich/bin/mpicc, etc.
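
For example, a quick sanity check from the shell could look like this
(a sketch; the install prefix /mnt/storage-space/disk1/mpich is taken
from your Macros.Linux, so adjust it if yours differs):

  # Confirm the wrappers on your PATH come from the MPICH2 install tree
  which mpicc
  which mpif90

  # "-show" prints the underlying compile command, including the -I/-L
  # paths the wrapper adds, so you can see which mpi.h and MPI library
  # each wrapper will actually use
  /mnt/storage-space/disk1/mpich/bin/mpicc -show
  /mnt/storage-space/disk1/mpich/bin/mpif90 -show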

2) Put "-I/mnt/storage-space/disk1/mpich/include" first.
I suppose this is where your MPICH2 is installed,
and you want it to come first.

You can probably remove "-I/usr/local/include -I/usr/include".
These directories are probably searched by the compiler anyway.
However, they may have some old MPICH1 mpi.h/mpif.h.
You can check this with the "find" command (see the sketch below).

Another possibility is to use only
INCLDIR    := -I. -I${INCROOT},
and let mpicc and mpif90 do their job undisturbed.
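
For instance, both checks might look like this (a sketch; the
directories and the INCLDIR line are the ones from your Macros.Linux):

  # Look for stray MPI headers left over from another MPI installation
  find /usr/include /usr/local/include -name mpi.h
  find /usr/include /usr/local/include -name mpif.h

  # Minimal INCLDIR: let mpicc/mpif90 supply their own MPI include path
  INCLDIR    := -I. -I${INCROOT}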

3) Also the ESMF library is always a problem,
because somehow ESMF does not use the
MPI compiler wrappers.
Check whether ESMF was already compiled, and if so do a "make cleanall" in
its top directory.
It may have some residual object files compiled with the wrong
MPI include files, which may be causing your troubles.

You may need to compile ESMF separately, and make sure that
the ESMF makefiles (check them in the ESMF directory tree)
point to your MPICH2 include directories also:
"-I/mnt/storage-space/disk1/mpich/include"
Likewise, you may need something like
LIBS = -L/mnt/storage-space/disk1/mpich/lib
in those ESMF Makefiles, to ensure that the right MPICH2
libraries are used (see the sketch below).
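
A rough sketch of what that rebuild could look like (the exact file and
variable names inside the ESMF tree may differ in your ESMF version, so
treat them as placeholders to search for rather than as a recipe):

  # From the ESMF top directory: throw away objects that were built
  # against the wrong mpi.h before rebuilding
  make cleanall

  # Then, in the ESMF build configuration files, make sure the include
  # and library search paths point at MPICH2, roughly:
  #   ...INCLUDES/CPPFLAGS:  -I/mnt/storage-space/disk1/mpich/include
  #   ...LIBS/LDFLAGS:       -L/mnt/storage-space/disk1/mpich/lib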

I hope this helps,
Gus Correa

LS wrote:
> Hi Rajeev,
>  
> Thanks for your reply.
> This is my Macros.Linux file, which sets the directories and compile arguments.
>  
> INCLDIR    := -I. -I/usr/local/include -I/usr/include -I${INCROOT} -I/mnt/storage-space/disk1/mpich/include
> SLIBS      := -L/mnt/storage-space/disk1/mpich/lib -L/usr/local/lib -lnetcdf -llapack -lblas
> ULIBS      := -L$(LIBROOT) -lesmf -lmct -lmpeu -lmph
> CPP        := NONE
> CPPFLAGS   := -DLINUX -DPGF90 -DNO_SHR_VMATH
> CPPDEFS    := -DLINUX
> CC         := mpicc
> CFLAGS     := -c
> ifeq ($(CC),pgcc)
>    CFLAGS  += -fast
> else
>    CFLAGS  += -DUSE_GCC
> endif
> FIXEDFLAGS :=
> FREEFLAGS  := -Mfree
> FC         := mpif90
> FFLAGS     := -c -r8 -i4 -Kieee -Mrecursive -Mdalign -Mextend
> #            ;  -g -Ktrap=fp -Mbounds
> MOD_SUFFIX := mod
> LD         := $(FC)
> #LDFLAGS    := -L/usr/local/gm/lib -lgm -lpthread
> #LDFLAGS    := -L/usr/local/gm/lib -lgm 
> 
>  
> I did use mpif90 and mpicc. Why do these errors still occur?
>  
> Liu. S
> ------------------------------------------------------------------------
> From: thakur at mcs.anl.gov
> To: mpich-discuss at mcs.anl.gov
> Date: Wed, 23 Dec 2009 08:41:35 -0600
> Subject: Re: [mpich-discuss] how to deal with these errors?
> 
> You need to compile with mpif77 or mpif90, not just f77 or f90.
>  
> Rajeev
> 
>     ------------------------------------------------------------------------
>     *From:* mpich-discuss-bounces at mcs.anl.gov
>     [mailto:mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of *LS
>     *Sent:* Wednesday, December 23, 2009 6:04 AM
>     *To:* MPICH
>     *Subject:* Re: [mpich-discuss] how to deal with these errors?
> 
>     Hi Rajeev,
>      
>     I just gave it a try and deleted the mpif.h file in the application
>     directories. But then I cannot even compile the model
>     successfully.
>      
>      
>     Liu. S 
>     ------------------------------------------------------------------------
>     From: thakur at mcs.anl.gov
>     To: mpich-discuss at mcs.anl.gov
>     Date: Wed, 23 Dec 2009 02:54:14 -0600
>     Subject: Re: [mpich-discuss] how to deal with these errors?
> 
>     Make sure there is no mpif.h file in any of the application directories.
>      
>     Rajeev
> 
>         ------------------------------------------------------------------------
>         *From:* mpich-discuss-bounces at mcs.anl.gov
>         [mailto:mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of *LSB
>         *Sent:* Wednesday, December 23, 2009 2:39 AM
>         *To:* mpich-discuss at mcs.anl.gov
>         *Subject:* Re: [mpich-discuss] how to deal with these errors?
> 
>         Hi Correa,
>          
>         Thank you for your reply. Nice to meet you here again, haha.
>         I have reorganized my questions and posted them on the CGD forum.
> 
>         I have edited the relevant directories for the MPI include files
>         and library after you answered me through the ccsm mailing list.
>         However, the "invalid communicator" error is still there. I am sure my
>         CCSM3 Makefiles now point to the MPICH2 include files, and the
>         library directories have also been changed to the MPICH2 ones.
>         Is there any other reason that could lead to this problem?
>          
>         I said that I can "run" CCSM3 by deleting any one of the five
>         components. I in fact meant that in that situation I can see
>         the process names with the command "top". But if I run all
>         five, I cannot see the process names with "top", and the
>         error message mentioned before appears.
>          
>         The resolution I use is T31_gx3v5. I asked for two nodes (each
>         node with 16 GB of memory and 16 CPUs). But I am not sure whether
>         these resources are enough for running CCSM3.
>          
>         Thanks for your help.
>          
>         Liu. S
>          
>          
>          
>          > Date: Tue, 22 Dec 2009 16:30:40 -0500
>          > From: gus at ldeo.columbia.edu
>          > To: mpich-discuss at mcs.anl.gov
>          > Subject: Re: [mpich-discuss] how to deal with these errors?
>          >
>          > Hi Liu
>          >
>          > As I mentioned, probably to you, in the CCSM3 forum:
>          >
>          > **
>          >
>          > Regarding 1),
>          > the "Invalid communicator" error is often produced by the use
>          > of a wrong mpi.h or mpif.h include files, i.e.,
>          > include files from another MPI that may be in your system.
>          >
>          > If you search this mailing list archives, or the OpenMPI
>         mailing list
>          > archives, you will find other postings reporting this error.
>          >
>          > For instance, on one of our computers here, the MPICH-1
>          > mpi.h has this:
>          > #define MPI_COMM_WORLD 91
>          >
>          > whereas the MPICH2 mpi.h has something else:
>          >
>          > #define MPI_COMM_WORLD ((MPI_Comm)0x44000000)
>          >
>          > As you can see, even MPI_COMM_WORLD is different on MPICH-1
>         and MPICH2.
>          > You cannot patch this by hand.
>          > You must use the correct mpi.h/mpif.h, associated to your
>          > mpicc and mpif90.
>          >
>          > You may want to compile everything again fresh.
>          > Object files and modules that were built with the wrong mpi.h
>          > will only cause you headaches, and the "Invalid communicator"
>          > error will never go away.
>          > Get rid of them before you restart.
>          > Do make clean/cleanall, or make cleandist.
>          > Even better: simply start from a fresh tarball.
>          >
>          > To compile, you should preferably use the MPICH2
>          > compiler wrappers mpif90 and mpicc.
>          >
>          > Wherever the CCSM3 Makefiles point to MPI include files,
>          > make sure the directories are those of MPICH2, not any other
>         MPI.
>          >
>          > Likewise for the MPI library directories:
>          > they must be those associated to MPICH2.
>          >
>          > To save you headaches, you can use full path names to
>          > the MPICH2 mpicc and mpif90.
>          >
>          > You may need to compile the ESMF library separately,
>          > as their makefiles seem to be hardwired not to use the MPI
>         compiler
>          > wrappers.
>          >
>          > **
>          >
>          > As for 2), CCSM3 is an MPMD program with 5 executables.
>          > It cannot work correctly if you delete one of them.
>          > You actually eliminated the flux coupler, which coordinates
>          > the work of all other four components.
>          > The other components only talk to the coupler.
>          > Therefore, what probably happens
>          > is that the other four executables are waiting
>          > forever for the flux coupler to answer.
>          >
>          > **
>          >
>          > As for 3), besides requiring a substantial number of CPUs,
>          > CCSM3 also needs a significant amount of memory.
>          > On how many nodes, and with how much memory on each,
>          > are you trying to run the job?
>          > Which resolution (T42, T31, T85)?
>          >
>          > In any case, increasing the number of processors
>          > will not solve the MPI error message of 1),
>          > which requires using the correct mpi.h.
>          >
>          > **
>          >
>          > Only question 1) is a general MPI/MPICH question.
>          > Questions 2) and 3) are specific CCSM3 issues.
>          > It may be more productive to discuss them in the CCSM3 forum.
>          >
>          > In any case, let's hope you can get additional help here also.
>          >
>          > **
>          >
>          > I hope this helps.
>          > Gus Correa
>          >
>          > ---------------------------------------------------------------------
>          > Gustavo Correa
>          > Lamont-Doherty Earth Observatory - Columbia University
>          > Palisades, NY, 10964-8000 - USA
>          > ---------------------------------------------------------------------
>          >
>          > LSB wrote:
>          > > Hi everyone,
>          > >
>          > > I want to run the Community Climate System Model on our machine
>         under
>          > > MPICH2. I compiled it successfully. However, I got some
>         error messages
>          > > about MPI while running it.
>          > >
>          > > 1) In the run script, I asked for 32 CPUs (using the PBS batch
>         system). After
>          > > starting up the mpd daemons, I wrote "
>          > > /mnt/storage-space/disk1/mpich/bin/mpiexec -l -n 2
>         $EXEROOT/all/cpl : -n
>          > > 2 $EXEROOT/all/csim : -n 8 $EXEROOT/all/clm : -n 4
>         $EXEROOT/all/pop : -n
>          > > 16 $EXEROOT/all/cam" .
>          > > The process ends quite quickly after I qsub it, with
>         error messages
>          > > like:
>          > > rank 5 in job 1 compute-0-10.local_46741 caused collective
>         abort of
>          > > all ranks
>          > > exit status of rank 5: return code 1
>          > > AND
>          > > 14: Fatal error in MPI_Cart_shift: Invalid communicator,
>         error stack:
>          > > 14: MPI_Cart_shift(172): MPI_Cart_shift(MPI_COMM_NULL,
>         direction=1,
>          > > displ=1, source=0x2582aa0, dest=0x2582aa4) failed
>          > > 14: MPI_Cart_shift(80).: Null communicator
>          > > 15: Fatal error in MPI_Cart_shift: Invalid communicator,
>         error stack:
>          > > 15: MPI_Cart_shift(172): MPI_Cart_shift(MPI_COMM_NULL,
>         direction=1,
>          > > displ=1, source=0x2582aa0, dest=0x2582aa4) failed
>          > > 5: Assertion failed in file helper_fns.c at line 337: 0
>          > > 15: MPI_Cart_shift(80).: Null communicator
>          > > 5: memcpy argument memory ranges overlap, dst_=0xf2c37f4
>         src_=0xf2c37f4
>          > > len_=4
>          > > 9: Assertion failed in file helper_fns.c at line 337: 0
>          > > 5:
>          > > 9: memcpy argument memory ranges overlap, dst_=0x1880ce64
>          > > src_=0x1880ce64 len_=4
>          > > 5: internal ABORT - process 5
>          > > 9:
>          > > 9: internal ABORT - process 9
>          > > 4: Assertion failed in file helper_fns.c at line 337: 0
>          > > 4: memcpy argument memory ranges overlap, dst_=0x1c9615d0
>          > > src_=0x1c9615d0 len_=4
>          > > 4:
>          > > 4: internal ABORT - process 4
>          > >
>          > > 2) What quite puzzled me is that if I delete any one of
>         the five (cpl,
>          > > csim, clm, pop, cam), the model can run successfully.
>         For example,
>          > > if I delete "cpl" and write "
>          > > /mnt/storage-space/disk1/mpich/bin/mpiexec -l -n 2
>         $EXEROOT/all/csim :
>          > > -n 8 $EXEROOT/all/clm : -n 4 $EXEROOT/all/pop : -n 16
>         $EXEROOT/all/cam"
>          > > it will be OK.
>          > > But if I run all of the five at the same time, the error
>         message
>          > > mentioned above will appear.
>          > >
>          > > 3) If I ask for a few more CPUs, things might get better, I
>         guessed. So I
>          > > gave it a try: I asked for 34 CPUs but still used 2+2+8+4+16=32
>         CPUs; the MPI
>          > > error message still exists.
>          > >
>          > > How should I solve this problem?
>          > > Can anyone give some suggestions?
>          > >
>          > > Thanks in advance!
>          > >
>          > >
>          > > L. S
>          > >
>          > _______________________________________________
>          > mpich-discuss mailing list
>          > mpich-discuss at mcs.anl.gov
>          > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> 
> 
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


