[mpich-discuss] mpich2 with Ubuntu 11.04 update with trunk

Darius Buntinas buntinas at mcs.anl.gov
Fri Feb 24 17:25:10 CST 2012


Sorry for the delay on this.

I'm not sure how else to debug this without having a copy of the code.  If you can create a short test program that gives the same error, it would help.

Otherwise, you can check that your application is calling collectives correctly using the collchk profiling library included in mpich2.  You can do this by adding -mpe=mpicheck to your mpicc, mpif77 or mpif90 line, e.g., "mpicc -mpe=mpicheck cpi.c -o cpi".

Also try running valgrind on each process and look for suspicious messages.  Make sure you configure mpich with --enable-g=all to avoid getting some unrelated warnings from valgrind.  You can run valgrind like this:

   mpiexec -n 4 -errfile-pattern vg-out valgrind my_pgm my_opt 

This will generate one file per process containing the valgrind output, called vg-out.XX, where XX is the rank of the process.

Google valgrind for more info on the output and its options.
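
If there are many ranks, a small script can do a first pass over the per-process files and point at the ones worth reading. The following is only a rough sketch: it assumes the files are named vg-out.XX as described above and that each complete valgrind run ends with its usual "ERROR SUMMARY: N errors from M contexts" line. The function name summarize_valgrind_logs is made up for illustration and is not part of mpich2 or valgrind.

```python
import glob
import re

# Matches valgrind's closing line, e.g.
# "==1234== ERROR SUMMARY: 3 errors from 2 contexts (suppressed: 0 from 0)"
SUMMARY_RE = re.compile(r"ERROR SUMMARY: (\d+) errors? from (\d+) contexts?")


def summarize_valgrind_logs(pattern="vg-out.*"):
    """Return {filename: error_count} for every log matching `pattern`.

    Hypothetical helper. A file without an ERROR SUMMARY line (for
    example because the process was killed) maps to None.
    """
    results = {}
    for path in sorted(glob.glob(pattern)):
        count = None
        with open(path, errors="replace") as fh:
            for line in fh:
                match = SUMMARY_RE.search(line)
                if match:
                    # Keep the last summary seen in the file.
                    count = int(match.group(1))
        results[path] = count
    return results


if __name__ == "__main__":
    for path, count in summarize_valgrind_logs().items():
        if count is None:
            print(f"{path}: no ERROR SUMMARY line (process may have been killed)")
        elif count > 0:
            print(f"{path}: {count} errors reported, inspect this file")
        else:
            print(f"{path}: clean")
```

Run it from the directory holding the vg-out.XX files. A nonzero count only says which rank's file to open first; the actual messages in that file still need to be read by hand.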

-d


On Feb 11, 2012, at 2:15 PM, Konstantinos Varotsos wrote:

> 
> Here is the error after adding
> 
> MPICH_ENABLE_COLL_FT_RET=0
> 
> [4] Fatal error in PMPI_Bcast: Other MPI error, error stack:
> [4] PMPI_Bcast(1479)......: MPI_Bcast(buf=0x200d468, count=1, MPI_CHAR, root=0, comm=0x84000004) failed
> [4] MPIR_Bcast_impl(1322).:
> [4] MPIR_Bcast_intra(1121):
> [4] MPIR_SMP_Bcast(1039)..: Failure during collective
> [0] Internal Error: invalid error code 489d36 (Ring ids do not match) in PMPI_Barrier:425
> [0] Fatal error in PMPI_Barrier: Other MPI error, error stack:
> [0] PMPI_Barrier(425): MPI_Barrier(comm=0x84000004) failed
> 
> 
> Thanks,
> 
> 
> Kwstas
> 
> 
> On 02/10/2012 11:44 PM, Darius Buntinas wrote:
>> Unfortunately, the log files stopped before the error, probably due to file buffering.
>> 
>> Add this before mpiexec:
>>     MPICH_ENABLE_COLL_FT_RET=0
>> Also, add the -l flag to mpiexec.
>> 
>> e.g., MPICH_ENABLE_COLL_FT_RET=0 mpiexec -l -n 8 ...
>> 
>> Thanks,
>> -d
>> 
>> 
>> On Feb 9, 2012, at 4:54 AM, Konstantinos Varotsos wrote:
>> 
>>> Hi Darius,
>>> 
>>> 
>>> I recompiled mpich2 as you suggested
>>> 
>>> MPICH2 Version:        1.5a2
>>> MPICH2 Release date:    Tue Feb  7 00:00:57 CST 2012
>>> MPICH2 Device:        ch3:nemesis
>>> MPICH2 configure:     --enable-g=all --prefix=/mirror/mpiuser/mpich2-install
>>> MPICH2 CC:     gcc    -g -O2
>>> MPICH2 CXX:     c++   -g -O2
>>> MPICH2 F77:     gfortran   -g -O2
>>> MPICH2 FC:     gfortran   -g -O2
>>> 
>>> 
>>> and the error I got is
>>> 
>>> 
>>> Assertion failed in file src/mpi/coll/helper_fns.c at line 713: status->MPI_TAG == recvtag
>>> internal ABORT - process 4
>>> Internal Error: invalid error code 489d36 (Ring ids do not match) in PMPI_Barrier:425
>>> Fatal error in PMPI_Barrier: Other MPI error, error stack:
>>> PMPI_Barrier(425).......: MPI_Barrier(comm=0x84000002) failed
>>> MPIR_Barrier_impl(306)..:
>>> MPIR_Bcast_impl(1322)...:
>>> MPIR_Bcast_intra(1156)..:
>>> MPIR_Bcast_binomial(213): Failure during collective
>>> Fatal error in PMPI_Barrier: Other MPI error, error stack:
>>> PMPI_Barrier(425).......: MPI_Barrier(comm=0x84000002) failed
>>> MPIR_Barrier_impl(306)..:
>>> MPIR_Bcast_impl(1322)...:
>>> MPIR_Bcast_intra(1156)..:
>>> MPIR_Bcast_binomial(213): Failure during collective
>>> Fatal error in PMPI_Barrier: Other MPI error, error stack:
>>> PMPI_Barrier(425).......: MPI_Barrier(comm=0x84000002) failed
>>> MPIR_Barrier_impl(306)..:
>>> MPIR_Bcast_impl(1322)...:
>>> MPIR_Bcast_intra(1156)..:
>>> MPIR_Bcast_binomial(213): Failure during collective
>>> Fatal error in PMPI_Barrier: Other MPI error, error stack:
>>> PMPI_Barrier(425): MPI_Barrier(comm=0x84000004) failed
>>> 
>>> 
>>> 
>>> I am also sending the log. (I had to kill the process because it didn't exit.)
>>> 
>>> 
>>> 
>>> Thanx Kwstas
>>> 
>>> <log.tar.gz>
> 


