[mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce

Calin Iaru calin at dolphinics.com
Fri May 30 14:15:07 CDT 2008


Jayesh Krishna wrote:
>
> # Disable hyperthreading
>
It also crashes on single-CPU nodes, so there is no point in doing that.
>
> # Download the latest version of MPICH2 (available at 
> http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads). 
> Uninstall any existing version of MPICH2 in your system and install 
> the downloaded version.
>
done
>
> # Remove any modifications that you made to IMB. It would be best to 
> use a fresh download of IMB.
>
A fresh download runs. With TOL 0 and CHECK enabled it fails, so there is
no point in retrying.
>
> # Recompile IMB (Note that you should link your applications with 
> mpi.lib NOT mpich2.lib.)
>
Done, fails.
>
> # Rerun the allreduce benchmark (on the local machine - "mpiexec -n 2 
> imb-mpi1.exe allreduce")
>
You probably noticed that I am running IMB.exe in some of the log files I
sent you. I needed some additional link options to help me debug the
issue, and along with them I specified a new /out:IMB.exe. The different
name is just a typo.

I can do all of the above for a third time, or:

Inside IMB_chk_dadd, the following assignment takes place:

   for(rank = rank0; rank <= rank1; rank++)
   {
       for(i=0; i<Locsize/asize; i++)
           ((assign_type*)AUX)[i] += BUF_VALUE(rank,buf_pos/asize+i);
   }

With the code compiled as shown, the test reports data corruption.

If I change the loop to:

   for(rank = rank0; rank <= rank1; rank++)
   {
       for(i=0; i<Locsize/asize; i++) {
           assign_type x = BUF_VALUE(rank,buf_pos/asize+i);
           ((assign_type*)AUX)[i] += x;
       }
   }

then the test runs successfully.

Unless I did something wrong (which is still possible, however careful I
was), this looks like a compiler issue. I plan to compile with /FAsc to
get an assembly listing and continue investigating.




