[mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce

Jayesh Krishna jayesh at mcs.anl.gov
Fri May 30 12:39:52 CDT 2008


 Hi,
  I tried the allreduce test in IMB on a 64-bit machine with 2 dual core
procs (total 4 cores) and did not get any errors.
  Can you try the following?

# Disable hyperthreading.
# Download the latest version of MPICH2 (available at
http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads).
Uninstall any existing version of MPICH2 on your system and install the
downloaded version.
# Remove any modifications that you made to IMB. It would be best to use a
fresh download of IMB.
# Recompile IMB. (Note that you should link your applications with mpi.lib,
NOT mpich2.lib.)
# Rerun the allreduce benchmark on the local machine: "mpiexec -n 2
imb-mpi1.exe allreduce"

  Let us know the results.

Regards,
Jayesh

-----Original Message-----
From: Calin Iaru [mailto:calin at dolphinics.com] 
Sent: Friday, May 30, 2008 11:58 AM
To: Jayesh Krishna
Cc: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce

IMB is compiled from the Studio 2003 command prompt by launching "nmake -f
make_ict_win" and links to Program Files\mpich2\lib\mpich2.lib. The machine
where it runs has 2 CPUs with hyperthreading enabled and is also the build
machine. I ran it on both CPUs and on the same CPU by adding a
SetProcessAffinityMask call before MPI_Init.

I added some information like the hexadecimal representation of the
expected value and the hexadecimal representation of the difference
between the expected and the arrived value.
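
The kind of diagnostic described above can be sketched as follows. This is
only an illustration in Python, not the change that was made to the IMB
sources; the helper names float32_bits and describe_mismatch are hypothetical.
It prints the expected value, the received value and their difference
together with their single-precision bit patterns in hexadecimal.

```python
import struct


def float32_bits(value):
    """Return the IEEE-754 single-precision bit pattern of value as an int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def describe_mismatch(expected, got):
    """Format expected value, received value and difference with hex patterns.

    Hypothetical helper for illustration; IMB's own report format differs.
    """
    diff = expected - got
    return "expected %f (0x%08x), got %f (0x%08x), diff %e (0x%08x)" % (
        expected, float32_bits(expected),
        got, float32_bits(got),
        diff, float32_bits(diff),
    )


if __name__ == "__main__":
    # 2.3 is the buffer entry shown in the error output of the original report.
    print(describe_mismatch(2.3, 2.3))
```

Seeing the bit patterns side by side makes it possible to tell a difference
in the last bits of the mantissa apart from a completely wrong value, which
the "%f" output alone cannot show.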



Jayesh Krishna wrote:
>
>  Hi,
>   Please provide us as many details as possible so that we can help 
> with your problem (I am not able to reproduce the error in our lab. I 
> tried allreduce - 16 procs, reduce - 2 procs, reduce_scatter - 2 
> procs, on an x86 WinXP machine with 1 proc).
>
> # Make sure that you compile the IMB 3.1 suite on your local machine 
> (don't run an executable created on another machine - to narrow 
> down the problem).
> # Run your job as "mpiexec -n 2 imb-mpi1.exe allreduce".
> # Are you running your tests on a multi-core machine?
>
>   Once again, please provide as many details as possible in your reply.
>
> Regards,
> Jayesh
>
> -----Original Message-----
> From: Calin Iaru [mailto:calin at dolphinics.com]
> Sent: Friday, May 30, 2008 9:48 AM
> To: Jayesh Krishna
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
>
> 1) the job crashes on one machine with -n 2 at the same transfers:
> Allreduce, Reduce and Reduce_scatter. Jobs are running on Win32 only.
>
> Jayesh Krishna wrote:
> >
> >  Hi,
> >   Any input on the other points that I mentioned in my previous email?
> >
> > Regards,
> > Jayesh
> >
> > -----Original Message-----
> > From: Calin Iaru [mailto:calin at dolphinics.com]
> > Sent: Friday, May 30, 2008 8:17 AM
> > To: Jayesh Krishna
> > Cc: mpich-discuss at mcs.anl.gov
> > Subject: Re: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
> >
> > Hi Jayesh,
> >
> > >     besides Allreduce, Reduce and Reduce_scatter also fail.
> >
> > Best regards,
> >     Calin
> >
> > Jayesh Krishna wrote:
> > >
> > >  Hi,
> > > >   I tried running the IMB 3.1 suite for allreduce on a single 
> > > > machine with up to 8 procs and did not get any errors.
> > >
> > > 1) Make sure that both node-1 & node-2 have the same data model 
> > > (data type representation). Please note that MPICH2 currently does 
> > > not support heterogeneous systems (wrt the data models used by the 
> > > > machines; e.g., you cannot run MPI procs across x86 and x64 
> > > machines). If you need to run your program across a heterogeneous 
> > > system please use MPICH1 instead.
> > >
> > > 2) Try running the benchmark on a single node/host (mpiexec -n 2 
> > > imb-mpi1.exe allreduce) and let us know the results.
> > > > 3) Are you able to run other tests in the IMB 3.1 suite?
> > >
> > > Regards,
> > > Jayesh
> > >
> > > -----Original Message-----
> > > From: owner-mpich-discuss at mcs.anl.gov 
> > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Calin Iaru
> > > Sent: Monday, May 26, 2008 5:50 AM
> > > To: mpich-discuss at mcs.anl.gov
> > > Subject: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
> > >
> > > The problem is that the latest mpich2 in combination with IMB 3.1 
> > > generates a data corruption error when running on 2 nodes. IMB was 
> > > compiled with the CHECK flag and TOL set to 0 inside IMB_declare.h.
> > > I am not sure if this is a transport error or a verification 
> > > error; it could be that the problem lies in the application code.
> > >
> > > E:\Program Files\MPICH2\bin>mpiexec.exe -hosts 2 node-1 node-2 
> > > \\node-1\e$\imb-mpi1.exe allreduce
> > > #---------------------------------------------------
> > > #    Intel (R) MPI Benchmark Suite V3.1, MPI-1 part
> > > #---------------------------------------------------
> > > # Date                  : Fri May 23 14:44:12 2008
> > > > # Machine               : x86 Family 15 Model 4 Stepping 1, GenuineIntel
> > > # System                : Windows 2003
> > > # Release               : 5.2.3790
> > > # Version               : Service Pack 1
> > > # MPI Version           : 2.0
> > > # MPI Thread Environment: MPI_THREAD_SINGLE
> > >
> > >
> > >
> > > # Calling sequence was:
> > >
> > > # \\node-1\e$\imb-mpi1.exe allreduce
> > >
> > > # Minimum message length in bytes:   0
> > > # Maximum message length in bytes:   4194304
> > > #
> > > # MPI_Datatype                   :   MPI_BYTE
> > > # MPI_Datatype for reductions    :   MPI_FLOAT
> > > # MPI_Op                         :   MPI_SUM
> > > #
> > > #
> > >
> > > # List of Benchmarks to run:
> > >
> > > # Allreduce
> > >
> > > #-----------------------------------------------------------------
> > > --
> > > --
> > > --------
> > > # Benchmarking Allreduce
> > > # #processes = 2
> > >
> > #-------------------------------------------------------------------
> > --
> > --------
> > >        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   
> > > defects
> > >             0         1000         0.51         0.52      
> > > 0.51         0.00
> > >             4         1000        80.30        80.35     
> > > 80.33         0.00
> > > 1: Error Allreduce, size = 8, sample #0 Process 1: Got invalid
buffer:
> > > Buffer entry: 2.300000
> > > 0: Error Allreduce, size = 8, sample #0 Process 0: Got invalid
buffer:
> > > Buffer entry: 2.300000
> > >
> > >
> >
>
>
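
The original report quoted above builds IMB with CHECK and TOL set to 0 and
reduces MPI_FLOAT data with MPI_SUM. As a minimal sketch of one way a
zero-tolerance check can reject a value that prints as "2.300000", the Python
below compares a single-precision 2.3 against a double-precision 2.3. This is
an assumption for illustration only: it is not IMB's checking code, the helper
names to_float32 and within_tolerance are hypothetical, and the thread does
not establish that this is the cause of the reported error.

```python
import struct


def to_float32(value):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def within_tolerance(expected, got, tol):
    """Hypothetical check: accept when the absolute difference is at most tol."""
    return abs(expected - got) <= tol


if __name__ == "__main__":
    got = to_float32(2.3)  # value as it would be stored in a float buffer
    expected = 2.3         # same value kept in double precision

    # Both values look identical at the precision "%f" prints.
    print("%f %f" % (expected, got))
    # A tolerance of exactly 0 rejects the pair; a small tolerance accepts it.
    print(within_tolerance(expected, got, 0.0))
    print(within_tolerance(expected, got, 1e-6))
```

If the expected and received values are both rounded to the same precision
before comparing, a tolerance of 0 can still be valid; the sketch only shows
that identical decimal output does not imply identical values.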
