[mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce

Calin Iaru calin at dolphinics.com
Fri May 30 11:57:56 CDT 2008


IMB was compiled from the Studio 2003 command prompt by launching
"nmake -f make_ict_win", and it links against
Program Files\mpich2\lib\mpich2.lib. The machine where it runs has 2 CPUs
with hyperthreading enabled and is also the build machine. I ran it on
both CPUs and on a single CPU by adding a SetProcessAffinityMask call
before MPI_Init.

I added some extra information to the output: the hexadecimal 
representation of the expected value and the hexadecimal representation 
of the difference between the expected and the received value.
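
The kind of extra output described above can be sketched roughly as
follows. This is only a minimal illustration in C99, not the actual change
made to the IMB sources: the helper names float_bits and report_mismatch
are hypothetical, and the code assumes a 32-bit IEEE-754 float.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Return the raw bit pattern of a float.
 * Assumption: float is 32 bits wide (checked by the assert). */
static uint32_t float_bits(float f)
{
    uint32_t u;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&u, &f, sizeof u);   /* memcpy avoids pointer-aliasing problems */
    return u;
}

/* Hypothetical helper: write the expected value, the received value and
 * their difference into out, each in decimal and in hexadecimal.
 * Returns 1 if the two values match bit for bit, 0 otherwise. */
static int report_mismatch(char *out, size_t n, float expected, float got)
{
    float diff = expected - got;

    snprintf(out, n,
             "expected %f (0x%08" PRIX32 ") got %f (0x%08" PRIX32 ") "
             "diff %e (0x%08" PRIX32 ")",
             expected, float_bits(expected),
             got, float_bits(got),
             diff, float_bits(diff));

    return float_bits(expected) == float_bits(got);
}
```

Calling report_mismatch(buf, sizeof buf, expected, received) from the
check routine fills buf with one line per defect. Note that a bit-for-bit
comparison is stricter than ==, since it also distinguishes +0.0 from -0.0.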



Jayesh Krishna wrote:
>
>  Hi,
>   Please provide us as many details as possible so that we can help 
> with your problem (I am not able to reproduce the error in our lab. I 
> tried allreduce - 16 procs, reduce - 2 procs, reduce_scatter - 2 
> procs, on an x86 WinXP machine with 1 proc).
>
> # Make sure that you compile the IMB 3.1 suite on your local machine 
> (don't run an executable created on another machine; this helps to 
> narrow down the problem)
> # Run your job as "mpiexec -n 2 imb-mpi1.exe allreduce"
> # Are you running your tests on a multi-core machine ?
>
>   Once again, please provide as many details as possible in your reply.
>
> Regards,
> Jayesh
>
> -----Original Message-----
> From: Calin Iaru [mailto:calin at dolphinics.com]
> Sent: Friday, May 30, 2008 9:48 AM
> To: Jayesh Krishna
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
>
> 1) the job crashes on one machine with -n 2 at the same transfers:
> Allreduce, Reduce and Reduce_scatter. Jobs are running on Win32 only.
>
> Jayesh Krishna wrote:
> >
> >  Hi,
> >   Any inputs on the other points that I mentioned in my previous email?
> >
> > Regards,
> > Jayesh
> >
> > -----Original Message-----
> > From: Calin Iaru [mailto:calin at dolphinics.com]
> > Sent: Friday, May 30, 2008 8:17 AM
> > To: Jayesh Krishna
> > Cc: mpich-discuss at mcs.anl.gov
> > Subject: Re: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
> >
> > Hi Jayesh,
> >
> >     besides Allreduce, Reduce and Reduce_scatter also fail.
> >
> > Best regards,
> >     Calin
> >
> > Jayesh Krishna wrote:
> > >
> > >  Hi,
> > >   I tried running the IMB 3.1 suite for allreduce on a single
> > > machine with up to 8 procs and did not get any errors.
> > >
> > > 1) Make sure that both node-1 & node-2 have the same data model
> > > (data type representation). Please note that MPICH2 currently does
> > > not support heterogeneous systems (with respect to the data models
> > > used by the machines; e.g., you cannot run MPI procs across x86 and x64
> > > machines). If you need to run your program across a heterogeneous
> > > system please use MPICH1 instead.
> > >
> > > 2) Try running the benchmark on a single node/host (mpiexec -n 2
> > > imb-mpi1.exe allreduce) and let us know the results.
> > > 3) Are you able to run other tests in the IMB 3.1 suite ?
> > >
> > > Regards,
> > > Jayesh
> > >
> > > -----Original Message-----
> > > From: owner-mpich-discuss at mcs.anl.gov
> > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Calin Iaru
> > > Sent: Monday, May 26, 2008 5:50 AM
> > > To: mpich-discuss at mcs.anl.gov
> > > Subject: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
> > >
> > > The problem is that the latest mpich2 in combination with IMB 3.1
> > > generates a data corruption error when running on 2 nodes. IMB was
> > > compiled with the CHECK flag and TOL set to 0 inside IMB_declare.h.
> > > I am not sure if this is a transport error or a verification error;
> > > it could be that the problem lies in the application code.
> > >
> > > E:\Program Files\MPICH2\bin>mpiexec.exe -hosts 2 node-1 node-2
> > > \\node-1\e$\imb-mpi1.exe allreduce
> > > #---------------------------------------------------
> > > #    Intel (R) MPI Benchmark Suite V3.1, MPI-1 part
> > > #---------------------------------------------------
> > > # Date                  : Fri May 23 14:44:12 2008
> > > > # Machine               : x86 Family 15 Model 4 Stepping 1, GenuineIntel
> > > # System                : Windows 2003
> > > # Release               : 5.2.3790
> > > # Version               : Service Pack 1
> > > # MPI Version           : 2.0
> > > # MPI Thread Environment: MPI_THREAD_SINGLE
> > >
> > >
> > >
> > > # Calling sequence was:
> > >
> > > # \\node-1\e$\imb-mpi1.exe allreduce
> > >
> > > # Minimum message length in bytes:   0
> > > # Maximum message length in bytes:   4194304
> > > #
> > > # MPI_Datatype                   :   MPI_BYTE
> > > # MPI_Datatype for reductions    :   MPI_FLOAT
> > > # MPI_Op                         :   MPI_SUM
> > > #
> > > #
> > >
> > > # List of Benchmarks to run:
> > >
> > > # Allreduce
> > >
> > > #-------------------------------------------------------------------
> > > --
> > > --------
> > > # Benchmarking Allreduce
> > > # #processes = 2
> > >
> > #---------------------------------------------------------------------
> > --------
> > >        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   
> > > defects
> > >             0         1000         0.51         0.52      
> > > 0.51         0.00
> > >             4         1000        80.30        80.35     
> > > 80.33         0.00
> > > 1: Error Allreduce, size = 8, sample #0 Process 1: Got invalid buffer:
> > > Buffer entry: 2.300000
> > > 0: Error Allreduce, size = 8, sample #0 Process 0: Got invalid buffer:
> > > Buffer entry: 2.300000
> > >
> > >
> >
>
>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: allreduce.txt
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20080530/5f3a7869/attachment.txt>

