[mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce

Calin Iaru calin at dolphinics.com
Fri May 30 09:48:23 CDT 2008


1) The job crashes on one machine with -n 2 in the same benchmarks:
Allreduce, Reduce and Reduce_scatter. The jobs run on Win32 only.

Jayesh Krishna wrote:
>
>  Hi,
>   Any inputs on the other points that I mentioned in my previous email?
>
> Regards,
> Jayesh
>
> -----Original Message-----
> From: Calin Iaru [mailto:calin at dolphinics.com]
> Sent: Friday, May 30, 2008 8:17 AM
> To: Jayesh Krishna
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
>
> Hi Jayesh,
>
>     Besides Allreduce, Reduce and Reduce_scatter also fail.
>
> Best regards,
>     Calin
>
> Jayesh Krishna wrote:
> >
> >  Hi,
> >   I tried running the IMB 3.1 suite for allreduce on a single machine
> > with up to 8 procs and did not get any errors.
> > 
> > 1) Make sure that both node-1 & node-2 have the same data model (data
> > type representation). Please note that MPICH2 currently does not
> > support heterogeneous systems (with respect to the data models used
> > by the machines; e.g., you cannot run MPI procs across x86 and x64
> > machines). If you need to run your program across a heterogeneous
> > system, please use MPICH1 instead.
> >
> > 2) Try running the benchmark on a single node/host (mpiexec -n 2
> > imb-mpi1.exe allreduce) and let us know the results.
> > 3) Are you able to run other tests in the IMB 3.1 suite?
> >
> > Regards,
> > Jayesh
> >
> > -----Original Message-----
> > From: owner-mpich-discuss at mcs.anl.gov
> > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Calin Iaru
> > Sent: Monday, May 26, 2008 5:50 AM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
> >
> > The problem is that the latest mpich2 in combination with IMB 3.1
> > generates a data corruption error when running on 2 nodes. IMB was
> > compiled with the CHECK flag and TOL set to 0 inside IMB_declare.h. I
> > am not sure if this is a transport error or a verification error; it
> > could be that the problem lies in the application code.
> >
> > E:\Program Files\MPICH2\bin>mpiexec.exe -hosts 2 node-1 node-2
> > \\node-1\e$\imb-mpi1.exe allreduce
> > #---------------------------------------------------
> > #    Intel (R) MPI Benchmark Suite V3.1, MPI-1 part
> > #---------------------------------------------------
> > # Date                  : Fri May 23 14:44:12 2008
> > # Machine               : x86 Family 15 Model 4 Stepping 1, GenuineIntel
> > # System                : Windows 2003
> > # Release               : 5.2.3790
> > # Version               : Service Pack 1
> > # MPI Version           : 2.0
> > # MPI Thread Environment: MPI_THREAD_SINGLE
> >
> >
> >
> > # Calling sequence was:
> >
> > # \\node-1\e$\imb-mpi1.exe allreduce
> >
> > # Minimum message length in bytes:   0
> > # Maximum message length in bytes:   4194304
> > #
> > # MPI_Datatype                   :   MPI_BYTE
> > # MPI_Datatype for reductions    :   MPI_FLOAT
> > # MPI_Op                         :   MPI_SUM
> > #
> > #
> >
> > # List of Benchmarks to run:
> >
> > # Allreduce
> >
> > #-----------------------------------------------------------------------------
> > # Benchmarking Allreduce
> > # #processes = 2
> > #-----------------------------------------------------------------------------
> >        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]      defects
> >             0         1000         0.51         0.52         0.51         0.00
> >             4         1000        80.30        80.35        80.33         0.00
> > 1: Error Allreduce, size = 8, sample #0 Process 1: Got invalid buffer:
> > Buffer entry: 2.300000
> > 0: Error Allreduce, size = 8, sample #0 Process 0: Got invalid buffer:
> > Buffer entry: 2.300000
> >
> >
>



