[MPICH] Problem in using the MPI_Type_create_darray() API

Rajeev Thakur thakur at mcs.anl.gov
Thu Mar 15 13:29:33 CDT 2007


If the global array is block distributed along rows only, and the total
number of rows is 2*nprocs, shouldn't the local array size be 2*COLS? The
memory buffer allocated below, however, is of size COLS+1.
 
Rajeev


  _____  

From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Christina Patrick
Sent: Thursday, March 15, 2007 11:35 AM
To: mpich-discuss-digest at mcs.anl.gov
Subject: [MPICH] Problem in using the MPI_Type_create_darray() API


I have written an MPI program, shown below.

program.c
~~~~~~~
 
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ROWS   (long)(2 * nprocs)
#define COLS   (long)(4 * nprocs)
#define MPI_ERR_CHECK(error_code) if((error_code) != MPI_SUCCESS) { \
                                    MPI_Error_string(error_code, string, &len); \
                                    fprintf(stderr, "error_code: %s\n", string); \
                                    return error_code; \
                                  }


int main(int argc, char **argv) {
  int *buf = NULL, nprocs = 0, mynod = 0, error_code = 0, len = 0, i = 0, j = 0, provided;
  char string[MPI_MAX_ERROR_STRING], filename[] = "pvfs2:/tmp/pvfs2-fs/TEST";
  MPI_Datatype darray;
  MPI_File fh;
  MPI_Status status;

  error_code = MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_ERR_CHECK(error_code);
  error_code = MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_ERR_CHECK(error_code);
  error_code = MPI_Comm_rank(MPI_COMM_WORLD, &mynod);
  MPI_ERR_CHECK(error_code);

  int array_size[2]    = {ROWS, COLS};
  int array_distrib[2] = {MPI_DISTRIBUTE_BLOCK, MPI_DISTRIBUTE_BLOCK};
  int array_dargs[2]   = {MPI_DISTRIBUTE_DFLT_DARG, MPI_DISTRIBUTE_DFLT_DARG};
  int array_psizes[2]  = {nprocs, 1};

  error_code = MPI_Type_create_darray(nprocs, mynod, 2, array_size, array_distrib,
                                      array_dargs, array_psizes, MPI_ORDER_C, MPI_INT, &darray);
  MPI_ERR_CHECK(error_code);
  error_code = MPI_Type_commit(&darray);
  MPI_ERR_CHECK(error_code);
  error_code = MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
  MPI_ERR_CHECK(error_code);
  error_code = MPI_File_set_view(fh, 0, MPI_INT, darray, "native", MPI_INFO_NULL);
  MPI_ERR_CHECK(error_code);
  buf = (int *) calloc(COLS+1, sizeof MPI_INT);
  if(!buf) { fprintf(stderr, "malloc error\n"); exit(0); }

  for(i = 0; i < ROWS/nprocs; i++) {
    error_code = MPI_File_read_all(fh, buf, COLS, MPI_INT, &status);
    MPI_ERR_CHECK(error_code);
  }

  free(buf);
  MPI_Finalize();
  return 0;
}
~~~~~~~~
 
I have used extremely small buffer sizes in this example just to check the
validity of the program. My file is 1 GB in size, and I am reading only a
small portion of it as a test. I am getting strange errors when I run the
program. Sometimes, under the control of a debugger, the program fails at the
line that calls "free(buf);":
2:  (gdb) backtrace
0:  #0  main (argc=1, argv=0xbf8f8134) at program.c:44
1:  #0  main (argc=1, argv=0xbf80a044) at program.c:44
2:  #0  0x00211ed8 in _int_free () from /lib/libc.so.6
3:  #0  main (argc=1, argv=0xbfaf7b34) at program.c:44
0-1,3:  (gdb)
2:  #1  0x0021272b in free () from /lib/libc.so.6
2:  #2  0x0804b655 in main (argc=1, argv=0xbfc76cb4) at program.c:43
 
However, I do not think this problem is related to freeing the buffer,
because there are times when the first collective call succeeds and a
subsequent collective call then fails with a SEGV. When I debug the program
in that case, it shows that the global variable "ADIOI_Flatlist" is getting
corrupted. Here is where my program faults, along with the values of this
global linked list at the time of the fault:
 
0-3:  (gdb) p ADIOI_Flatlist
0:  $1 = (ADIOI_Flatlist_node *) 0x99657e8
1:  $1 = (ADIOI_Flatlist_node *) 0x8e1fbb0
2:  $1 = (ADIOI_Flatlist_node *) 0x85bd608
3:  $1 = (ADIOI_Flatlist_node *) 0x82ee7c0
0-3:  (gdb) p ADIOI_Flatlist->next
0:  $2 = (struct ADIOI_Fl_node *) 0x9986310
1:  $2 = (struct ADIOI_Fl_node *) 0x8e3d218
2:  $2 = (struct ADIOI_Fl_node *) 0x56          <== This is the invalid
value. Somehow the linked list is getting corrupted: instead of NULL or a
valid pointer, the list holds an invalid value such as 0x56.
3:  $2 = (struct ADIOI_Fl_node *) 0x830f280
 
Initially I thought the problem was caused by the separate I/O thread I am
creating. However, when I remove my code changes from the MPICH2 library, I
still get the same errors. If somebody out there could help me out, I would
really appreciate it. I compiled the MPICH2 library as follows:
# ./configure --with-pvfs2=<pvfs2path> --with-file-system=pvfs2+ufs+nfs \
    --enable-threads=multiple -prefix=<pathToInstallMpich2> --enable-g=dbg \
    --enable-debuginfo
My program is also compiled with the "-ggdb3" option.

Thanks,
Christina.
 
