[MPICH] Problem in using the MPI_Type_create_darray() API
Rajeev Thakur
thakur at mcs.anl.gov
Thu Mar 15 13:29:33 CDT 2007
If the global array is block distributed along rows only, and the total
number of rows is 2*nprocs, shouldn't the local array size be 2*COLS? The
memory buffer allocated below, however, is of size COLS+1.
Rajeev
_____
From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Christina Patrick
Sent: Thursday, March 15, 2007 11:35 AM
To: mpich-discuss-digest at mcs.anl.gov
Subject: [MPICH] Problem in using the MPI_Type_create_darray() API
I have written an MPI program, shown below.
program.c
~~~~~~~
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define ROWS (long)(2 * nprocs)
#define COLS (long)(4 * nprocs)
#define MPI_ERR_CHECK(error_code) if((error_code) != MPI_SUCCESS) { \
        MPI_Error_string(error_code, string, &len); \
        fprintf(stderr, "error_code: %s\n", string); \
        return error_code; \
}
int main(int argc, char **argv) {
    int *buf = NULL, nprocs = 0, mynod = 0, error_code = 0, len = 0, i = 0, j = 0, provided;
    char string[MPI_MAX_ERROR_STRING], filename[] = "pvfs2:/tmp/pvfs2-fs/TEST";
    MPI_Datatype darray;
    MPI_File fh;
    MPI_Status status;

    error_code = MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_ERR_CHECK(error_code);
    error_code = MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_ERR_CHECK(error_code);
    error_code = MPI_Comm_rank(MPI_COMM_WORLD, &mynod);
    MPI_ERR_CHECK(error_code);
    int array_size[2] = {ROWS, COLS};
    int array_distrib[2] = {MPI_DISTRIBUTE_BLOCK, MPI_DISTRIBUTE_BLOCK};
    int array_dargs[2] = {MPI_DISTRIBUTE_DFLT_DARG, MPI_DISTRIBUTE_DFLT_DARG};
    int array_psizes[2] = {nprocs, 1};

    error_code = MPI_Type_create_darray(nprocs, mynod, 2, array_size, array_distrib,
                                        array_dargs, array_psizes, MPI_ORDER_C,
                                        MPI_INT, &darray);
    MPI_ERR_CHECK(error_code);
    error_code = MPI_Type_commit(&darray);
    MPI_ERR_CHECK(error_code);

    error_code = MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_RDONLY,
                               MPI_INFO_NULL, &fh);
    MPI_ERR_CHECK(error_code);
    error_code = MPI_File_set_view(fh, 0, MPI_INT, darray, "native", MPI_INFO_NULL);
    MPI_ERR_CHECK(error_code);
    buf = (int *) calloc(COLS+1, sizeof MPI_INT);
    if(!buf) { fprintf(stderr, "malloc error\n"); exit(0); }

    for(i = 0; i < ROWS/nprocs; i++) {
        error_code = MPI_File_read_all(fh, buf, COLS, MPI_INT, &status);
        MPI_ERR_CHECK(error_code);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
~~~~~~~~
I have used extremely small buffer sizes in this example program just to
check the validity of the program. My file is 1G in size and I am just
reading a small portion of it to test the program. I am getting strange
errors while running the program. Sometimes, under the control of a
debugger, the program fails at the line where I call "free(buf);".
2: (gdb) backtrace
0: #0 main (argc=1, argv=0xbf8f8134) at program.c:44
1: #0 main (argc=1, argv=0xbf80a044) at program.c:44
2: #0 0x00211ed8 in _int_free () from /lib/libc.so.6
3: #0 main (argc=1, argv=0xbfaf7b34) at program.c:44
0-1,3: (gdb) 2: #1 0x0021272b in free () from /lib/libc.so.6
2: #2 0x0804b655 in main (argc=1, argv=0xbfc76cb4) at program.c:43
Somehow I do not think that this problem is related to freeing the buffer,
because there are times when the first collective call succeeds and then
subsequent collective calls fail with a SEGV. When I debug the program in
this case, it shows me that the global variable "ADIOI_Flatlist" is getting
corrupted. Here is where my program faults, and the values of this global
linked list when it faults:
0-3: (gdb) p ADIOI_Flatlist
0: $1 = (ADIOI_Flatlist_node *) 0x99657e8
1: $1 = (ADIOI_Flatlist_node *) 0x8e1fbb0
2: $1 = (ADIOI_Flatlist_node *) 0x85bd608
3: $1 = (ADIOI_Flatlist_node *) 0x82ee7c0
0-3: (gdb) p ADIOI_Flatlist->next
0: $2 = (struct ADIOI_Fl_node *) 0x9986310
1: $2 = (struct ADIOI_Fl_node *) 0x8e3d218
2: $2 = (struct ADIOI_Fl_node *) 0x56 <== This is the invalid
value. Somehow the linked list is getting corrupted. Instead of NULL, you
have an invalid value like 0x56 stored in the list.
3: $2 = (struct ADIOI_Fl_node *) 0x830f280
Initially I felt that the problem was because I am creating a separate I/O
thread. However, when I remove my code changes from the MPICH2 library, I
continue getting the same errors. If somebody out there could help me out, I
would really appreciate it. I have compiled the mpich2 library as follows:
# ./configure --with-pvfs2=<pvfs2path> --with-file-system=pvfs2+ufs+nfs
--enable-threads=multiple -prefix=<pathToInstallMpich2> --enable-g=dbg
--enable-debuginfo
Also, my program is compiled with the "-ggdb3" option.
Thanks,
Christina.