<div>I have written an MPI program that goes as follows.</div>
<div><br>program.c<br>~~~~~~~</div>
<div> </div>
<div>#include "mpi.h"<br>#include <stdio.h><br>#include <stdlib.h><br>#include <string.h></div>
<p>#define ROWS (long)(2 * nprocs)<br>#define COLS (long)(4 * nprocs)<br>#define MPI_ERR_CHECK(error_code) if((error_code) != MPI_SUCCESS) { \<br> MPI_Error_string(error_code, string, &len); \
<br> fprintf(stderr, "error_code: %s\n", string); \<br> return error_code; \<br> }</p>
<p><br>int main(int argc, char **argv) {<br> int *buf = NULL, nprocs = 0, mynod = 0, error_code = 0, len = 0, i = 0, j = 0, provided;<br> char string[MPI_MAX_ERROR_STRING], filename[] = "pvfs2:/tmp/pvfs2-fs/TEST";
<br> MPI_Datatype darray;<br> MPI_File fh;<br> MPI_Status status;</p>
<p> error_code = MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided); MPI_ERR_CHECK(error_code);<br> error_code = MPI_Comm_size(MPI_COMM_WORLD, &nprocs); MPI_ERR_CHECK(error_code);<br> error_code = MPI_Comm_rank(MPI_COMM_WORLD, &mynod); MPI_ERR_CHECK(error_code);
</p>
<p> int array_size[2] = {ROWS, COLS};<br> int array_distrib[2] = {MPI_DISTRIBUTE_BLOCK, MPI_DISTRIBUTE_BLOCK};<br> int array_dargs[2] = {MPI_DISTRIBUTE_DFLT_DARG, MPI_DISTRIBUTE_DFLT_DARG};<br> int array_psizes[2] = {nprocs, 1};
</p>
<p> error_code = MPI_Type_create_darray(nprocs, mynod, 2, array_size, array_distrib, array_dargs, array_psizes, MPI_ORDER_C, MPI_INT, &darray); MPI_ERR_CHECK(error_code);<br> error_code = MPI_Type_commit(&darray); MPI_ERR_CHECK(error_code);
<br> error_code = MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh); MPI_ERR_CHECK(error_code);<br> error_code = MPI_File_set_view(fh, 0, MPI_INT, darray, "native", MPI_INFO_NULL); MPI_ERR_CHECK(error_code);
<br> buf = (int *) calloc(COLS+1, sizeof MPI_INT);<br> if(!buf) { fprintf(stderr, "malloc error\n"); exit(0); }</p>
<p> for(i = 0; i < ROWS/nprocs; i++) {<br> error_code = MPI_File_read_all(fh, buf, COLS, MPI_INT, &status); MPI_ERR_CHECK(error_code);<br> }</p>
<div> free(buf);<br> MPI_Finalize();<br> return 0;<br>}</div>
<div>~~~~~~~~<br> <br>I have used extremely small buffer sizes in this example program just to check its validity. My file is 1 GB in size, and I am reading only a small portion of it to test the program. I am getting strange errors when I run it. Sometimes, under the control of a debugger, the program fails at the line where I call "free(buf);".
</div>
<div>2: (gdb) backtrace<br>0: #0 main (argc=1, argv=0xbf8f8134) at program.c:44<br>1: #0 main (argc=1, argv=0xbf80a044) at program.c:44<br>2: #0 0x00211ed8 in _int_free () from /lib/libc.so.6<br>3: #0 main (argc=1, argv=0xbfaf7b34) at
program.c:44<br>0-1,3: (gdb) 2: #1 0x0021272b in free () from /lib/libc.so.6<br>2: #2 0x0804b655 in main (argc=1, argv=0xbfc76cb4) at program.c:43</div>
<div> </div>
<div>However, I do not think the problem is related to freeing the buffer, because there are times when the first collective call succeeds and subsequent collective calls then fail with a SEGV. When I debug the program in that case, it shows that the global variable "ADIOI_Flatlist" is getting corrupted. Here is where my program faults, along with the values of this global linked list at the time of the fault:
</div>
<div> </div>
<div>0-3: (gdb) p ADIOI_Flatlist<br>0: $1 = (ADIOI_Flatlist_node *) 0x99657e8<br>1: $1 = (ADIOI_Flatlist_node *) 0x8e1fbb0<br>2: $1 = (ADIOI_Flatlist_node *) 0x85bd608<br>3: $1 = (ADIOI_Flatlist_node *) 0x82ee7c0<br>
0-3: (gdb) p ADIOI_Flatlist->next<br>0: $2 = (struct ADIOI_Fl_node *) 0x9986310<br>1: $2 = (struct ADIOI_Fl_node *) 0x8e3d218<br>2: $2 = (struct ADIOI_Fl_node *) 0x56 <== This is the invalid value. Somehow the linked list is getting corrupted: instead of NULL, an invalid value like 0x56 is stored in the list.
<br>3: $2 = (struct ADIOI_Fl_node *) 0x830f280</div>
<div> </div>
<div>Initially I thought the problem was caused by the separate I/O thread that I create. However, when I remove my code changes from the MPICH2 library, I still get the same errors. If somebody could help me out, I would really appreciate it. I compiled the MPICH2 library as follows:
</div>
<div># ./configure --with-pvfs2=<pvfs2path> --with-file-system=pvfs2+ufs+nfs --enable-threads=multiple -prefix=<pathToInstallMpich2> --enable-g=dbg --enable-debuginfo</div>
<div>My program is also compiled with the "-ggdb3" option.</div>
<p>Thanks,<br>Christina.<br> </p>