[mpich-discuss] MPI-IO ERROR

Weiqiang Wang wangweiq at usc.edu
Mon Sep 27 10:49:55 CDT 2010


Hi,

I'm trying to use MPI-IO on a BlueGene/P cluster by incorporating it into my Fortran 77 code.

The program works fine, and it has reduced the file-writing time severalfold compared with writing out a separate file from each core.

However, I found that after I scale my program to more CPUs (from 32,768 to 65,536), problems start to appear. In both tests the system complained that sufficient memory could not be allocated on the I/O nodes.
In these two tests, I tried to write out information for 12,582,912 atoms in total (x, y, z coordinates and velocities, all in double precision). The data are distributed uniformly across all the processors.
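
For reference, the write pattern in my code is essentially a collective write of each rank's contiguous block. The following is only a simplified sketch of what I am doing, not the actual code (the file name is made up; at 65,536 ranks each rank holds 12,582,912 / 65,536 = 192 atoms):

      PROGRAM IOSKET
      IMPLICIT NONE
      INCLUDE 'mpif.h'
C     NLOC atoms per rank (12,582,912 atoms / 65,536 ranks = 192),
C     6 double-precision values per atom: x,y,z and the velocities.
      INTEGER NLOC
      PARAMETER (NLOC = 192)
      DOUBLE PRECISION BUF(6*NLOC)
      INTEGER FH, IERR, RANK
      INTEGER STATUS(MPI_STATUS_SIZE)
C     File offsets must be 8-byte integers (kind MPI_OFFSET_KIND).
      INTEGER*8 OFFSET

      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
C     ... BUF would be filled from the local atom arrays here ...
      CALL MPI_FILE_OPEN(MPI_COMM_WORLD, 'atoms.dat',
     &                   MPI_MODE_CREATE + MPI_MODE_WRONLY,
     &                   MPI_INFO_NULL, FH, IERR)
C     Every rank writes its contiguous block at a rank-dependent
C     byte offset; WRITE_AT_ALL makes the operation collective.
      OFFSET = RANK
      OFFSET = OFFSET * 6 * NLOC * 8
      CALL MPI_FILE_WRITE_AT_ALL(FH, OFFSET, BUF, 6*NLOC,
     &                           MPI_DOUBLE_PRECISION, STATUS, IERR)
      CALL MPI_FILE_CLOSE(FH, IERR)
      CALL MPI_FINALIZE(IERR)
      END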

Below are the details of the messages from the two tests:

1) ======================
<Sep 24 22:40:53.496483> FE_MPI (Info) : Starting job 1636055
<Sep 24 22:40:53.576159> FE_MPI (Info) : Waiting for job to terminate
<Sep 24 22:40:55.770176> BE_MPI (Info) : IO - Threads initialized
<Sep 24 22:40:55.784851> BE_MPI (Info) : I/O input runner thread terminated
"remd22.f", line 903: 1525-037 The I/O statement cannot be processed because the I/O subsystem is unable to allocate sufficient memory for the oper
ation.  The program will stop.
<Sep 24 22:42:06.025409> BE_MPI (Info) : I/O output runner thread terminated
<Sep 24 22:42:06.069553> BE_MPI (Info) : Job 1636055 switched to state TERMINATED ('T')

2) =======================
Out of memory in file /bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bgl/ad_bgl_wrcoll.c, line 498
Out of memory in file /bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bgl/ad_bgl_wrcoll.c, line 498
Abort(1) on node 59156 (rank 59156 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 59156
Abort(1) on node 59236 (rank 59236 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 59236
Abort(1) on node 59256 (rank 59256 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 59256
Abort(1) on node 59152 (rank 59152 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 59152
Abort(1) on node 59168 (rank 59168 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 59168
Abort(1) on node 59196 (rank 59196 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 59196
<Sep 25 00:57:52.247734> FE_MPI (Info) : Job terminated normally
<Sep 25 00:57:52.247865> FE_MPI (Info) : exit status = (143)
<Sep 25 00:57:52.248145> BE_MPI (Info) : Starting cleanup sequence
<Sep 25 00:57:52.248177> BE_MPI (Info) : cleanupDatabase() - job already terminated / hasn't been added
<Sep 25 00:57:52.288462> BE_MPI (ERROR): The error message in the job record is as follows:
<Sep 25 00:57:52.288495> BE_MPI (ERROR):   "killed with signal 15"
<Sep 25 00:57:52.330783> BE_MPI (ERROR): print_job_errtext() - Job 1636142 had 8 RAS events
<Sep 25 00:57:52.330821> BE_MPI (ERROR): print_job_errtext() - last event: KERN_080A  DDR controller single symbol error count.  Controller 1, chip select 4  Count=3
<Sep 25 00:57:52.330890> BE_MPI (ERROR): print_job_errtext() - Check the Navigator's job history for complete details
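
Since the second failure is inside ROMIO's collective write path (ad_bgl_wrcoll.c), I am wondering whether I should be limiting the memory ROMIO allocates for collective buffering, for example by passing a hint such as cb_buffer_size when the file is opened. A sketch of what I have in mind (the 4 MB value is just a guess on my part):

      INTEGER INFO
      CALL MPI_INFO_CREATE(INFO, IERR)
C     cb_buffer_size is a standard ROMIO hint (the value is passed
C     as a string of bytes); 4 MB here is only a guessed smaller
C     scratch-buffer size, not a recommended setting.
      CALL MPI_INFO_SET(INFO, 'cb_buffer_size', '4194304', IERR)
      CALL MPI_FILE_OPEN(MPI_COMM_WORLD, 'atoms.dat',
     &                   MPI_MODE_CREATE + MPI_MODE_WRONLY,
     &                   INFO, FH, IERR)
      CALL MPI_INFO_FREE(INFO, IERR)

Is this the right knob to turn, or is the memory pressure coming from somewhere else on the I/O nodes?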


I would very much appreciate any advice on what is causing this and how to solve it.

Thank you!

Weiqiang

