[mpich-discuss] MPI-IO ERROR

Dave Goodell goodell at mcs.anl.gov
Mon Sep 27 14:09:09 CDT 2010


(please don't hijack threads by replying to existing mails with a different subject; just send a plain new mail to mpich-discuss at mcs.anl.gov)

This is a general malloc failure.  How much memory do you believe your application itself is using?  If the application consumes too much, there may simply be too little memory left for the MPI library to use for temporary buffers.
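If it is the collective-buffering path in ROMIO that runs out of memory (the ad_bgl_wrcoll.c line in your second trace is in the collective write code), one thing you could try is shrinking the collective buffer with an MPI_Info hint when you open the file.  Below is a rough Fortran sketch, untested on BG/P: cb_buffer_size and romio_cb_write are standard ROMIO hint keys, but the 1 MB value and the file name are only placeholders, and the BG/P driver may treat some hints differently.

      program iohint
      implicit none
      include 'mpif.h'
      integer ierr, info, fh

      call MPI_INIT(ierr)

C     Ask ROMIO to cap each aggregator's temporary collective buffer
C     at 1 MB; the value here is only illustrative.
      call MPI_INFO_CREATE(info, ierr)
      call MPI_INFO_SET(info, 'cb_buffer_size', '1048576', ierr)
C     Keep two-phase collective writes enabled.
      call MPI_INFO_SET(info, 'romio_cb_write', 'enable', ierr)

      call MPI_FILE_OPEN(MPI_COMM_WORLD, 'atoms.dat',
     &                   MPI_MODE_WRONLY + MPI_MODE_CREATE,
     &                   info, fh, ierr)

C     ... collective writes (MPI_FILE_WRITE_ALL etc.) go here ...

      call MPI_FILE_CLOSE(fh, ierr)
      call MPI_INFO_FREE(info, ierr)
      call MPI_FINALIZE(ierr)
      end

Setting romio_cb_write to 'disable' instead is another thing to experiment with, at the cost of many small independent writes.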

-Dave

On Sep 27, 2010, at 10:49 AM CDT, Weiqiang Wang wrote:

> Hi,
> 
> I'm trying to use MPI-IO on a BlueGene/P cluster by incorporating it into my Fortran 77 code.
> 
> The program works fine, and it has reduced the file-writing time several-fold compared to writing out separate files from each core.
> 
> However, I found that after scaling my program to more CPUs (from 32,768 to 65,536), problems start appearing. In two tests the system complained that sufficient memory could not be allocated on the I/O nodes.
> In these two tests, I tried to write out information for 12,582,912 atoms in total (x, y, z coordinates and velocities, all stored as double precision). These data are distributed uniformly among all the processors.
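> (Assuming six double-precision values per atom, that comes to roughly 12,582,912 x 6 x 8 bytes, or about 576 MB in total; spread over 65,536 processes that is only about 9 KB per rank.)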
> 
> Here below are the details of the messages in the two tests:
> 
> 1) ======================
> <Sep 24 22:40:53.496483> FE_MPI (Info) : Starting job 1636055
> <Sep 24 22:40:53.576159> FE_MPI (Info) : Waiting for job to terminate
> <Sep 24 22:40:55.770176> BE_MPI (Info) : IO - Threads initialized
> <Sep 24 22:40:55.784851> BE_MPI (Info) : I/O input runner thread terminated
> "remd22.f", line 903: 1525-037 The I/O statement cannot be processed because the I/O subsystem is unable to allocate sufficient memory for the oper
> ation.  The program will stop.
> <Sep 24 22:42:06.025409> BE_MPI (Info) : I/O output runner thread terminated
> <Sep 24 22:42:06.069553> BE_MPI (Info) : Job 1636055 switched to state TERMINATED ('T')
> 
> 2) =======================
> Out of memory in file /bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bgl/ad_bgl_wrcoll.c, line 498
> Out of memory in file /bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bgl/ad_bgl_wrcoll.c, line 498
> Abort(1) on node 59156 (rank 59156 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 59156
> Abort(1) on node 59236 (rank 59236 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 59236
> Abort(1) on node 59256 (rank 59256 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 59256
> Abort(1) on node 59152 (rank 59152 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 59152
> Abort(1) on node 59168 (rank 59168 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 59168
> Abort(1) on node 59196 (rank 59196 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 59196
> <Sep 25 00:57:52.247734> FE_MPI (Info) : Job terminated normally
> <Sep 25 00:57:52.247865> FE_MPI (Info) : exit status = (143)
> <Sep 25 00:57:52.248145> BE_MPI (Info) : Starting cleanup sequence
> <Sep 25 00:57:52.248177> BE_MPI (Info) : cleanupDatabase() - job already terminated / hasn't been added
> <Sep 25 00:57:52.288462> BE_MPI (ERROR): The error message in the job record is as follows:
> <Sep 25 00:57:52.288495> BE_MPI (ERROR):   "killed with signal 15"
> <Sep 25 00:57:52.330783> BE_MPI (ERROR): print_job_errtext() - Job 1636142 had 8 RAS events
> <Sep 25 00:57:52.330821> BE_MPI (ERROR): print_job_errtext() - last event: KERN_080A  DDR controller single symbol error count.  Controller 1, chip select 4  Count=3
> <Sep 25 00:57:52.330890> BE_MPI (ERROR): print_job_errtext() - Check the Navigator's job history for complete details
> 
> 
> I would very much appreciate it if anyone could give me some advice on this and how to solve it.
> 
> Thank you!
> 
> Weiqiang


