nonblocking write gets stuck

刘壮 liuzhuang at lsec.cc.ac.cn
Thu Aug 29 12:22:47 CDT 2019


Hi Wei-keng,

     Thanks very much for your reply. 
     I am trying to use a subset of the MPI processes to do the output for my program. 
For example, if "group_size=10" and the total number of running processes 
is 41, then I want to use processes "0, 10, 20, 30, 40" to do the output; these 
are the processes that have "grank=0". To create the nc file, I know that all the output 
processes must be in the same communicator, so I split MPI_COMM_WORLD 
to create comm2, so that the output processes "0, 10, 20, 30, 40" are all in the same comm2.
     Am I misusing some interface in MPI or PnetCDF?
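
For concreteness, the rank layout described above can be modelled in plain Python, without MPI. This is only a sketch: it assumes that grank is the world rank modulo group_size and that the split uses color 0 for the output processes, neither of which is shown in this message. The function names are hypothetical.

```python
# Plain-Python model (no MPI) of the rank layout described above.
# Assumption: grank = world_rank mod group_size, and the communicator
# split gives color 0 to output ranks and color 1 to all other ranks.

def output_ranks(nprocs, group_size):
    """Return the world ranks whose in-group rank (grank) is 0."""
    return [r for r in range(nprocs) if r % group_size == 0]


def split_color(world_rank, group_size):
    """Color a rank would pass to the split: 0 = output rank, 1 = other."""
    return 0 if world_rank % group_size == 0 else 1


if __name__ == "__main__":
    nprocs, group_size = 41, 10
    writers = output_ranks(nprocs, group_size)
    print(writers)  # [0, 10, 20, 30, 40]

    # The collective file creation is only safe if the set of ranks that
    # call it is exactly the membership of the communicator passed to it.
    members = [r for r in range(nprocs) if split_color(r, group_size) == 0]
    print(members == writers)  # True
```

Under these assumptions, the set of ranks guarded by "grank .eq. 0" and the membership of the output communicator coincide, which is the condition a collective call needs.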

Best


> -----Original Message-----
> From: "Wei-keng Liao" <wkliao at eecs.northwestern.edu>
> Sent: 2019-08-30 00:38:08 (Friday)
> To: "刘壮" <liuzhuang at lsec.cc.ac.cn>
> Cc: parallel-netcdf at lists.mcs.anl.gov
> Subject: Re: nonblocking write gets stuck
> 
> I noticed the following in your code.
> 
> grank is produced from comm1 in line 68
> 68           call mpi_comm_rank(comm1, grank, err)
> 
> But when creating a new file, comm2 is used.
> 111           if(grank .eq. 0) then
> 112             err = nfmpi_create(comm2, filename, cmode, info, ncid)
> 
> All collective I/O subroutines, such as nfmpi_create, require all
> processes in the communicator to participate (in this case, all
> processes in comm2.)
> 
> Please explain what you are trying to do.
> 
> Wei-keng
> 
> > On Aug 29, 2019, at 9:16 AM, 刘壮 via parallel-netcdf <parallel-netcdf at lists.mcs.anl.gov> wrote:
> > 
> > Hi:
> > 
> >  I have a problem when using the nonblocking write functions in PnetCDF. The problem seems
> > very strange: my program gets stuck in the function "nfmpi_wait_all". 
> >  However, if all the output processes are running on one node, the problem goes away. Also,
> > I have tested my program on several machines, and only one of them has this problem. 
> >  The attached file is a simplified example of my program, which also has this problem. The files
> > in the "Start" and "Count" directories are the "starts" and "counts" for the output processes. To 
> > see this problem, one can run this program with 41~49 MPI processes (if your machine has more 
> > than 50 processors on one node, please increase "group_size" and run the program 
> > using 4*group_size+1~5*group_size-1 processes, to make sure that the output processes are 
> > running on at least two nodes).
> >  Suggestions are appreciated. Thank you very much!
> > 
> > Best,
> > Zhuang
> > <test.tar.gz>

