nonblocking write gets stuck

Wei-keng Liao wkliao at eecs.northwestern.edu
Thu Aug 29 13:23:09 CDT 2019


I tested your code on one compute node and saw no error or hang.
What version of PnetCDF are you using? I tested 1.11.2.

Wei-keng

> On Aug 29, 2019, at 12:22 PM, 刘壮 <liuzhuang at lsec.cc.ac.cn> wrote:
> 
> Hi Wei-keng,
> 
>     Thanks very much for your reply.
>     I am trying to use only part of the MPI processes to do the output for my
> program. For example, if "group_size=10" and the total number of running
> processes is 41, then I want processes "0, 10, 20, 30, 40" to do the output;
> these are the processes with "grank=0". To create the nc file, I know that
> all the output processes should be in the same communicator, so I split
> MPI_COMM_WORLD into comm2, so that the output processes "0, 10, 20, 30, 40"
> are all in comm2.
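> 
>     A minimal sketch of the split I have in mind is below (the variable names
> and the use of MPI_UNDEFINED, which leaves the non-output processes with
> comm2 = MPI_COMM_NULL, are only illustrative and may differ from the attached
> code):
> 
>       program split_sketch
>       use mpi
>       implicit none
>       integer, parameter :: group_size = 10
>       integer wrank, grank, color, comm1, comm2, err
> 
>       call MPI_Init(err)
>       call MPI_Comm_rank(MPI_COMM_WORLD, wrank, err)
>       ! comm1 groups the processes into blocks of group_size
>       call MPI_Comm_split(MPI_COMM_WORLD, wrank/group_size, wrank, comm1, err)
>       ! grank is the rank within each group
>       call MPI_Comm_rank(comm1, grank, err)
>       ! comm2 gathers the group leaders (grank == 0); the other processes
>       ! pass MPI_UNDEFINED and receive comm2 = MPI_COMM_NULL
>       if (grank .eq. 0) then
>          color = 0
>       else
>          color = MPI_UNDEFINED
>       endif
>       call MPI_Comm_split(MPI_COMM_WORLD, color, wrank, comm2, err)
>       call MPI_Finalize(err)
>       end program split_sketch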
>     Am I misusing some interface in MPI or PnetCDF?
> 
> Best
> 
> 
>> -----Original Message-----
>> From: "Wei-keng Liao" <wkliao at eecs.northwestern.edu>
>> Sent: 2019-08-30 00:38:08 (Friday)
>> To: "刘壮" <liuzhuang at lsec.cc.ac.cn>
>> Cc: parallel-netcdf at lists.mcs.anl.gov
>> Subject: Re: nonblocking write gets stuck
>> 
>> I notice the following in your code.
>> 
>> grank is obtained from comm1 at line 68:
>> 68           call mpi_comm_rank(comm1, grank, err)
>> 
>> But when creating the new file, comm2 is used:
>> 111           if(grank .eq. 0) then
>> 112             err = nfmpi_create(comm2, filename, cmode, info, ncid)
>> 
>> All collective I/O subroutines, such as nfmpi_create, require all
>> processes in the communicator to participate (in this case, all
>> processes in comm2).
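>> 
>> For example, assuming comm2 was created with MPI_Comm_split so that the
>> non-output processes hold MPI_COMM_NULL (an assumption about your code),
>> the guard should test membership of comm2 rather than grank:
>> 
>>           ! every process that belongs to comm2 must make this collective call
>>           if (comm2 .ne. MPI_COMM_NULL) then
>>             err = nfmpi_create(comm2, filename, cmode, info, ncid)
>>           endif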
>> 
>> Please explain what you are trying to do.
>> 
>> Wei-keng
>> 
>>> On Aug 29, 2019, at 9:16 AM, 刘壮 via parallel-netcdf <parallel-netcdf at lists.mcs.anl.gov> wrote:
>>> 
>>> Hi:
>>> 
>>> I have run into a problem when using the nonblocking write functions in
>>> PnetCDF. The problem seems very strange: my program gets stuck in the
>>> function "nfmpi_wait_all".
>>> However, if all the output processes run on one node, the problem goes away.
>>> I have tested my program on several machines, and only one of them has this
>>> problem.
>>> The attached file is a simplified example of my program, which shows the same
>>> problem. The files in the "Start" and "Count" directories are the "starts"
>>> and "counts" for the output processes. To reproduce the problem, run this
>>> program with 41~49 MPI processes (if your machine has more than 50 processors
>>> on one node, please increase "group_size" and run the program with
>>> 4*group_size+1 ~ 5*group_size-1 processes, to make sure that the output
>>> processes run on at least two nodes). The write pattern is sketched below.
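>>> 
>>> (The sketch below only illustrates the write pattern; the variable names and
>>> array sizes are not the ones in the attached code.)
>>> 
>>>       integer(kind=MPI_OFFSET_KIND) start(2), count(2)
>>>       integer ncid, varid, req(1), st(1), err
>>>       double precision buf(100)
>>> 
>>>       ! ncid comes from nfmpi_create, varid from nfmpi_def_var
>>>       ! each output process posts its nonblocking write
>>>       err = nfmpi_iput_vara_double(ncid, varid, start, count, buf, req(1))
>>>       ! nfmpi_wait_all is collective over the communicator passed to
>>>       ! nfmpi_create; this is the call where the program gets stuck
>>>       err = nfmpi_wait_all(ncid, 1, req, st)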
>>> Any suggestions are appreciated. Thank you very much!
>>> 
>>> Best,
>>> Zhuang
>>> <test.tar.gz>


