nonblocking write gets stuck
Wei-keng Liao
wkliao at eecs.northwestern.edu
Thu Aug 29 13:23:09 CDT 2019
I tested your code on one compute node and saw no error or hang.
What version of PnetCDF are you using? I tested 1.11.2.
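If you are not sure, one quick check is to print the library version string. A minimal sketch using the Fortran 77 routine nfmpi_inq_libvers (assuming the standard 'pnetcdf.inc' include from your installation):

      program pnetcdf_version
      implicit none
      include 'pnetcdf.inc'
      ! nfmpi_inq_libvers returns the PnetCDF library version string
      write(*,*) 'PnetCDF version: ', trim(nfmpi_inq_libvers())
      end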
Wei-keng
> On Aug 29, 2019, at 12:22 PM, 刘壮 <liuzhuang at lsec.cc.ac.cn> wrote:
>
> Hi Wei-keng,
>
> Thanks very much for your reply.
> I am trying to use a subset of the MPI processes to do the output for my program.
> For example, if "group_size=10" and the total number of running processes
> is 41, then I want processes "0, 10, 20, 30, 40" to do the output; these are
> the processes with "grank=0". To create the nc file, I know that all the output
> processes must be in the same communicator, so I split MPI_COMM_WORLD
> into comm2, so that the output processes "0, 10, 20, 30, 40" are all in comm2.
> Am I using some MPI or PnetCDF interface incorrectly?
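>
> Roughly, the split looks like the sketch below (a simplified illustration, not the exact code in the attached file; here the non-output ranks simply get MPI_COMM_NULL via MPI_UNDEFINED):
>
>       integer comm2, wrank, color, err, group_size
>       parameter (group_size = 10)
>       call mpi_comm_rank(MPI_COMM_WORLD, wrank, err)
>       ! ranks 0, 10, 20, ... (grank=0 in each group) do the output
>       if (mod(wrank, group_size) .eq. 0) then
>           color = 0
>       else
>           color = MPI_UNDEFINED
>       endif
>       ! non-output ranks get comm2 = MPI_COMM_NULL
>       call mpi_comm_split(MPI_COMM_WORLD, color, wrank, comm2, err)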
>
> Best
>
>
>> -----Original Message-----
>> From: "Wei-keng Liao" <wkliao at eecs.northwestern.edu>
>> Sent: 2019-08-30 00:38:08 (Friday)
>> To: "刘壮" <liuzhuang at lsec.cc.ac.cn>
>> Cc: parallel-netcdf at lists.mcs.anl.gov
>> Subject: Re: nonblocking write gets stuck
>>
>> I notice the following in your code.
>>
>> grank is produced from comm1 in line 68
>> 68 call mpi_comm_rank(comm1, grank, err)
>>
>> But when creating a new file, comm2 is used.
>> 111 if(grank .eq. 0) then
>> 112 err = nfmpi_create(comm2, filename, cmode, info, ncid)
>>
>> All collective I/O subroutines, such as nfmpi_create, require all
>> processes in the communicator to participate (in this case, all
>> processes in comm2.)
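>>
>> As a minimal sketch of what I mean (assuming comm2 contains only the output processes and is MPI_COMM_NULL on the others):
>>
>>       ! every process that belongs to comm2 must make this call;
>>       ! do not guard it with "if (grank .eq. 0)"
>>       if (comm2 .ne. MPI_COMM_NULL) then
>>           err = nfmpi_create(comm2, filename, cmode, info, ncid)
>>       endif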
>>
>> Please explain what you are trying to do.
>>
>> Wei-keng
>>
>>> On Aug 29, 2019, at 9:16 AM, 刘壮 via parallel-netcdf <parallel-netcdf at lists.mcs.anl.gov> wrote:
>>>
>>> Hi:
>>>
>>> I have run into a problem when using the nonblocking write functions in PnetCDF. The problem seems
>>> very strange: my program gets stuck in the function "nfmpi_wait_all".
>>> However, if all the output processes run on one node, the problem goes away. I have
>>> tested my program on several machines, and only one of them shows this problem.
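>>> The write itself follows the usual nonblocking pattern, roughly like the sketch below (variable names and dimensions here are placeholders, not the exact ones in the attached file):
>>>
>>>       integer(kind=MPI_OFFSET_KIND) start(2), count(2)
>>>       integer req(1), st(1), err
>>>       double precision buf(10, 10)
>>>       ! post the nonblocking write, then flush all pending requests;
>>>       ! the program hangs inside nfmpi_wait_all on that one machine
>>>       err = nfmpi_iput_vara_double(ncid, varid, start, count,
>>>      &                             buf, req(1))
>>>       err = nfmpi_wait_all(ncid, 1, req, st)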
>>> The attached file is a simplified example of my program, which also shows this problem. The files
>>> in the "Start" and "Count" directories are the "starts" and "counts" for the output processes. To
>>> reproduce the problem, run this program with 41-49 MPI processes (if your machine has more
>>> than 50 processors on one node, please increase "group_size" and run the program
>>> with between 4*group_size+1 and 5*group_size-1 processes, so that the output processes are
>>> spread over at least two nodes).
>>> Suggestions are appreciated. Thank you very much!
>>>
>>> Best,
>>> Zhuang
>>> <test.tar.gz>