nonblocking write gets stuck
Wei-keng Liao
wkliao at eecs.northwestern.edu
Thu Aug 29 13:53:29 CDT 2019
If the hang occurs only when running on more than one compute node,
then the root of the problem is most likely in the MPI library.
Did you run the program on an NFS file system?
What MPI library are you using?
Wei-keng
> On Aug 29, 2019, at 1:23 PM, Wei-keng Liao via parallel-netcdf <parallel-netcdf at lists.mcs.anl.gov> wrote:
>
> I tested your code on one compute node and saw no error or hang.
> What version of PnetCDF are you using? I tested with 1.11.2.
>
> Wei-keng
>
>> On Aug 29, 2019, at 12:22 PM, 刘壮 <liuzhuang at lsec.cc.ac.cn> wrote:
>>
>> Hi Wei-keng,
>>
>> Thanks very much for your reply.
>> I am trying to use a subset of the MPI processes to do the output for my program.
>> For example, if "group_size=10" and the total number of running processes
>> is 41, then I want to use processes "0, 10, 20, 30, 40" to do the output; these
>> are the processes with "grank=0". To create the nc file, I know that all the output
>> processes should be in the same communicator, so I split MPI_COMM_WORLD
>> into comm2, so that the output processes "0, 10, 20, 30, 40" are all in the same comm2.
>> Am I misusing an MPI or PnetCDF interface?
>>
>> Best
>>
>>
>>> -----Original Message-----
>>> From: "Wei-keng Liao" <wkliao at eecs.northwestern.edu>
>>> Sent: 2019-08-30 00:38:08 (Friday)
>>> To: "刘壮" <liuzhuang at lsec.cc.ac.cn>
>>> Cc: parallel-netcdf at lists.mcs.anl.gov
>>> Subject: Re: nonblocking write gets stuck
>>>
>>> I noticed the following in your code.
>>>
>>> grank is obtained from comm1 on line 68:
>>> 68 call mpi_comm_rank(comm1, grank, err)
>>>
>>> But when creating a new file, comm2 is used.
>>> 111 if(grank .eq. 0) then
>>> 112 err = nfmpi_create(comm2, filename, cmode, info, ncid)
>>>
>>> All collective I/O subroutines, such as nfmpi_create, require all
>>> processes in the communicator to participate (in this case, all
>>> processes in comm2).
>>>
>>> Please explain what you are trying to do.
>>>
>>> Wei-keng
>>>
>>>> On Aug 29, 2019, at 9:16 AM, 刘壮 via parallel-netcdf <parallel-netcdf at lists.mcs.anl.gov> wrote:
>>>>
>>>> Hi:
>>>>
>>>> I have run into a problem when using the nonblocking write functions in PnetCDF. The problem seems
>>>> very strange: my program gets stuck in the function "nfmpi_wait_all".
>>>> However, if all the output processes are running on one node, the problem goes away. Also,
>>>> I have tested my program on several machines, and only one of them has this problem.
>>>> The attached file is a simplified example of my program, which also has this problem. The files
>>>> in the "Start" and "Count" directories are the "starts" and "counts" for the output processes. To
>>>> see this problem, one can run this program with 41 to 49 MPI processes (if your machine has more
>>>> than 50 processors on one node, please increase "group_size" and run the program
>>>> using 4*group_size+1 to 5*group_size-1 processes, to make sure that the output processes are
>>>> running on at least two nodes).
>>>> Suggestions are appreciated. Thank you very much!
>>>>
>>>> Best,
>>>> Zhuang
>>>> <test.tar.gz>
>
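The writer selection discussed in this thread (group_size=10, 41 processes, output from ranks 0, 10, 20, 30, 40) can be checked with a small sketch of the rank arithmetic. This is plain Python with no MPI or PnetCDF calls, and it does not reproduce the attached test program. It assumes that comm1 groups consecutive world ranks in blocks of group_size, so that grank = rank % group_size, and that ranks are placed on nodes in blocks. The function names and the value of 24 cores per node are hypothetical, chosen only for illustration.

```python
def writer_ranks(nprocs, group_size):
    """Return the world ranks whose rank inside their group (grank) is 0.

    Assumption: comm1 groups consecutive world ranks in blocks of
    group_size, so grank = rank % group_size.
    """
    return [rank for rank in range(nprocs) if rank % group_size == 0]


def node_of(rank, cores_per_node):
    """Return the node index of a rank.

    Assumption: block placement, i.e. ranks fill node 0 first, then node 1.
    """
    return rank // cores_per_node


if __name__ == "__main__":
    writers = writer_ranks(41, 10)
    print("writers:", writers)

    # Hypothetical machine with 24 cores per node.
    nodes = sorted({node_of(rank, 24) for rank in writers})
    print("nodes hosting writers:", nodes)
```

Under these assumptions, any process count from 4*group_size+1 to 5*group_size-1 yields the same five writers, and they span more than one node whenever a node holds fewer cores than the highest writer rank. Every one of these ranks, and no other, would have to call the collective routines on comm2.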