nonblocking write gets stuck

Wei-keng Liao wkliao at eecs.northwestern.edu
Thu Aug 29 13:53:29 CDT 2019


If the hang occurs only when running on more than one compute node,
the root cause is most likely in the MPI library.
Did you run the program on an NFS file system?
Which MPI library are you using?
 
Wei-keng

> On Aug 29, 2019, at 1:23 PM, Wei-keng Liao via parallel-netcdf <parallel-netcdf at lists.mcs.anl.gov> wrote:
> 
> I tested your code on one compute node and saw no error or hang.
> Which version of PnetCDF are you using? I tested with 1.11.2.
> 
> Wei-keng
> 
>> On Aug 29, 2019, at 12:22 PM, 刘壮 <liuzhuang at lsec.cc.ac.cn> wrote:
>> 
>> Hi Wei-keng,
>> 
>>    Thanks very much for your reply. 
>>    I am trying to use a subset of the MPI processes to do the output for my program. 
>> For example, if "group_size=10" and the total number of running processes 
>> is 41, then I want processes "0, 10, 20, 30, 40" to do the output; these are the 
>> processes with "grank=0". To create the nc file, I know that all the output 
>> processes must be in the same communicator, so I split MPI_COMM_WORLD 
>> into comm2, so that the output processes "0, 10, 20, 30, 40" are all in the same comm2.
>>    Am I misusing an MPI or PnetCDF interface?
>> 
>> Best
>> 
>> 
>>> -----Original Message-----
>>> From: "Wei-keng Liao" <wkliao at eecs.northwestern.edu>
>>> Sent: 2019-08-30 00:38:08 (Friday)
>>> To: "刘壮" <liuzhuang at lsec.cc.ac.cn>
>>> Cc: parallel-netcdf at lists.mcs.anl.gov
>>> Subject: Re: nonblocking write gets stuck
>>> 
>>> I noticed the following in your code.
>>> 
>>> grank is obtained from comm1 on line 68:
>>> 68           call mpi_comm_rank(comm1, grank, err)
>>> 
>>> But when creating a new file, comm2 is used.
>>> 111           if(grank .eq. 0) then
>>> 112             err = nfmpi_create(comm2, filename, cmode, info, ncid)
>>> 
>>> All collective I/O subroutines, such as nfmpi_create, require all
>>> processes in the communicator to participate (in this case, all
>>> processes in comm2). If any process in comm2 skips the call, the
>>> other processes can hang waiting for it.
>>> 
>>> Please explain what you are trying to do.
>>> 
>>> Wei-keng
>>> 
>>>> On Aug 29, 2019, at 9:16 AM, 刘壮 via parallel-netcdf <parallel-netcdf at lists.mcs.anl.gov> wrote:
>>>> 
>>>> Hi:
>>>> 
>>>> I have a problem when using the nonblocking write functions in PnetCDF. The problem seems
>>>> very strange: my program gets stuck in the function "nfmpi_wait_all". 
>>>> However, if all the outputting processes run on one node, the problem goes away. Also,
>>>> I have tested my program on several machines, and only one of them has this problem. 
>>>> The attached file is a simplified example of my program, which also shows this problem. The files
>>>> in the "Start" and "Count" directories are the "starts" and "counts" for the outputting processes. To 
>>>> reproduce the problem, run the program with 41~49 MPI processes (if your machine has more 
>>>> than 50 processors on one node, please increase "group_size" and run the program 
>>>> using 4*group_size+1 to 5*group_size-1 processes, to make sure that the outputting processes are 
>>>> running on at least two nodes).
>>>> Suggestions are appreciated. Thank you very much!
>>>> 
>>>> Best,
>>>> Zhuang
>>>> <test.tar.gz>
> 
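
The rank layout described in the thread (group_size=10, 41 processes, output on ranks 0, 10, 20, 30, 40) can be checked with a little arithmetic. The following is a minimal sketch in plain Python, not the attached program and not MPI code. It assumes grank = rank % group_size (the thread does not show how comm1 is built) and uses a hypothetical UNDEFINED marker as a stand-in for the MPI_UNDEFINED color that MPI_Comm_split accepts for ranks that should be left out of the new communicator. The function names are illustrative only.

```python
# Stand-in for MPI_UNDEFINED (assumption): ranks given this color
# would receive no new communicator from a split.
UNDEFINED = None


def output_plan(nprocs, group_size):
    """Map each world rank to (grank, color) for a hypothetical split.

    Assumes grank = rank % group_size, i.e. comm1 groups consecutive ranks.
    """
    plan = {}
    for rank in range(nprocs):
        grank = rank % group_size
        # Only the first rank of each group joins the output communicator.
        color = 0 if grank == 0 else UNDEFINED
        plan[rank] = (grank, color)
    return plan


def writers(nprocs, group_size):
    """World ranks that would end up in the output communicator (comm2)."""
    plan = output_plan(nprocs, group_size)
    return [rank for rank, (_, color) in plan.items() if color is not UNDEFINED]


print(writers(41, 10))  # [0, 10, 20, 30, 40]
```

Under these assumptions, the same five ranks are selected for every run size from 41 to 49, and each of them would have to make every collective call on comm2, such as nfmpi_create.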


