nonblocking write gets stuck

刘壮 liuzhuang at lsec.cc.ac.cn
Sun Sep 1 22:35:25 CDT 2019


Hi Wei-keng,

    Thanks for your comments. You are right: after running more tests over the past few days,
I found that the hang comes from the MPI library.
    Originally, I used the mpiifort compiler from the Intel Parallel Studio 2017 installed
on the cluster, and my program got stuck. Then I tried mpif90 from Open MPI, and also mpiifort
from Intel Parallel Studio 2019 installed under my own account; both run smoothly.
    Now my program can write its output normally on this cluster. Thank you again for your
help.

Zhuang


> -----Original Message-----
> From: "Wei-keng Liao" <wkliao at eecs.northwestern.edu>
> Sent: 2019-08-30 02:53:29 (Friday)
> To: "刘壮" <liuzhuang at lsec.cc.ac.cn>
> Cc: parallel-netcdf at lists.mcs.anl.gov
> Subject: Re: nonblocking write gets stuck
> 
> If the hang only occurs when running on more than one compute node,
> then the root of the problem is most likely the MPI library.
> Did you run the program on an NFS file system?
> What MPI library are you using?
>  
> Wei-keng
> 
> > On Aug 29, 2019, at 1:23 PM, Wei-keng Liao via parallel-netcdf <parallel-netcdf at lists.mcs.anl.gov> wrote:
> > 
> > I tested your code on one compute node and saw no error or hang.
> > What version of PnetCDF are you using? I tested with 1.11.2.
> > 
> > Wei-keng
> > 
> >> On Aug 29, 2019, at 12:22 PM, 刘壮 <liuzhuang at lsec.cc.ac.cn> wrote:
> >> 
> >> Hi Wei-keng,
> >> 
> >>    Thanks very much for your reply. 
> >>    I am trying to use a subset of the MPI processes to do the output for my program. 
> >> For example, if "group_size=10" and the total number of running processes 
> >> is 41, then I want to use processes "0, 10, 20, 30, 40" to do the output, that is, 
> >> the processes with "grank=0". To create the nc file, I know that all the output 
> >> processes should be in the same communicator, so I split MPI_COMM_WORLD 
> >> to get comm2, so that the output processes "0, 10, 20, 30, 40" are all in the same comm2.
> >>    Am I misusing an interface in MPI or PnetCDF?
> >> 
> >> Best
> >> 
> >> 
> >>> -----Original Message-----
> >>> From: "Wei-keng Liao" <wkliao at eecs.northwestern.edu>
> >>> Sent: 2019-08-30 00:38:08 (Friday)
> >>> To: "刘壮" <liuzhuang at lsec.cc.ac.cn>
> >>> Cc: parallel-netcdf at lists.mcs.anl.gov
> >>> Subject: Re: nonblocking write gets stuck
> >>> 
> >>> I noticed the following in your code.
> >>> 
> >>> grank is produced from comm1 in line 68
> >>> 68           call mpi_comm_rank(comm1, grank, err)
> >>> 
> >>> But when creating a new file, comm2 is used.
> >>> 111           if(grank .eq. 0) then
> >>> 112             err = nfmpi_create(comm2, filename, cmode, info, ncid)
> >>> 
> >>> All collective I/O subroutines, such as nfmpi_create, require all
> >>> processes in the communicator to participate (in this case, all
> >>> processes in comm2.)
> >>> 
> >>> Please explain what you are trying to do.
> >>> 
> >>> Wei-keng
> >>> 
> >>>> On Aug 29, 2019, at 9:16 AM, 刘壮 via parallel-netcdf <parallel-netcdf at lists.mcs.anl.gov> wrote:
> >>>> 
> >>>> Hi:
> >>>> 
> >>>> I have run into a problem when using the nonblocking write functions in PnetCDF. The problem seems
> >>>> very strange: my program gets stuck in the function "nfmpi_wait_all". 
> >>>> However, if all the outputting processes are running on one node, the problem goes away. Also,
> >>>> I have tested my program on several machines, and only one of them has this problem. 
> >>>> The attached file is a simplified example of my program, which also has this problem. The files
> >>>> in the "Start" and "Count" directories are the "starts" and "counts" for the outputting processes. To 
> >>>> reproduce the problem, one can run this program with 41~49 MPI processes (if your machine has more 
> >>>> than 50 processors on one node, please increase "group_size" and run the program 
> >>>> using 4*group_size+1~5*group_size-1 processes, to make sure that the outputting processes are 
> >>>> running on at least two nodes).
> >>>> Suggestions are appreciated. Thank you very much!
> >>>> 
> >>>> Best,
> >>>> Zhuang
> >>>> <test.tar.gz>
> > 
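
The output-rank selection discussed in this thread (group_size=10, 41 processes, output on ranks 0, 10, 20, 30, 40) can be illustrated with a small sketch. It is plain Python that only models the rank arithmetic and makes no MPI or PnetCDF calls. The function names are hypothetical, and it assumes that grank equals the world rank modulo group_size, which matches the example in the thread but is not confirmed by the attached code.

```python
# Plain-Python model of the rank arithmetic described in the thread.
# No MPI is involved; all names here are hypothetical illustrations.
# Assumption: grank == world_rank % group_size.

def output_ranks(nprocs, group_size):
    """Return the world ranks whose rank within their group (grank) is 0."""
    return [rank for rank in range(nprocs) if rank % group_size == 0]


def split_colors(nprocs, group_size):
    """Return the color each world rank would pass when building comm2.

    0 means "member of comm2". None stands in for MPI_UNDEFINED: such a
    rank is not in comm2 and must not call collectives on it.
    """
    return [0 if rank % group_size == 0 else None for rank in range(nprocs)]


ranks = output_ranks(41, 10)
print(ranks)  # [0, 10, 20, 30, 40]
```

Under this assumption, only the five ranks listed are members of comm2, so a collective call such as nfmpi_create on comm2 is entered by every member of that communicator, which is the requirement stated earlier in the thread.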

