collective memory-mapped array
Wei-keng Liao
wkliao at ece.northwestern.edu
Fri Jan 29 10:41:36 CST 2010
Hi, Jose,
I thought Rob's suggestion of setting "romio_cb_write" to "disable"
and "romio_ds_write" to "disable" would work fine with your code.
In particular, disabling romio_ds_write keeps the ADIO driver from
taking file locks, which may not work perfectly.
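For reference, here is a minimal sketch of how those two hints could be
passed when the file is created. The helper name
create_with_no_lock_hints, the USE_MPI macro and the NC_CLOBBER create
mode are assumptions made for illustration and are not taken from your
code; the MPI part is only compiled when USE_MPI is defined and the
program is built against MPI and PnetCDF (for example with mpicc).

```c
#include <assert.h>
#include <stddef.h>

/* One MPI-IO hint: a key/value pair of strings. */
struct hint {
    const char *key;
    const char *value;
};

/* Rob's "Configuration #2": no collective buffering and no data
 * sieving on writes, so the ADIO driver has no need for file locks. */
static const struct hint no_lock_hints[] = {
    { "romio_cb_write", "disable" },
    { "romio_ds_write", "disable" },
};
enum { N_NO_LOCK_HINTS = sizeof no_lock_hints / sizeof no_lock_hints[0] };

#ifdef USE_MPI /* hypothetical macro: define it when building with MPI */
#include <mpi.h>
#include <pnetcdf.h>

/* Hypothetical helper: create a netCDF file with the hints above.
 * Returns the error code reported by ncmpi_create(). */
int create_with_no_lock_hints(MPI_Comm comm, const char *path, int *ncidp)
{
    MPI_Info info;
    int i, err;

    assert(path != NULL && ncidp != NULL);

    MPI_Info_create(&info);
    for (i = 0; i < N_NO_LOCK_HINTS; i++)
        MPI_Info_set(info, no_lock_hints[i].key, no_lock_hints[i].value);

    /* The info object goes where MPI_INFO_NULL would otherwise be passed. */
    err = ncmpi_create(comm, path, NC_CLOBBER, info, ncidp);

    MPI_Info_free(&info);
    return err;
}
#endif
```

Keeping the hints in a small table makes it easy to switch to Rob's
other configuration (romio_cb_write set to "enable") by editing the
table alone.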
Could you provide some sample code that reproduces the problem?
Wei-keng
On Jan 29, 2010, at 5:02 AM, Jose Gracia wrote:
> Hello Wei-keng,
>
>
> >A suggestion to see if it is Lustre ADIO driver's problem is to use
> >the "ufs:" file name prefix, eg. ufs:/home/user_id/testfile.dat
> >
> >This will force MPI-IO to use the "generic" Unix File System driver.
> >
> >Wei-keng
>
> I tried that and the problem remains. So this doesn't seem to be
> connected to Lustre then.
>
> I appreciate any other ideas.
>
> Cheers, Jose
>
>
>
>
> > On Tue, Jan 26, 2010 at 11:31:34AM +0100, Jose Gracia wrote:
> >> Hi Rob, everybody,
> >>
> >> thanks for looking into this.
> >
> > Hi Jose. Please make sure to copy the parallel-netcdf at mcs.anl.gov
> > mailing list: there might be others on the list who have some
> > experience with a problem like yours.
> >
> >>> OK, I think we need to know if this is a hang or just very very
> >>> slow response from the underlying file system.
> >>
> >> Normally the operation would complete in under a minute. I left it
> >> running for several hours a while ago with the same result.
> >>
> >>> On your system, do you have a way to capture a backtrace of some
> >>> of the MPI processes? I would like to see what the hung processes
> >>> are trying to do.
> >>>
> >> The only thing coming to my mind is ltrace. We have DDT installed,
> >> but I am not familiar with it.
> >>
> >> I have added an MPI_Barrier and sleep(30) just in front of the line
> >> quoted above in order to have time to attach the ltrace to the
> >> processes. Below are traces from two different nodes (total of 8
> >> nodes running 4 MPI tasks each). I don't see the ncmpi_* calls in
> >> the trace ... probably have to recompile with debugging symbols?
> >
> > ltrace is only able to trace symbols in shared libraries, and unless
> > you took extraordinary measures, you have a static libpnetcdf.a
> >
> >> ------------
> >>
> >> trace 1, on the master node. The SIGs towards the end come from me
> >> trying to kill the run.
> >>
> >> 1899 MPI_Type_get_envelope(0x6ae700, 0x7fffffffccbc, 0x7fffffffccb8,
> >> 0x7fffffffccb4, 0x7fffffffccb0) = 0
> >> 1899 MPI_Type_size(0x6ae700, 0x7fffffffcd70, 0x7fffffffcd58, 0,
> >> 0x7fffffffccb8) = 0
> >> 1899 MPI_Type_vector(128, 15, 17, 0x6ae700, 0x7fffffffcd48) = 0
> >> 1899 MPI_Type_commit(0x7fffffffcd48, 1, 0x6ae700, 8, 0) = 0
> >> 1899 MPI_Type_create_hvector(72, 1, 17408, 0x2aaad21409c0,
> >> 0x7fffffffcd50) = 0
> >> 1899 MPI_Type_free(0x7fffffffcd48, 1, 0x2aaad21409c0, 8,
> >> 0x7fffffffccd0) = 0
> >> 1899 MPI_Type_commit(0x7fffffffcd50, 1, 0, 8, 0x7fffffffccd0) = 0
> >> 1899 MPI_Type_create_hvector(1, 1, 0x132000, 0x2aaad2140c10,
> >> 0x7fffffffcd50) = 0
> >> 1899 MPI_Type_free(0x7fffffffcd48, 1, 0x2aaad2140c10, 8,
> >> 0x7fffffffccd0) = 0
> >> 1899 MPI_Type_commit(0x7fffffffcd50, 1, 0, 8, 0x7fffffffccd0) = 0
> >> 1899 malloc(1105920) = 0x2aaad64c8010
> >> 1899 MPI_Type_size(0x2aaad60681c0, 0x7fffffffccc4, 0x2aaad60681c0,
> >> 0x2aaad64c8010, 138240) = 0
> >> 1899 MPI_Type_size(0x6ae700, 0x7fffffffccc0, 0x2aaad60681c0,
> >> 0x2aaad64c8010, 138240) = 0
> >> 1899 MPI_Pack_size(1, 0x2aaad60681c0, 0x6ae4e0, 0x7fffffffccbc,
> >> 138240) = 0
> >> 1899 malloc(1105920) = 0x2aaad65d6020
> >> 1899 MPI_Pack(0x2b967c08, 1, 0x2aaad60681c0, 0x2aaad65d6020,
> >> 0x10e000) = 0
> >> 1899 MPI_Unpack(0x2aaad65d6020, 0x10e000, 0x7fffffffccb8,
> >> 0x2aaad64c8010, 138240) = 0
> >> 1899 free(0x2aaad65d6020) = <void>
> >> 1899 MPI_Type_free(0x7fffffffcd48, 0x2aaad65d6020, 0x2aaad66e4020,
> >> 0x2aaad65d6010, 0x2aaad65d6020) = 0
> >> 1899 MPI_Type_get_envelope(0x6ae700, 0x7fffffffcbdc, 0x7fffffffcbd8,
> >> 0x7fffffffcbd4, 0x7fffffffcbd0) = 0
> >> 1899 MPI_Type_size(0x6ae700, 0x7fffffffccb8, 0x7fffffffcc90, 0,
> >> 0x7fffffffcbd8) = 0
> >> 1899 malloc(16) = 0x2aaabc1e3e00
> >> 1899 malloc(16) = 0x2aaac12dd680
> >> 1899 malloc(16) = 0x2aaabc1e3e60
> >> 1899 MPI_Type_create_subarray(4, 0x2aaabc1e3e00, 0x2aaac12dd680,
> >> 0x2aaabc1e3e60, 0) = 0
> >> 1899 MPI_Type_commit(0x7fffffffcb00, 1, 0x6aef20, 8, 0x24000000000)
> >> = 0
> >> 1899 MPI_File_set_view(0x2aaad2147c00, 0x1a29464, 0x6aef20,
> >> 0x2aaad21407b0, 0x4a0edf) = 0
> >> 1899 MPI_Type_free(0x7fffffffcb00, 0, 0, 0, 0x2aaab08f7000) = 0
> >> 1899 free(0x2aaabc1e3e00) = <void>
> >> 1899 free(0x2aaac12dd680) = <void>
> >> 1899 free(0x2aaabc1e3e60) = <void>
> >> 1899 malloc(1105920) = 0x2aaad65d6020
> >> 1899 MPI_File_write_all(0x2aaad2147c00, 0x2aaad65d6020, 0x10e000,
> >> 0x6aef20, 0x7fffffffcc70 <unfinished ...>
> >
> > OK, what I can see from all this is that you've set up a
> > noncontiguous access pattern in the file -- not surprising for a 4D
> > variable. Things make it pretty far but get stuck in
> > MPI_File_write_all. Perfect: that's actually what I expected to see.
> >
> >> ------------
> >>
> >> trace 2, from an arbitrary node:
> >>
> >> 28099 MPI_Barrier(0x6ae880, 0, 0, 0, 0) = 0
> >> 28099 printf("IOOGSNC: rank %d about to write "..., 16) = 57
> >> 28099 sleep(30) = 0
> >> 28099 MPI_Type_get_envelope(0x6ae700, 0x7fffffffcc7c,
> >> 0x7fffffffcc78, 0x7fffffffcc74, 0x7fffffffcc70) = 0
> >> 28099 MPI_Type_size(0x6ae700, 0x7fffffffcd30, 0x7fffffffcd18, 0,
> >> 0x7fffffffcc78) = 0
> >> 28099 MPI_Type_vector(128, 5, 7, 0x6ae700, 0x7fffffffcd08) = 0
> >> 28099 MPI_Type_commit(0x7fffffffcd08, 1, 0x6ae700, 8, 0) = 0
> >> 28099 MPI_Type_create_hvector(72, 1, 7168, 0x2aaac0430fd0,
> >> 0x7fffffffcd10) = 0
> >> 28099 MPI_Type_free(0x7fffffffcd08, 1, 0x2aaac0430fd0, 8,
> >> 0x7fffffffcc90) = 0
> >> 28099 MPI_Type_commit(0x7fffffffcd10, 1, 0, 8, 0x7fffffffcc90) = 0
> >> 28099 MPI_Type_create_hvector(1, 1, 516096, 0x2aaac044df90,
> >> 0x7fffffffcd10) = 0
> >> 28099 MPI_Type_free(0x7fffffffcd08, 1, 0x2aaac044df90, 8,
> >> 0x7fffffffcc90) = 0
> >> 28099 MPI_Type_commit(0x7fffffffcd10, 1, 0, 8, 0x7fffffffcc90) = 0
> >> 28099 malloc(368640) = 0x2aaac045cab0
> >> 28099 MPI_Type_size(0x2aaac044ae10, 0x7fffffffcc84, 0x2aaac044ae10,
> >> 0x2aaac045cab0, 46080) = 0
> >> 28099 MPI_Type_size(0x6ae700, 0x7fffffffcc80, 0x2aaac044ae10,
> >> 0x2aaac045cab0, 46080) = 0
> >> 28099 MPI_Pack_size(1, 0x2aaac044ae10, 0x6ae4e0, 0x7fffffffcc7c,
> >> 46080) = 0
> >> 28099 malloc(368640) = 0x2aaac04d2310
> >> 28099 MPI_Pack(0x22ed8d78, 1, 0x2aaac044ae10, 0x2aaac04d2310,
> >> 368640) = 0
> >> 28099 MPI_Unpack(0x2aaac04d2310, 368640, 0x7fffffffcc78,
> >> 0x2aaac045cab0, 46080) = 0
> >> 28099 free(0x2aaac04d2310) = <void>
> >> 28099 MPI_Type_free(0x7fffffffcd08, 0x2aaac04d2310, 0x3080151a20, 0,
> >> 0x2aaac04d2310) = 0
> >> 28099 MPI_Type_get_envelope(0x6ae700, 0x7fffffffcb9c,
> >> 0x7fffffffcb98, 0x7fffffffcb94, 0x7fffffffcb90) = 0
> >> 28099 MPI_Type_size(0x6ae700, 0x7fffffffcc78, 0x7fffffffcc50, 0,
> >> 0x7fffffffcb98) = 0
> >> 28099 malloc(16) = 0x2aaabdb60a50
> >> 28099 malloc(16) = 0x3fedef10
> >> 28099 malloc(16) = 0xa1c270
> >> 28099 MPI_Type_create_subarray(4, 0x2aaabdb60a50, 0x3fedef10,
> >> 0xa1c270, 0) = 0
> >> 28099 MPI_Type_commit(0x7fffffffcac0, 1, 0x6aef20, 8,
> >> 0x68000000000) = 0
> >> 28099 MPI_File_set_view(0x2aaac0751de0, 0x1a29464, 0x6aef20,
> >> 0x2aaac0449560, 0x4a0edf) = 0
> >> 28099 MPI_Type_free(0x7fffffffcac0, 0, 0, 0, 0) = 0
> >> 28099 free(0x2aaabdb60a50) = <void>
> >> 28099 free(0x3fedef10) = <void>
> >> 28099 free(0xa1c270) = <void>
> >> 28099 malloc(368640) = 0x2aaac04ed340
> >> 28099 MPI_File_write_all(0x2aaac0751de0, 0x2aaac04ed340, 368640,
> >> 0x6aef20, 0x7fffffffcc30 <unfinished ...>
> >
> > Another process stuck in MPI_File_write_all. A good sign, in that it
> > suggests your MPI processes are indeed stuck in I/O, and not stuck in
> > exchanging messages or anything like that.
> >
> > Are you writing directly to Lustre, or are you writing to
> > NFS-exported Lustre?
> >
> > I think we need one more piece of information. You said if you run on
> > one big node, you can write very quickly. Can you send the output of
> > 'ncmpidump -h' or 'ncdump -h' on a completed dataset?
> >
> > I also have one trick you might want to try: Are you familiar with
> > MPI-IO "Info" objects? When you create the file, you are (probably)
> > passing in MPI_INFO_NULL. If instead you set up your own info object,
> > we can guide some of the choices the underlying MPI-IO implementation
> > makes. In this case, it sounds like some very poorly-performing
> > choices have been made.
> >
> > There are a few hint configurations you might want to try:
> > Configuration #1:
> >
> > - set "romio_cb_write" to "enable" -- on Lustre, this is almost
> > always the right choice.
> >
> > Configuration #2:
> >
> > - set "romio_cb_write" to "disable"
> > - set "romio_ds_write" to "disable"
> >
> > This configuration turns off all optimizations, but it also avoids
> > costly file locks.
> >
> > If either of those configurations works, let us know.
> >
> > The good news is that MPI-IO support on Lustre has recently gotten a
> > lot more attention. As the improvements make their way out to more
> > systems, you might not have to set all these hints.
> >
> > ==rob
> >
> > --
> > Rob Latham
> > Mathematics and Computer Science Division
> > Argonne National Lab, IL USA
> >
>
>
>
> --
>
> Dr. Jose Gracia email: gracia at hlrs.de
> HLRS, Uni Stuttgart http://www.hlrs.de/people/gracia
> Nobelstrasse 19 phone: +49 711 685 87208
> 70569 Stuttgart fax: +49 711 685 65832
> Germany
>
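A note on the two ltrace listings quoted above: the datatype arguments
look self-consistent, which suggests the buffers are described
correctly before MPI_File_write_all is reached. The sketch below redoes
that arithmetic. The struct and function names are made up for this
illustration, and the 8-byte element size is an assumption inferred
from the malloc sizes (1105920 / 138240 = 8), not something the trace
states.

```c
#include <assert.h>
#include <stddef.h>

/* Shape of one rank's selection, read off the ltrace arguments. */
struct selection {
    size_t rows;      /* count passed to MPI_Type_vector */
    size_t blocklen;  /* elements kept per row */
    size_t stride;    /* elements spanned per row */
    size_t planes;    /* count passed to the first MPI_Type_create_hvector */
    size_t elem_size; /* bytes per element; 8 is an assumption */
};

/* Trace 1 (pid 1899):  MPI_Type_vector(128, 15, 17, ...), 72 planes. */
static const struct selection TRACE1 = { 128, 15, 17, 72, 8 };
/* Trace 2 (pid 28099): MPI_Type_vector(128, 5, 7, ...), 72 planes. */
static const struct selection TRACE2 = { 128, 5, 7, 72, 8 };

/* Number of elements actually selected by the datatype. */
size_t packed_elements(const struct selection *s)
{
    assert(s != NULL);
    return s->planes * s->rows * s->blocklen;
}

/* Size of the contiguous buffer the selection is packed into. */
size_t packed_bytes(const struct selection *s)
{
    return packed_elements(s) * s->elem_size;
}

/* Byte stride between planes: one full plane including skipped elements. */
size_t plane_stride_bytes(const struct selection *s)
{
    assert(s != NULL);
    return s->rows * s->stride * s->elem_size;
}

/* Byte stride given to the outermost hvector: all planes of one block. */
size_t outer_stride_bytes(const struct selection *s)
{
    return s->planes * plane_stride_bytes(s);
}
```

For trace 1 this gives 138240 elements, 1105920 bytes, a plane stride
of 17408 and an outer stride of 0x132000; for trace 2 it gives 46080,
368640, 7168 and 516096. These are the values that appear in the
malloc, MPI_Pack and MPI_Type_create_hvector lines of the listings.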