collective memory-mapped array

Jose Gracia gracia at hlrs.de
Fri Jan 29 05:02:04 CST 2010


Hello Wei-keng,


 >A suggestion to see whether it is the Lustre ADIO driver's problem is
 >to use the "ufs:" file name prefix, e.g. ufs:/home/user_id/testfile.dat
 >
 >This will force MPI-IO to use the "generic" Unix File System driver.
 >
 >Wei-keng

I tried that and the problem remains, so it doesn't seem to be 
connected to Lustre after all.
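
In case it is useful to others: the prefix goes straight into the file
name passed at create time. A minimal sketch with pnetcdf (the path is
the one from Wei-keng's example; the create mode is a placeholder):

   int ncid;
   /* "ufs:" forces ROMIO's generic Unix File System driver;
      needs pnetcdf.h and mpi.h */
   ncmpi_create(MPI_COMM_WORLD, "ufs:/home/user_id/testfile.dat",
                NC_CLOBBER, MPI_INFO_NULL, &ncid);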

I'd appreciate any other ideas.

Cheers, Jose




 > On Tue, Jan 26, 2010 at 11:31:34AM +0100, Jose Gracia wrote:
 >> Hi Rob, everybody,
 >>
 >> thanks for looking into this.
 >
 > Hi Jose.  Please make sure to copy the parallel-netcdf at mcs.anl.gov
 > mailing list: there might be others on the list who have some
 > experience with a problem like yours.
 >
 >>> OK, I think we need to know if this is a hang or just very very slow
 >>> response from the underlying file system.
 >>
 >> Normally the operation would complete in under a minute. I left it
 >> running for several hours a while ago with the same result.
 >>
 >>> On your system, do you have a way to capture a backtrace of some of
 >>> the MPI processes?  I would like to see what the hung processes are
 >>> trying to do.
 >>>
 >> The only thing that comes to mind is ltrace. We have DDT installed,
 >> but I am not familiar with it.
 >>
 >> I have added an MPI_Barrier and sleep(30) just in front of the line
 >> quoted above in order to have time to attach ltrace to the
 >> processes. Below are traces from two different nodes (total of 8
 >> nodes running 4 MPI tasks each). I don't see the ncmpi_* calls in
 >> the trace ... probably have to recompile with debugging symbols?
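 >>
 >> Schematically, the instrumentation is just (a sketch; the write call
 >> itself is the collective one quoted above):
 >>
 >>    MPI_Barrier(MPI_COMM_WORLD);
 >>    sleep(30);   /* window to attach the tracer to each process */
 >>    /* ... collective pnetcdf write ... */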
 >
 > ltrace is only able to trace symbols in shared libraries, and unless
 > you took extraordinary measures, you have a static libpnetcdf.a.
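 >
 > A more direct way to get a backtrace is to attach gdb to one of the
 > stuck processes on a compute node; a minimal sketch (replace <pid>
 > with the pid of a hung MPI task, e.g. from ps):
 >
 >    gdb -batch -p <pid> -ex "thread apply all bt"
 >
 > Unlike ltrace, this should also show statically linked frames such as
 > the ncmpi_* calls, provided the binary is not stripped.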
 >
 >> ------------
 >>
 >> trace 1, on the master node. The SIGs towards the end come from me
 >> trying to kill the run.
 >>
 >> 1899 MPI_Type_get_envelope(0x6ae700, 0x7fffffffccbc, 0x7fffffffccb8, 0x7fffffffccb4, 0x7fffffffccb0) = 0
 >> 1899 MPI_Type_size(0x6ae700, 0x7fffffffcd70, 0x7fffffffcd58, 0, 0x7fffffffccb8) = 0
 >> 1899 MPI_Type_vector(128, 15, 17, 0x6ae700, 0x7fffffffcd48) = 0
 >> 1899 MPI_Type_commit(0x7fffffffcd48, 1, 0x6ae700, 8, 0) = 0
 >> 1899 MPI_Type_create_hvector(72, 1, 17408, 0x2aaad21409c0, 0x7fffffffcd50) = 0
 >> 1899 MPI_Type_free(0x7fffffffcd48, 1, 0x2aaad21409c0, 8, 0x7fffffffccd0) = 0
 >> 1899 MPI_Type_commit(0x7fffffffcd50, 1, 0, 8, 0x7fffffffccd0) = 0
 >> 1899 MPI_Type_create_hvector(1, 1, 0x132000, 0x2aaad2140c10, 0x7fffffffcd50) = 0
 >> 1899 MPI_Type_free(0x7fffffffcd48, 1, 0x2aaad2140c10, 8, 0x7fffffffccd0) = 0
 >> 1899 MPI_Type_commit(0x7fffffffcd50, 1, 0, 8, 0x7fffffffccd0) = 0
 >> 1899 malloc(1105920) = 0x2aaad64c8010
 >> 1899 MPI_Type_size(0x2aaad60681c0, 0x7fffffffccc4, 0x2aaad60681c0, 0x2aaad64c8010, 138240) = 0
 >> 1899 MPI_Type_size(0x6ae700, 0x7fffffffccc0, 0x2aaad60681c0, 0x2aaad64c8010, 138240) = 0
 >> 1899 MPI_Pack_size(1, 0x2aaad60681c0, 0x6ae4e0, 0x7fffffffccbc, 138240) = 0
 >> 1899 malloc(1105920) = 0x2aaad65d6020
 >> 1899 MPI_Pack(0x2b967c08, 1, 0x2aaad60681c0, 0x2aaad65d6020, 0x10e000) = 0
 >> 1899 MPI_Unpack(0x2aaad65d6020, 0x10e000, 0x7fffffffccb8, 0x2aaad64c8010, 138240) = 0
 >> 1899 free(0x2aaad65d6020) = <void>
 >> 1899 MPI_Type_free(0x7fffffffcd48, 0x2aaad65d6020, 0x2aaad66e4020, 0x2aaad65d6010, 0x2aaad65d6020) = 0
 >> 1899 MPI_Type_get_envelope(0x6ae700, 0x7fffffffcbdc, 0x7fffffffcbd8, 0x7fffffffcbd4, 0x7fffffffcbd0) = 0
 >> 1899 MPI_Type_size(0x6ae700, 0x7fffffffccb8, 0x7fffffffcc90, 0, 0x7fffffffcbd8) = 0
 >> 1899 malloc(16) = 0x2aaabc1e3e00
 >> 1899 malloc(16) = 0x2aaac12dd680
 >> 1899 malloc(16) = 0x2aaabc1e3e60
 >> 1899 MPI_Type_create_subarray(4, 0x2aaabc1e3e00, 0x2aaac12dd680, 0x2aaabc1e3e60, 0) = 0
 >> 1899 MPI_Type_commit(0x7fffffffcb00, 1, 0x6aef20, 8, 0x24000000000) = 0
 >> 1899 MPI_File_set_view(0x2aaad2147c00, 0x1a29464, 0x6aef20, 0x2aaad21407b0, 0x4a0edf) = 0
 >> 1899 MPI_Type_free(0x7fffffffcb00, 0, 0, 0, 0x2aaab08f7000) = 0
 >> 1899 free(0x2aaabc1e3e00) = <void>
 >> 1899 free(0x2aaac12dd680) = <void>
 >> 1899 free(0x2aaabc1e3e60) = <void>
 >> 1899 malloc(1105920) = 0x2aaad65d6020
 >> 1899 MPI_File_write_all(0x2aaad2147c00, 0x2aaad65d6020, 0x10e000, 0x6aef20, 0x7fffffffcc70 <unfinished ...>
 >
 > OK, what I can see from all this is that you've set up a noncontiguous
 > file view -- not surprising for a 4D variable.  Things make it pretty
 > far but get stuck in MPI_File_write_all.  Perfect: that's actually
 > what I expected to see.
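 >
 > In code, the tail of that trace corresponds to the usual collective
 > write sequence, roughly like the sketch below (the shapes, counts and
 > etype are made-up placeholders, not values recovered from the trace;
 > fh and buf stand for the already-opened file handle and data buffer):
 >
 >    /* fh: MPI_File from MPI_File_open; buf: nelems local doubles */
 >    int gsizes[4] = {4, 8, 128, 136};  /* global array shape      */
 >    int lsizes[4] = {1, 8, 128, 17};   /* this rank's local block */
 >    int starts[4] = {0, 0, 0, 0};      /* offset of the block     */
 >    int nelems = 1 * 8 * 128 * 17;     /* local element count     */
 >    MPI_Datatype filetype;
 >    MPI_Type_create_subarray(4, gsizes, lsizes, starts, MPI_ORDER_C,
 >                             MPI_DOUBLE, &filetype);
 >    MPI_Type_commit(&filetype);
 >    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native",
 >                      MPI_INFO_NULL);
 >    MPI_File_write_all(fh, buf, nelems, MPI_DOUBLE, MPI_STATUS_IGNORE);
 >    MPI_Type_free(&filetype);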
 >
 >> ------------
 >>
 >> trace 2, from an arbitrary node:
 >>
 >> 28099 MPI_Barrier(0x6ae880, 0, 0, 0, 0) = 0
 >> 28099 printf("IOOGSNC: rank %d about to write "..., 16) = 57
 >> 28099 sleep(30) = 0
 >> 28099 MPI_Type_get_envelope(0x6ae700, 0x7fffffffcc7c, 0x7fffffffcc78, 0x7fffffffcc74, 0x7fffffffcc70) = 0
 >> 28099 MPI_Type_size(0x6ae700, 0x7fffffffcd30, 0x7fffffffcd18, 0, 0x7fffffffcc78) = 0
 >> 28099 MPI_Type_vector(128, 5, 7, 0x6ae700, 0x7fffffffcd08) = 0
 >> 28099 MPI_Type_commit(0x7fffffffcd08, 1, 0x6ae700, 8, 0) = 0
 >> 28099 MPI_Type_create_hvector(72, 1, 7168, 0x2aaac0430fd0, 0x7fffffffcd10) = 0
 >> 28099 MPI_Type_free(0x7fffffffcd08, 1, 0x2aaac0430fd0, 8, 0x7fffffffcc90) = 0
 >> 28099 MPI_Type_commit(0x7fffffffcd10, 1, 0, 8, 0x7fffffffcc90) = 0
 >> 28099 MPI_Type_create_hvector(1, 1, 516096, 0x2aaac044df90, 0x7fffffffcd10) = 0
 >> 28099 MPI_Type_free(0x7fffffffcd08, 1, 0x2aaac044df90, 8, 0x7fffffffcc90) = 0
 >> 28099 MPI_Type_commit(0x7fffffffcd10, 1, 0, 8, 0x7fffffffcc90) = 0
 >> 28099 malloc(368640) = 0x2aaac045cab0
 >> 28099 MPI_Type_size(0x2aaac044ae10, 0x7fffffffcc84, 0x2aaac044ae10, 0x2aaac045cab0, 46080) = 0
 >> 28099 MPI_Type_size(0x6ae700, 0x7fffffffcc80, 0x2aaac044ae10, 0x2aaac045cab0, 46080) = 0
 >> 28099 MPI_Pack_size(1, 0x2aaac044ae10, 0x6ae4e0, 0x7fffffffcc7c, 46080) = 0
 >> 28099 malloc(368640) = 0x2aaac04d2310
 >> 28099 MPI_Pack(0x22ed8d78, 1, 0x2aaac044ae10, 0x2aaac04d2310, 368640) = 0
 >> 28099 MPI_Unpack(0x2aaac04d2310, 368640, 0x7fffffffcc78, 0x2aaac045cab0, 46080) = 0
 >> 28099 free(0x2aaac04d2310) = <void>
 >> 28099 MPI_Type_free(0x7fffffffcd08, 0x2aaac04d2310, 0x3080151a20, 0, 0x2aaac04d2310) = 0
 >> 28099 MPI_Type_get_envelope(0x6ae700, 0x7fffffffcb9c, 0x7fffffffcb98, 0x7fffffffcb94, 0x7fffffffcb90) = 0
 >> 28099 MPI_Type_size(0x6ae700, 0x7fffffffcc78, 0x7fffffffcc50, 0, 0x7fffffffcb98) = 0
 >> 28099 malloc(16) = 0x2aaabdb60a50
 >> 28099 malloc(16) = 0x3fedef10
 >> 28099 malloc(16) = 0xa1c270
 >> 28099 MPI_Type_create_subarray(4, 0x2aaabdb60a50, 0x3fedef10, 0xa1c270, 0) = 0
 >> 28099 MPI_Type_commit(0x7fffffffcac0, 1, 0x6aef20, 8, 0x68000000000) = 0
 >> 28099 MPI_File_set_view(0x2aaac0751de0, 0x1a29464, 0x6aef20, 0x2aaac0449560, 0x4a0edf) = 0
 >> 28099 MPI_Type_free(0x7fffffffcac0, 0, 0, 0, 0)  = 0
 >> 28099 free(0x2aaabdb60a50) = <void>
 >> 28099 free(0x3fedef10) = <void>
 >> 28099 free(0xa1c270) = <void>
 >> 28099 malloc(368640) = 0x2aaac04ed340
 >> 28099 MPI_File_write_all(0x2aaac0751de0, 0x2aaac04ed340, 368640, 0x6aef20, 0x7fffffffcc30 <unfinished ...>
 >
 > Another process stuck in MPI_File_write_all.   A good sign, in that it
 > suggests your MPI processes are indeed stuck in I/O, and not stuck in
 > exchanging messages or anything like that.
 >
 > Are you writing directly to Lustre, or are you writing to NFS-exported
 > Lustre?
 >
 > I think we need one more piece of information.  You said if you run on
 > one big node, you can write very quickly.  Can you send the output of
 > 'ncmpidump -h' or 'ncdump -h' on a completed dataset?
 >
 > I also have one trick you might want to try: Are you familiar with
 > MPI-IO "Info" objects?  When you create the file, you are (probably)
 > passing in MPI_INFO_NULL.  If instead you set up your own info object,
 > we can guide some of the choices the underlying MPI-IO implementation
 > makes.  In this case, it sounds like some very poorly-performing
 > choices have been made.
 >
 > There are a few hint configurations you might want to try:
 > Configuration #1:
 >
 > - set "romio_cb_write" to "enable" -- on Lustre, this is almost always
 >  the right choice.
 >
 > Configuration #2:
 >
 > - set "romio_cb_write" to "disable"
 > - set "romio_ds_write" to "disable"
 >
 > This configuration turns off all optimizations, but it also avoids
 > costly file locks.
 >
 > If either of those configurations works, let us know.
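 >
 > For reference, setting the hints is only a few lines at file-creation
 > time; a sketch with pnetcdf (the file name is a placeholder, and only
 > one configuration should be active at a time):
 >
 >    MPI_Info info;
 >    int ncid;
 >    MPI_Info_create(&info);
 >    /* configuration #1: */
 >    MPI_Info_set(info, "romio_cb_write", "enable");
 >    /* configuration #2 would instead be:
 >       MPI_Info_set(info, "romio_cb_write", "disable");
 >       MPI_Info_set(info, "romio_ds_write", "disable");
 >    */
 >    ncmpi_create(MPI_COMM_WORLD, "testfile.nc", NC_CLOBBER,
 >                 info, &ncid);
 >    MPI_Info_free(&info);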
 >
 > The good news is that MPI-IO support on Lustre has recently gotten a
 > lot more attention.  As the improvements make their way out to more
 > systems, you might not have to set all these hints.
 >
 > ==rob
 >
 > --
 > Rob Latham
 > Mathematics and Computer Science Division
 > Argonne National Lab, IL USA
 >



-- 

Dr. Jose Gracia		email:  gracia at hlrs.de
HLRS, Uni Stuttgart	http://www.hlrs.de/people/gracia
Nobelstrasse 19		phone: +49 711 685 87208
70569 Stuttgart		fax:   +49 711 685 65832
Germany

