collective memory-mapped array

Rob Latham robl at mcs.anl.gov
Tue Jan 26 10:03:31 CST 2010


On Tue, Jan 26, 2010 at 11:31:34AM +0100, Jose Gracia wrote:
> Hi Rob, everybody,
> 
> thanks for looking into this.

Hi Jose.  Please make sure to copy the parallel-netcdf at mcs.anl.gov
mailing list: there might be others on the list who have some
experience with a problem like yours.

> >OK, I think we need to know if this is a hang or just very very slow
> >response from the underlying file system.
>
> Normally the operation would complete in under a minute. I left it
> running for several hours a while ago with the same result.
> 
> >On your system, do you have a way to capture a backtrace of some of
> >the MPI processors?  I would like to see what the hung processes are
> >trying to do.
> >
> The only thing coming to my mind is ltrace. We have DDT installed,
> but I am not familiar with it.
> 
> I have added an MPI_Barrier and sleep(30) just in front of the line
> quoted above in order to have time to attach the ltrace to the
> processes. Below are traces from two different nodes (total of 8
> nodes running 4 MPI tasks each). I don't see the ncmpi_* calls in
> the trace ... probably have to recompile with debugging symbols?

ltrace is only able to trace symbols in shared libraries, and unless
you took extraordinary measures, you have a static libpnetcdf.a
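Even with a static library, though, a debugger can attach to a hung
rank and show the full call stack, ncmpi_* frames included.  A sketch
(the PID is a placeholder; find the real one with something like
pgrep on the node):

```shell
# 12345 is a hypothetical PID of a stuck MPI process.
# -batch runs the given commands and exits; "thread apply all bt"
# prints a backtrace for every thread before detaching.
gdb -p 12345 -batch -ex "thread apply all bt" -ex "detach"
```

DDT can do the same thing through its GUI, but the one-liner above is
often quicker on a batch node.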

> ------------
> 
> trace 1, on the master node. The SIGs towards the end come from me
> trying to kill the run.
> 
> 1899 MPI_Type_get_envelope(0x6ae700, 0x7fffffffccbc, 0x7fffffffccb8,
> 0x7fffffffccb4, 0x7fffffffccb0) = 0
> 1899 MPI_Type_size(0x6ae700, 0x7fffffffcd70, 0x7fffffffcd58, 0,
> 0x7fffffffccb8) = 0
> 1899 MPI_Type_vector(128, 15, 17, 0x6ae700, 0x7fffffffcd48) = 0
> 1899 MPI_Type_commit(0x7fffffffcd48, 1, 0x6ae700, 8, 0) = 0
> 1899 MPI_Type_create_hvector(72, 1, 17408, 0x2aaad21409c0,
> 0x7fffffffcd50) = 0
> 1899 MPI_Type_free(0x7fffffffcd48, 1, 0x2aaad21409c0, 8, 0x7fffffffccd0) = 0
> 1899 MPI_Type_commit(0x7fffffffcd50, 1, 0, 8, 0x7fffffffccd0) = 0
> 1899 MPI_Type_create_hvector(1, 1, 0x132000, 0x2aaad2140c10,
> 0x7fffffffcd50) = 0
> 1899 MPI_Type_free(0x7fffffffcd48, 1, 0x2aaad2140c10, 8, 0x7fffffffccd0) = 0
> 1899 MPI_Type_commit(0x7fffffffcd50, 1, 0, 8, 0x7fffffffccd0) = 0
> 1899 malloc(1105920)                             = 0x2aaad64c8010
> 1899 MPI_Type_size(0x2aaad60681c0, 0x7fffffffccc4, 0x2aaad60681c0,
> 0x2aaad64c8010, 138240) = 0
> 1899 MPI_Type_size(0x6ae700, 0x7fffffffccc0, 0x2aaad60681c0,
> 0x2aaad64c8010, 138240) = 0
> 1899 MPI_Pack_size(1, 0x2aaad60681c0, 0x6ae4e0, 0x7fffffffccbc, 138240) = 0
> 1899 malloc(1105920)                             = 0x2aaad65d6020
> 1899 MPI_Pack(0x2b967c08, 1, 0x2aaad60681c0, 0x2aaad65d6020, 0x10e000) = 0
> 1899 MPI_Unpack(0x2aaad65d6020, 0x10e000, 0x7fffffffccb8,
> 0x2aaad64c8010, 138240) = 0
> 1899 free(0x2aaad65d6020)                        = <void>
> 1899 MPI_Type_free(0x7fffffffcd48, 0x2aaad65d6020, 0x2aaad66e4020,
> 0x2aaad65d6010, 0x2aaad65d6020) = 0
> 1899 MPI_Type_get_envelope(0x6ae700, 0x7fffffffcbdc, 0x7fffffffcbd8,
> 0x7fffffffcbd4, 0x7fffffffcbd0) = 0
> 1899 MPI_Type_size(0x6ae700, 0x7fffffffccb8, 0x7fffffffcc90, 0,
> 0x7fffffffcbd8) = 0
> 1899 malloc(16)                                  = 0x2aaabc1e3e00
> 1899 malloc(16)                                  = 0x2aaac12dd680
> 1899 malloc(16)                                  = 0x2aaabc1e3e60
> 1899 MPI_Type_create_subarray(4, 0x2aaabc1e3e00, 0x2aaac12dd680,
> 0x2aaabc1e3e60, 0) = 0
> 1899 MPI_Type_commit(0x7fffffffcb00, 1, 0x6aef20, 8, 0x24000000000) = 0
> 1899 MPI_File_set_view(0x2aaad2147c00, 0x1a29464, 0x6aef20,
> 0x2aaad21407b0, 0x4a0edf) = 0
> 1899 MPI_Type_free(0x7fffffffcb00, 0, 0, 0, 0x2aaab08f7000) = 0
> 1899 free(0x2aaabc1e3e00)                        = <void>
> 1899 free(0x2aaac12dd680)                        = <void>
> 1899 free(0x2aaabc1e3e60)                        = <void>
> 1899 malloc(1105920)                             = 0x2aaad65d6020
> 1899 MPI_File_write_all(0x2aaad2147c00, 0x2aaad65d6020, 0x10e000,
> 0x6aef20, 0x7fffffffcc70 <unfinished ...>

OK, what I can see from all this is that you've set up a
noncontiguous access pattern in the file -- not surprising for a 4D
variable.  Things make it pretty far but get stuck in
MPI_File_write_all.  Perfect: that's actually what I expected to see.

> ------------
> 
> trace 2, from an arbitrary node:
> 
> 28099 MPI_Barrier(0x6ae880, 0, 0, 0, 0)          = 0
> 28099 printf("IOOGSNC: rank %d about to write "..., 16) = 57
> 28099 sleep(30)                                  = 0
> 28099 MPI_Type_get_envelope(0x6ae700, 0x7fffffffcc7c,
> 0x7fffffffcc78, 0x7fffffffcc74, 0x7fffffffcc70) = 0
> 28099 MPI_Type_size(0x6ae700, 0x7fffffffcd30, 0x7fffffffcd18, 0,
> 0x7fffffffcc78) = 0
> 28099 MPI_Type_vector(128, 5, 7, 0x6ae700, 0x7fffffffcd08) = 0
> 28099 MPI_Type_commit(0x7fffffffcd08, 1, 0x6ae700, 8, 0) = 0
> 28099 MPI_Type_create_hvector(72, 1, 7168, 0x2aaac0430fd0,
> 0x7fffffffcd10) = 0
> 28099 MPI_Type_free(0x7fffffffcd08, 1, 0x2aaac0430fd0, 8,
> 0x7fffffffcc90) = 0
> 28099 MPI_Type_commit(0x7fffffffcd10, 1, 0, 8, 0x7fffffffcc90) = 0
> 28099 MPI_Type_create_hvector(1, 1, 516096, 0x2aaac044df90,
> 0x7fffffffcd10) = 0
> 28099 MPI_Type_free(0x7fffffffcd08, 1, 0x2aaac044df90, 8,
> 0x7fffffffcc90) = 0
> 28099 MPI_Type_commit(0x7fffffffcd10, 1, 0, 8, 0x7fffffffcc90) = 0
> 28099 malloc(368640)                             = 0x2aaac045cab0
> 28099 MPI_Type_size(0x2aaac044ae10, 0x7fffffffcc84, 0x2aaac044ae10,
> 0x2aaac045cab0, 46080) = 0
> 28099 MPI_Type_size(0x6ae700, 0x7fffffffcc80, 0x2aaac044ae10,
> 0x2aaac045cab0, 46080) = 0
> 28099 MPI_Pack_size(1, 0x2aaac044ae10, 0x6ae4e0, 0x7fffffffcc7c, 46080) = 0
> 28099 malloc(368640)                             = 0x2aaac04d2310
> 28099 MPI_Pack(0x22ed8d78, 1, 0x2aaac044ae10, 0x2aaac04d2310, 368640) = 0
> 28099 MPI_Unpack(0x2aaac04d2310, 368640, 0x7fffffffcc78,
> 0x2aaac045cab0, 46080) = 0
> 28099 free(0x2aaac04d2310)                       = <void>
> 28099 MPI_Type_free(0x7fffffffcd08, 0x2aaac04d2310, 0x3080151a20, 0,
> 0x2aaac04d2310) = 0
> 28099 MPI_Type_get_envelope(0x6ae700, 0x7fffffffcb9c,
> 0x7fffffffcb98, 0x7fffffffcb94, 0x7fffffffcb90) = 0
> 28099 MPI_Type_size(0x6ae700, 0x7fffffffcc78, 0x7fffffffcc50, 0,
> 0x7fffffffcb98) = 0
> 28099 malloc(16)                                 = 0x2aaabdb60a50
> 28099 malloc(16)                                 = 0x3fedef10
> 28099 malloc(16)                                 = 0xa1c270
> 28099 MPI_Type_create_subarray(4, 0x2aaabdb60a50, 0x3fedef10,
> 0xa1c270, 0) = 0
> 28099 MPI_Type_commit(0x7fffffffcac0, 1, 0x6aef20, 8, 0x68000000000) = 0
> 28099 MPI_File_set_view(0x2aaac0751de0, 0x1a29464, 0x6aef20,
> 0x2aaac0449560, 0x4a0edf) = 0
> 28099 MPI_Type_free(0x7fffffffcac0, 0, 0, 0, 0)  = 0
> 28099 free(0x2aaabdb60a50)                       = <void>
> 28099 free(0x3fedef10)                           = <void>
> 28099 free(0xa1c270)                             = <void>
> 28099 malloc(368640)                             = 0x2aaac04ed340
> 28099 MPI_File_write_all(0x2aaac0751de0, 0x2aaac04ed340, 368640,
> 0x6aef20, 0x7fffffffcc30 <unfinished ...>

Another process stuck in MPI_File_write_all.   A good sign, in that it
suggests your MPI processes are indeed stuck in I/O, and not stuck in
exchanging messages or anything like that.

Are you writing directly to Lustre, or are you writing to NFS-exported
Lustre?

I think we need one more piece of information.  You said if you run on
one big node, you can write very quickly.  Can you send the output of
'ncmpidump -h' or 'ncdump -h' on a completed dataset?

I also have one trick you might want to try: Are you familiar with
MPI-IO "Info" objects?  When you create the file, you are (probably)
passing in MPI_INFO_NULL.  If instead you set up your own info object,
we can guide some of the choices the underlying MPI-IO implementation
makes.  In this case, it sounds like some very poorly-performing
choices have been made.

There are a few hint configurations you might want to try:
Configuration #1:

- set "romio_cb_write" to "enable" -- on Lustre, this is almost always
  the right choice. 

Configuration #2:

- set "romio_cb_write" to "disable" 
- set "romio_ds_write" to "disable"

This configuration turns off all optimizations, but it also avoids
costly file locks.  

If either of those configurations works, let us know.   
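To make the mechanics concrete, here's a sketch of passing those
hints through pnetcdf at file-creation time.  The hint names are
ROMIO's; the filename and function name are placeholders, and this
assumes a working MPI + pnetcdf environment:

```c
#include <mpi.h>
#include <pnetcdf.h>

/* Sketch: create a file with ROMIO hints instead of MPI_INFO_NULL.
 * "output.nc" and create_with_hints() are hypothetical names. */
int create_with_hints(MPI_Comm comm, int *ncidp)
{
    MPI_Info info;
    MPI_Info_create(&info);

    /* Configuration #1: force collective buffering on writes. */
    MPI_Info_set(info, "romio_cb_write", "enable");

    /* Configuration #2 (use instead of the above, not together):
     * MPI_Info_set(info, "romio_cb_write", "disable");
     * MPI_Info_set(info, "romio_ds_write", "disable");
     */

    int err = ncmpi_create(comm, "output.nc", NC_CLOBBER, info, ncidp);

    /* pnetcdf keeps its own copy, so the info object can be freed. */
    MPI_Info_free(&info);
    return err;
}
```

You can confirm which hints actually took effect afterwards with
MPI_File_get_info (or ncmpi_get_file_info) on the open file.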

The good news is that MPI-IO support on Lustre has recently gotten a
lot more attention.  As the improvements make their way out to more
systems, you might not have to set all these hints.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

