collective memory-mapped array

Wei-keng Liao wkliao at ece.northwestern.edu
Tue Jan 26 11:08:40 CST 2010


One way to check whether the Lustre ADIO driver is at fault is to use
the "ufs:" file name prefix, e.g. ufs:/home/user_id/testfile.dat

This will force MPI-IO to use the "generic" Unix File System driver.
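
A minimal sketch of building such a name in C, in case it helps. The
helper name with_adio_prefix is made up for illustration and is not
part of PnetCDF or ROMIO; it only glues the prefix onto the path.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of PnetCDF or ROMIO): writes
 * "<prefix>:<path>" into out, e.g. "ufs:/home/user_id/testfile.dat".
 * Returns 0 on success, -1 if the result does not fit in outlen bytes. */
int with_adio_prefix(const char *prefix, const char *path,
                     char *out, size_t outlen)
{
    assert(prefix != NULL && path != NULL && out != NULL);

    /* prefix + ':' + path + terminating NUL */
    size_t need = strlen(prefix) + 1 + strlen(path) + 1;
    if (need > outlen)
        return -1;

    snprintf(out, outlen, "%s:%s", prefix, path);
    return 0;
}
```

The resulting string would then be passed as the path argument of
ncmpi_create() (or MPI_File_open()) in place of the plain file name.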

Wei-keng

On Jan 26, 2010, at 10:03 AM, Rob Latham wrote:

> On Tue, Jan 26, 2010 at 11:31:34AM +0100, Jose Gracia wrote:
>> Hi Rob, everybody,
>>
>> thanks for looking into this.
>
> Hi Jose.  Please make sure to copy the parallel-netcdf at mcs.anl.gov
> mailing list: there might be others on the list who have some
> experience with a problem like yours.
>
>>> OK, I think we need to know if this is a hang or just very very slow
>>> response from the underlying file system.
>>
>> Normally the operation would complete in under a minute. I left it
>> running for several hours a while ago with the same result.
>>
>>> On your system, do you have a way to capture a backtrace of some of
>>> the MPI processes?  I would like to see what the hung processes are
>>> trying to do.
>>>
>> The only thing coming to my mind is ltrace. We have DDT installed,
>> but I am not familiar with it.
>>
>> I have added an MPI_Barrier and sleep(30) just in front of the line
>> quoted above in order to have time to attach ltrace to the
>> processes. Below are traces from two different nodes (total of 8
>> nodes running 4 MPI tasks each). I don't see the ncmpi_* calls in
>> the trace ... probably have to recompile with debugging symbols?
>
> ltrace is only able to trace symbols in shared libraries, and unless
> you took extraordinary measures, you have a static libpnetcdf.a, so
> the ncmpi_* calls will not show up.
>
>> ------------
>>
>> trace 1, on the master node. The SIGs towards the end come from me
>> trying to kill the run.
>>
>> 1899 MPI_Type_get_envelope(0x6ae700, 0x7fffffffccbc, 0x7fffffffccb8,
>> 0x7fffffffccb4, 0x7fffffffccb0) = 0
>> 1899 MPI_Type_size(0x6ae700, 0x7fffffffcd70, 0x7fffffffcd58, 0,
>> 0x7fffffffccb8) = 0
>> 1899 MPI_Type_vector(128, 15, 17, 0x6ae700, 0x7fffffffcd48) = 0
>> 1899 MPI_Type_commit(0x7fffffffcd48, 1, 0x6ae700, 8, 0) = 0
>> 1899 MPI_Type_create_hvector(72, 1, 17408, 0x2aaad21409c0,
>> 0x7fffffffcd50) = 0
>> 1899 MPI_Type_free(0x7fffffffcd48, 1, 0x2aaad21409c0, 8,  
>> 0x7fffffffccd0) = 0
>> 1899 MPI_Type_commit(0x7fffffffcd50, 1, 0, 8, 0x7fffffffccd0) = 0
>> 1899 MPI_Type_create_hvector(1, 1, 0x132000, 0x2aaad2140c10,
>> 0x7fffffffcd50) = 0
>> 1899 MPI_Type_free(0x7fffffffcd48, 1, 0x2aaad2140c10, 8,  
>> 0x7fffffffccd0) = 0
>> 1899 MPI_Type_commit(0x7fffffffcd50, 1, 0, 8, 0x7fffffffccd0) = 0
>> 1899 malloc(1105920)                             = 0x2aaad64c8010
>> 1899 MPI_Type_size(0x2aaad60681c0, 0x7fffffffccc4, 0x2aaad60681c0,
>> 0x2aaad64c8010, 138240) = 0
>> 1899 MPI_Type_size(0x6ae700, 0x7fffffffccc0, 0x2aaad60681c0,
>> 0x2aaad64c8010, 138240) = 0
>> 1899 MPI_Pack_size(1, 0x2aaad60681c0, 0x6ae4e0, 0x7fffffffccbc,  
>> 138240) = 0
>> 1899 malloc(1105920)                             = 0x2aaad65d6020
>> 1899 MPI_Pack(0x2b967c08, 1, 0x2aaad60681c0, 0x2aaad65d6020,  
>> 0x10e000) = 0
>> 1899 MPI_Unpack(0x2aaad65d6020, 0x10e000, 0x7fffffffccb8,
>> 0x2aaad64c8010, 138240) = 0
>> 1899 free(0x2aaad65d6020)                        = <void>
>> 1899 MPI_Type_free(0x7fffffffcd48, 0x2aaad65d6020, 0x2aaad66e4020,
>> 0x2aaad65d6010, 0x2aaad65d6020) = 0
>> 1899 MPI_Type_get_envelope(0x6ae700, 0x7fffffffcbdc, 0x7fffffffcbd8,
>> 0x7fffffffcbd4, 0x7fffffffcbd0) = 0
>> 1899 MPI_Type_size(0x6ae700, 0x7fffffffccb8, 0x7fffffffcc90, 0,
>> 0x7fffffffcbd8) = 0
>> 1899 malloc(16)                                  = 0x2aaabc1e3e00
>> 1899 malloc(16)                                  = 0x2aaac12dd680
>> 1899 malloc(16)                                  = 0x2aaabc1e3e60
>> 1899 MPI_Type_create_subarray(4, 0x2aaabc1e3e00, 0x2aaac12dd680,
>> 0x2aaabc1e3e60, 0) = 0
>> 1899 MPI_Type_commit(0x7fffffffcb00, 1, 0x6aef20, 8, 0x24000000000)  
>> = 0
>> 1899 MPI_File_set_view(0x2aaad2147c00, 0x1a29464, 0x6aef20,
>> 0x2aaad21407b0, 0x4a0edf) = 0
>> 1899 MPI_Type_free(0x7fffffffcb00, 0, 0, 0, 0x2aaab08f7000) = 0
>> 1899 free(0x2aaabc1e3e00)                        = <void>
>> 1899 free(0x2aaac12dd680)                        = <void>
>> 1899 free(0x2aaabc1e3e60)                        = <void>
>> 1899 malloc(1105920)                             = 0x2aaad65d6020
>> 1899 MPI_File_write_all(0x2aaad2147c00, 0x2aaad65d6020, 0x10e000,
>> 0x6aef20, 0x7fffffffcc70 <unfinished ...>
>
> OK, what I can see from all this is that you've set up a noncontiguous
> access in the file -- not surprising for a 4D variable.  Things make it
> pretty far but get stuck in MPI_File_write_all.  Perfect: that's
> actually what I expected to see.
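
The sizes in trace 1 can be cross-checked by hand. A small sketch,
assuming the base type 0x6ae700 is 8 bytes wide (the hvector byte
stride 17408 = 128 * 17 * 8 implies this); the function names are made
up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Payload bytes selected by MPI_Type_vector(count, blocklen, stride, elem)
 * when that vector is repeated `planes` times by MPI_Type_create_hvector.
 * The stride only spaces the blocks out in memory; it adds no payload. */
size_t packed_bytes(size_t count, size_t blocklen, size_t planes,
                    size_t elem_size)
{
    assert(elem_size > 0);
    return count * blocklen * planes * elem_size;
}

/* Byte distance between consecutive planes: one plane spans
 * count * stride elements, skipped elements included. */
size_t plane_stride_bytes(size_t count, size_t stride, size_t elem_size)
{
    return count * stride * elem_size;
}
```

For trace 1, packed_bytes(128, 15, 72, 8) gives 1105920, which matches
malloc(1105920) and the 0x10e000 count handed to MPI_File_write_all,
and plane_stride_bytes(128, 17, 8) gives the 17408 passed to
MPI_Type_create_hvector.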
>
>> ------------
>>
>> trace 2, from an arbitrary node:
>>
>> 28099 MPI_Barrier(0x6ae880, 0, 0, 0, 0)          = 0
>> 28099 printf("IOOGSNC: rank %d about to write "..., 16) = 57
>> 28099 sleep(30)                                  = 0
>> 28099 MPI_Type_get_envelope(0x6ae700, 0x7fffffffcc7c,
>> 0x7fffffffcc78, 0x7fffffffcc74, 0x7fffffffcc70) = 0
>> 28099 MPI_Type_size(0x6ae700, 0x7fffffffcd30, 0x7fffffffcd18, 0,
>> 0x7fffffffcc78) = 0
>> 28099 MPI_Type_vector(128, 5, 7, 0x6ae700, 0x7fffffffcd08) = 0
>> 28099 MPI_Type_commit(0x7fffffffcd08, 1, 0x6ae700, 8, 0) = 0
>> 28099 MPI_Type_create_hvector(72, 1, 7168, 0x2aaac0430fd0,
>> 0x7fffffffcd10) = 0
>> 28099 MPI_Type_free(0x7fffffffcd08, 1, 0x2aaac0430fd0, 8,
>> 0x7fffffffcc90) = 0
>> 28099 MPI_Type_commit(0x7fffffffcd10, 1, 0, 8, 0x7fffffffcc90) = 0
>> 28099 MPI_Type_create_hvector(1, 1, 516096, 0x2aaac044df90,
>> 0x7fffffffcd10) = 0
>> 28099 MPI_Type_free(0x7fffffffcd08, 1, 0x2aaac044df90, 8,
>> 0x7fffffffcc90) = 0
>> 28099 MPI_Type_commit(0x7fffffffcd10, 1, 0, 8, 0x7fffffffcc90) = 0
>> 28099 malloc(368640)                             = 0x2aaac045cab0
>> 28099 MPI_Type_size(0x2aaac044ae10, 0x7fffffffcc84, 0x2aaac044ae10,
>> 0x2aaac045cab0, 46080) = 0
>> 28099 MPI_Type_size(0x6ae700, 0x7fffffffcc80, 0x2aaac044ae10,
>> 0x2aaac045cab0, 46080) = 0
>> 28099 MPI_Pack_size(1, 0x2aaac044ae10, 0x6ae4e0, 0x7fffffffcc7c,  
>> 46080) = 0
>> 28099 malloc(368640)                             = 0x2aaac04d2310
>> 28099 MPI_Pack(0x22ed8d78, 1, 0x2aaac044ae10, 0x2aaac04d2310,  
>> 368640) = 0
>> 28099 MPI_Unpack(0x2aaac04d2310, 368640, 0x7fffffffcc78,
>> 0x2aaac045cab0, 46080) = 0
>> 28099 free(0x2aaac04d2310)                       = <void>
>> 28099 MPI_Type_free(0x7fffffffcd08, 0x2aaac04d2310, 0x3080151a20, 0,
>> 0x2aaac04d2310) = 0
>> 28099 MPI_Type_get_envelope(0x6ae700, 0x7fffffffcb9c,
>> 0x7fffffffcb98, 0x7fffffffcb94, 0x7fffffffcb90) = 0
>> 28099 MPI_Type_size(0x6ae700, 0x7fffffffcc78, 0x7fffffffcc50, 0,
>> 0x7fffffffcb98) = 0
>> 28099 malloc(16)                                 = 0x2aaabdb60a50
>> 28099 malloc(16)                                 = 0x3fedef10
>> 28099 malloc(16)                                 = 0xa1c270
>> 28099 MPI_Type_create_subarray(4, 0x2aaabdb60a50, 0x3fedef10,
>> 0xa1c270, 0) = 0
>> 28099 MPI_Type_commit(0x7fffffffcac0, 1, 0x6aef20, 8,  
>> 0x68000000000) = 0
>> 28099 MPI_File_set_view(0x2aaac0751de0, 0x1a29464, 0x6aef20,
>> 0x2aaac0449560, 0x4a0edf) = 0
>> 28099 MPI_Type_free(0x7fffffffcac0, 0, 0, 0, 0)  = 0
>> 28099 free(0x2aaabdb60a50)                       = <void>
>> 28099 free(0x3fedef10)                           = <void>
>> 28099 free(0xa1c270)                             = <void>
>> 28099 malloc(368640)                             = 0x2aaac04ed340
>> 28099 MPI_File_write_all(0x2aaac0751de0, 0x2aaac04ed340, 368640,
>> 0x6aef20, 0x7fffffffcc30 <unfinished ...>
>
> Another process stuck in MPI_File_write_all.   A good sign, in that it
> suggests your MPI processes are indeed stuck in I/O, and not stuck in
> exchanging messages or anything like that.
>
> Are you writing directly to Lustre, or are you writing to NFS-exported
> Lustre?
>
> I think we need one more piece of information.  You said if you run on
> one big node, you can write very quickly.  Can you send the output of
> 'ncmpidump -h' or 'ncdump -h' on a completed dataset?
>
> I also have one trick you might want to try: Are you familiar with
> MPI-IO "Info" objects?  When you create the file, you are (probably)
> passing in MPI_INFO_NULL.  If instead you set up your own info object,
> we can guide some of the choices the underlying MPI-IO implementation
> makes.  In this case, it sounds like some very poorly-performing
> choices have been made.
>
> There are a few hint configurations you might want to try:
> Configuration #1:
>
> - set "romio_cb_write" to "enable" -- on Lustre, this is almost always
>  the right choice.
>
> Configuration #2:
>
> - set "romio_cb_write" to "disable"
> - set "romio_ds_write" to "disable"
>
> This configuration turns off both write optimizations (collective
> buffering and data sieving), but it also avoids costly file locks.
>
> If either of those configurations works, let us know.
>
> The good news is that MPI-IO support on Lustre has recently gotten a
> lot more attention.  As the improvements make their way out to more
> systems, you might not have to set all these hints.
>
> ==rob
>
> -- 
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>


