collective memory-mapped array

Jose Gracia gracia at hlrs.de
Wed Jan 27 02:16:40 CST 2010



>>> On your system, do you have a way to capture a backtrace of some of
>>> the MPI processes?  I would like to see what the hung processes are
>>> trying to do.
>>>
>> The only thing that comes to mind is ltrace. We have DDT installed,
>> but I am not familiar with it.
>>
>> I have added an MPI_Barrier and a sleep(30) just in front of the line
>> quoted above in order to have time to attach ltrace to the
>> processes. Below are traces from two different nodes (8 nodes in
>> total, running 4 MPI tasks each). I don't see the ncmpi_* calls in
>> the trace ... do I perhaps have to recompile with debugging symbols?
> 
> ltrace is only able to trace symbols in shared libraries, and unless
> you took extraordinary measures, you have a static libpnetcdf.a
I see. Is there another way to trace calls into statically linked libraries as well?
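
(For reference, the pause I added in front of the write is essentially the
sketch below; each rank prints its PID so that a tracer or debugger can be
attached within the 30 s window. This is a simplified reconstruction, not
the exact lines from the model.)

-------------------
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

/* Call this right before the collective write: gives every rank a chance
 * to be inspected from outside, e.g. "ltrace -p <PID>" or "gdb -p <PID>". */
static void pause_for_tracer(void)
{
    printf("pid %d waiting for tracer\n", (int)getpid());
    fflush(stdout);
    MPI_Barrier(MPI_COMM_WORLD);   /* make sure all ranks reach this point */
    sleep(30);                     /* window for attaching the tracer      */
}
---------------------------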

>> 1899 MPI_File_write_all(0x2aaad2147c00, 0x2aaad65d6020, 0x10e000,
>> 0x6aef20, 0x7fffffffcc70 <unfinished ...>
> 
> OK, what I can see from all this is that you've set up a noncontiguous
> access pattern in the file -- not surprising for a 4D variable.  Things
> make it pretty far but get stuck in MPI_File_write_all.  Perfect: that's
> actually what I expected to see.

>> 28099 MPI_File_write_all(0x2aaac0751de0, 0x2aaac04ed340, 368640,
>> 0x6aef20, 0x7fffffffcc30 <unfinished ...>
> 
> Another process stuck in MPI_File_write_all.   A good sign, in that it
> suggests your MPI processes are indeed stuck in I/O, and not stuck in
> exchanging messages or anything like that.
> 
> Are you writing directly to Lustre, or are you writing to NFS-exported
> Lustre?
From the output of mount:

10.130.200.211@o2ib:10.130.200.212@o2ib:/lnec on /lustre/ws1 type lustre (rw)

I guess we are writing to Lustre directly.

> I think we need one more piece of information.  You said if you run on
> one big node, you can write very quickly.  Can you send the output of
> 'ncmpidump -h' or 'ncdump -h' on a completed dataset?
-------------------
netcdf BFM.080401O2o {
dimensions:
	time = 1 ;
	z = 72 ;
	y = 128 ;
	x = 362 ;
	x_a = 1 ;
	y_a = 1 ;
	z_a = 3 ;
variables:
	double nav_lon(y, x) ;
	double nav_lat(y, x) ;
	double nav_lev(z) ;
	double time(time) ;
	int time_step(time) ;
	double info(time, z_a, y_a, x_a) ;
	double TRBO2o(time, z, y, x) ;
	double TRNO2o(time, z, y, x) ;
}
---------------------------

I have problems only when writing TRNO2o or TRBO2o. The other variables 
(nav_*) write fine, also collectively.


> I also have one trick you might want to try: Are you familiar with
> MPI-IO "Info" objects?  When you create the file, you are (probably)
> passing in MPI_INFO_NULL.  If instead you set up your own info object,
> we can guide some of the choices the underlying MPI-IO implementation
> makes.  In this case, it sounds like some very poorly-performing
> choices have been made.
I should have mentioned that I already run with "romio_ds_write" set to 
"disable", since I had issues with file locking.

> 
> There are a few hint configurations you might want to try:
> Configuration #1:
> 
> - set "romio_cb_write" to "enable" -- on Lustre, this is almost always
>   the right choice. 
That fails with:

File locking failed in ADIOI_Set_lock(fd 20,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 26.
If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
ADIOI_Set_lock:: Function not implemented
ADIOI_Set_lock:offset 744060, length 524288


> 
> Configuration #2:
> 
> - set "romio_cb_write" to "disable" 
> - set "romio_ds_write" to "disable"
> 
> This configuration turns off all optimizations, but it also avoids
> costly file locks.  
Eureka! Works.
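
For reference, this is roughly how the two hints of configuration #2 can be 
passed through an MPI Info object at file creation (a minimal sketch; the 
file name and the missing error checking are just for illustration):

-------------------
#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv)
{
    int ncid;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "disable");  /* no collective buffering   */
    MPI_Info_set(info, "romio_ds_write", "disable");  /* no data sieving on writes */

    /* the info object is passed instead of MPI_INFO_NULL */
    ncmpi_create(MPI_COMM_WORLD, "output.nc", NC_CLOBBER, info, &ncid);

    /* ... define dimensions and variables, write data, as before ... */

    ncmpi_close(ncid);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
---------------------------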

Do I understand correctly that it might not be necessary to disable those 
optimizations on all systems/clusters? I would also like to understand why 
I need to disable them here in the first place. I am still worried that I 
am making a mistake when setting the start, count and imap parameters in 
the ncmpi_put_varm_*_all call.
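
To be concrete, the kind of call I mean looks roughly like the sketch below 
for one of the 4D tracers. The slab decomposition in y and the in-memory 
ordering (y, x, z), i.e. z varying fastest, are only an illustration; the 
imap vector is what maps that memory layout onto the file's (time, z, y, x) 
ordering.

-------------------
#include <mpi.h>
#include <pnetcdf.h>

/* Collectively write one time record of a 4D variable (time, z, y, x).
 * Each rank owns a slab of ny_local rows in y.  The local buffer is
 * assumed to be ordered (y, x, z), so z varies fastest in memory. */
void write_tracer(int ncid, int varid, int rank,
                  MPI_Offset nz, MPI_Offset ny_local, MPI_Offset nx,
                  const double *buf)   /* buf[ny_local][nx][nz] */
{
    MPI_Offset start[4], count[4], imap[4];

    start[0] = 0;                 /* first (only) time record */
    start[1] = 0;                 /* whole z range            */
    start[2] = rank * ny_local;   /* this rank's slab in y    */
    start[3] = 0;                 /* whole x range            */

    count[0] = 1;
    count[1] = nz;
    count[2] = ny_local;
    count[3] = nx;

    /* imap[i]: distance in memory (in elements) between values whose
     * index in file dimension i differs by one */
    imap[0] = nz * ny_local * nx; /* time (count is 1, so unused) */
    imap[1] = 1;                  /* z is fastest in memory       */
    imap[2] = nx * nz;            /* y                            */
    imap[3] = nz;                 /* x                            */

    /* stride == NULL means a unit stride in every dimension */
    ncmpi_put_varm_double_all(ncid, varid, start, count, NULL, imap, buf);
}
---------------------------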


In any case this is a big step forward! Thanks a lot!

Jose


-- 

Dr. Jose Gracia		email:  gracia at hlrs.de
HLRS, Uni Stuttgart	http://www.hlrs.de/people/gracia
Nobelstrasse 19		phone: +49 711 685 87208
70569 Stuttgart		fax:   +49 711 685 65832
Germany

