pNetCDF problem with WRF on large core counts

Michalakes, John John.Michalakes at nrel.gov
Mon Aug 5 07:38:56 CDT 2013


Hi,

We're running into problems running WRF with pNetCDF, and it may have something to do with which MPI implementation we use.  Both Intel MPI and OpenMPI fail (processes hang) on MPI task counts greater than 256.  Mvapich2 works, however.  This is using the Lustre file system on a Sandy Bridge Linux cluster here.  Are you aware of any task limits associated with MPI-IO in these implementations that might be causing the problem?  Any ideas for reconfiguring?  There's a little more information in the email thread below.

Thanks,

John

 John Michalakes
 Computational Sciences Center
 National Renewable Energy Laboratory
15013 Denver West Pkwy,  ESIF301
Golden CO 80401
Phone: 303-275-4297, Fax: 303-275-4007
john.michalakes at nrel.gov
http://www.nrel.gov/csc/staff_michalakes.html


-----Original Message-----
From: Regimbal, Kevin 
Sent: Sunday, August 04, 2013 1:33 PM
To: Michalakes, John

Hi John,

I've been playing with parallel-netcdf this weekend.  As far as I can tell, pnetcdf does not work at large core counts (e.g. 4096) with either Intel MPI or OpenMPI.  It does work with mvapich2 at 4096 cores.  I added a build for mvapich2 and a pnetcdf that ties to intel & mvapich2.

It's probably going to take a while to track down why large core counts don't work.  Not sure if the issue is pnetcdf or MPI-IO on the other MPIs.

Kevin
________________________________________
From: Michalakes, John
Sent: Thursday, August 01, 2013 3:42 PM
Cc: Regimbal, Kevin

[...]

Regarding pNetCDF, this time the executable just hung reading the first input file.  I then did another run and made sure to put an "lfs setstripe -c 4 ." command in the runscript.  It hung again, but this time at least one of the tasks output this strange message before hanging:

n0224:c7f1:167b2700: 82084370 us(82084370 us!!!):  CONN_RTU read: sk 77 ERR 0x68, rcnt=-1, v=7 -> 172.20.5.34 PORT L-bac4 R-e1c6 PID L-0 R-0

[...]

I tried another run, this time copying the input data into the directory instead of accessing it via a symlink.  Then I saw this:

n0240:c9e5:50cde700: 64137185 us(64137185 us!!!):  CONN_REQUEST: SOCKOPT ERR No route to host -> 172.20.3.55 53644 - ABORTING 5
n0240:c9e5:50cde700: 64137227 us(42 us): dapl_evd_conn_cb() unknown event 0x0
n0240:c9e5:50cde700: 64173162 us(35935 us):  CONN_REQUEST: SOCKOPT ERR No route to host -> 172.20.3.56 53631 - ABORTING 5
n0240:c9e5:50cde700: 64173192 us(30 us): dapl_evd_conn_cb() unknown event 0x0
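
When a large run produces many messages like these, it can help to tally which peers each node failed to reach before deciding whether the problem lies in the fabric or in MPI-IO. The following is only a rough sketch: the regular expression is an assumption inferred from the four lines quoted above, not a documented DAPL log format, and the function name unreachable_peers is hypothetical.

```python
import re
from collections import Counter

# ASSUMPTION: this pattern is inferred solely from the log lines quoted in
# this thread; it is not taken from any DAPL documentation.
NO_ROUTE = re.compile(
    r"^(?P<node>\w+):\S+\s+\d+ us\(.*?\):\s+CONN_REQUEST: SOCKOPT ERR "
    r"No route to host -> (?P<peer>\d+\.\d+\.\d+\.\d+) (?P<port>\d+)"
)


def unreachable_peers(lines):
    """Count 'No route to host' failures per (reporting node, peer IP)."""
    counts = Counter()
    for line in lines:
        match = NO_ROUTE.match(line.strip())
        if match:
            counts[(match.group("node"), match.group("peer"))] += 1
    return counts


# Sample input: lines copied from the message above.
SAMPLE = """\
n0240:c9e5:50cde700: 64137185 us(64137185 us!!!):  CONN_REQUEST: SOCKOPT ERR No route to host -> 172.20.3.55 53644 - ABORTING 5
n0240:c9e5:50cde700: 64137227 us(42 us): dapl_evd_conn_cb() unknown event 0x0
n0240:c9e5:50cde700: 64173162 us(35935 us):  CONN_REQUEST: SOCKOPT ERR No route to host -> 172.20.3.56 53631 - ABORTING 5
"""

if __name__ == "__main__":
    for (node, peer), n in sorted(unreachable_peers(SAMPLE.splitlines()).items()):
        print(f"{node} could not reach {peer} ({n} time(s))")
```

Run over the combined stderr of a job, the resulting counts would show whether the failures cluster on a few peer addresses or are spread across the whole allocation.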


