input/output errors from "make check"

Rob Latham robl at mcs.anl.gov
Fri May 29 09:08:53 CDT 2015



On 05/28/2015 10:20 AM, Wei-keng Liao wrote:
> Hi, Carl
>
> The error message "=>> PBS: job killed: walltime 3636 exceeded limit 3600"
> means that the job you submitted to the PBS queue was allocated 3600 seconds,
> ran longer than that limit, and was killed by the system.

Still, an hour to run the test seems really high.  On this system, do 
you have a fast storage system and a slower home file system?

On my 3-year-old laptop (so plenty of caching), 'time make check' reports:

make check  141.63s user 10.17s system 94% cpu 2:40.65 total

If you are hitting paths in ROMIO that require Lustre or NFS locks, 
that's going to slow things down a lot... but more than 20x slower?   I 
wonder what's going on here.
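
A rough lower bound on the actual slowdown can be worked out from the two 
timings in this thread: the 3636 s at which PBS killed the job, and the 
2:40.65 (160.65 s) laptop run. This is only a sketch; the helper name 
`slowdown` is made up for illustration, and since the job never finished, 
the true ratio is larger than what this prints.

```shell
# Ratio of two wall-clock times, printed to one decimal place.
# $1 = seconds on the slow system, $2 = seconds on the reference system
slowdown() {
    awk -v slow="$1" -v ref="$2" 'BEGIN { printf "%.1f\n", slow / ref }'
}

# Killed PBS job (3636 s) vs. laptop "make check" (2:40.65 = 160.65 s).
slowdown 3636 160.65
```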

Wei-keng is thinking specifically of the situation where one would run 
the pnetcdf tests out of the Blue Gene home directory, which has very 
low performance in addition to having all system calls relayed through 
I/O nodes.

Building on the faster parallel file system (still GPFS, but with more 
servers) takes a long time -- close to 3 hours!  Building on the home 
file system takes 13 minutes.

GPFS, at least the one on our Blue Gene, takes a very long time to link. 
Perhaps all you will need to do, if you are using GPFS, is build the 
tests first before running them.
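
Building first could look something like the sketch below. It assumes the 
test programs live in the `test` and `examples` directories of the build 
tree (as in Wei-keng's commands quoted further down); the function name 
`prebuild_tests` is hypothetical, not something PnetCDF provides.

```shell
# Build the test and example programs ahead of time (e.g. on a login node),
# so that the batch job only has to run them, not compile and link them.
# $1 = top of the PnetCDF build tree (defaults to the current directory)
prebuild_tests() {
    top=${1:-.}
    for dir in test examples; do
        # The subshell keeps the caller's working directory unchanged.
        (cd "$top/$dir" && make) || return 1
    done
}
```

Calling `prebuild_tests` from the top of the build tree before submitting 
the job, and asking for a longer limit in the job script (for example 
`#PBS -l walltime=02:00:00`, the 2 hours Wei-keng suggests), should leave 
only the test runs themselves inside the allocation.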

That's all I can think of at the moment.

==rob


>
> Each of nc_test, nf_test, and nf90_test performs thousands of small writes to test PnetCDF.
> On some systems, especially those whose storage systems are separate from the compute systems, e.g. IBM BG machines, these tests will take a long time. Please allocate more time for your job, say 2 hours.
>
> Also, you can compile all the test programs before running the tests, as follows:
> cd test
> make
> cd ../examples
> make
> cd ..
>
> This way your job will run with all test executables already built.
>
>
> Wei-keng
>
> On May 28, 2015, at 2:00 AM, Carl Ponder wrote:
>
>> On 05/27/2015 06:41 PM, Wei-keng Liao wrote:
>>> However, you should see the same error messages when testing C programs, i.e. nc_test. Is it not the case for you?
>> Ok -- I did get the same errors in the C part of the testing.
>>> As for the issue you encountered when using PGI compilers, can you send me the file config.log and show me the standard output on screen where the program hangs?
>>> A successful run of "make check" should look like below. Are you saying it does not show this first line when entering directory nf90_test?
>>> *** TESTING F90 ./nf90_test for CDF-1                    ------ pass
>> Here's the point where it hung
>> /shared/apps/centos-6.6_SB/OpenMPI/1.8.5/PGI-15.5_CUDA-7.0_HWLoc-1.10.1_NUMACtl-2.0.9/bin/mpif90 -fPIC -m64 -tp=px -o nf90_test fortlib.o nf90_error.o nf90_test.o test_read.o test_write.o util.o test_get.o test_put.o test_iget.o test_iput.o  /shared/apps/centos-6.6_SB/PNetCDF/1.6.0/OpenMPI-1.8.5_PGI-15.5_CUDA-7.0/distro/src/lib/libpnetcdf.a
>> rm -f ./scratch.nc ./test.nc
>> ./nf90_test -c    -d .
>> ./nf90_test       -d .
>> =>> PBS: job killed: walltime 3636 exceeded limit 3600
>> so it finished the F77 tests and had just built the F90 tests.
>> This looks like an issue with the PGI 15.5 compiler; I'd like to be able to reproduce the hang if I can.
>> The config.log is attached here.
>> Thanks,
>>
>>                  Carl
>> <config.log>
>

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

