[Nek5000-users] IO problem

nek5000-users at lists.mcs.anl.gov
Mon Nov 24 21:44:13 CST 2014


Hello all,

I have occasionally run into an I/O issue when running on a relatively
large number of nodes on our BG/Q cluster with MPIIO enabled and
p65=0. The job status shows it as running normally, but no output
files, not even the .log file, are written to the file system. What is
strange is that the issue is intermittent and has only appeared for jobs
with more than 512 nodes; I have not seen it for smaller problems or
allocations. Usually, cancelling and resubmitting the job solves the
issue.

I contacted the technical staff of our cluster, and they ran several
tests but have not yet been able to reproduce the problem at small
scales. Since it does not happen every time, a hardware issue seems
more likely than a code issue. However, I wanted to ask whether anyone
has comments or has had a similar experience.

Thanks
Mohsen


