[Nek5000-users] nek hangup in parallel
nek5000-users at lists.mcs.anl.gov
Fri Feb 24 10:55:49 CST 2012
Hello All.
I have a question about something I have seen several times when running nek in parallel across several nodes. In short, it's this:
Say I have a nek run that works fine, though slowly, in serial:
-- If I compile it with mpif77/mpicc and run it on 8 cores on ONE node, it runs fine.
-- If I run it on 8 cores over TWO nodes (4 cores per node), it runs fine.
-- If I run it on 16 cores over TWO nodes, it hangs; here is an example of what I see at the end of the log file of the hung job:
gs_setup: 47118 unique labels shared
pairwise times (avg, min, max): 0.369331 0.344326 0.39997
crystal router : 0.297755 0.279986 0.319981
all reduce : 3.00363 2.98819 3.02431
used all_to_all method: crystal router
setupds time 5.5196E+01 seconds 4 6 358599 5448
setup h1 coarse grid, nx_crs= 2
call usrsetvert
done :: usrsetvert
gs_setup: 1962 unique labels shared
pairwise times (avg, min, max): 0.0988146 0.0853947 0.103889
crystal router : 0.200801 0.187893 0.208487
all reduce : 1.1251 1.11958 1.13993
used all_to_all method: pairwise
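A minimal sketch for reading the log tail above: it groups the timing lines under each gs_setup call and checks which all_to_all method was reported as used. The function names (parse_gs_setups, fastest) are hypothetical helpers, not part of nek or gslib, and comparing the average times is only an assumption about how the method gets picked; it happens to agree with both "used all_to_all method" lines here.

```python
import re

# Tail of the hung job's log, copied from the message above.
LOG_TAIL = """\
gs_setup: 47118 unique labels shared
pairwise times (avg, min, max): 0.369331 0.344326 0.39997
crystal router : 0.297755 0.279986 0.319981
all reduce : 3.00363 2.98819 3.02431
used all_to_all method: crystal router
setupds time 5.5196E+01 seconds 4 6 358599 5448
setup h1 coarse grid, nx_crs= 2
call usrsetvert
done :: usrsetvert
gs_setup: 1962 unique labels shared
pairwise times (avg, min, max): 0.0988146 0.0853947 0.103889
crystal router : 0.200801 0.187893 0.208487
all reduce : 1.1251 1.11958 1.13993
used all_to_all method: pairwise
"""

# Matches the three timing lines; captures method name and (avg, min, max).
TIMING = re.compile(
    r"^(pairwise|crystal router|all reduce)\b[^:]*:"
    r"\s*([\d.eE+-]+)\s+([\d.eE+-]+)\s+([\d.eE+-]+)\s*$"
)


def parse_gs_setups(text):
    """Return one dict per gs_setup call: label count, avg times, method used."""
    setups = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("gs_setup:"):
            setups.append({"labels": int(line.split()[1]), "avg": {}, "used": None})
        elif not setups:
            continue  # ignore anything before the first gs_setup line
        elif line.startswith("used all_to_all method:"):
            setups[-1]["used"] = line.split(":", 1)[1].strip()
        else:
            match = TIMING.match(line)
            if match:
                setups[-1]["avg"][match.group(1)] = float(match.group(2))
    return setups


def fastest(setup):
    """Method with the smallest average time (assumed selection criterion)."""
    return min(setup["avg"], key=setup["avg"].get)


if __name__ == "__main__":
    for i, s in enumerate(parse_gs_setups(LOG_TAIL), start=1):
        print(f"gs_setup #{i}: {s['labels']} labels, "
              f"fastest avg = {fastest(s)}, used = {s['used']}")
```

Read this way, both gs_setup calls complete and report a method, so the last line written before the hang is the second "used all_to_all method" line, after the h1 coarse-grid setup has started.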
I've notified our cluster-admin group about this to see if they can isolate a local problem, but I wanted to ask the nek community (aka you guys) if anyone has seen something similar.
Thanks.
--Mike