[Nek5000-users] nek hangup in parallel

nek5000-users at lists.mcs.anl.gov
Fri Feb 24 10:55:49 CST 2012


Hello All.

I have a question about something I have seen several times when running nek in parallel across several nodes. In short:

Say I have a nek run (it works fine, but slowly, in serial):
-- If I compile it with mpif77/mpicc and run it on 8 cores on ONE node, it runs fine.
-- If I run it on 8 cores across TWO nodes (4 cores per node), it runs fine.
-- If I run it on 16 cores across TWO nodes, it hangs. Here is an example of what I see at the end of the log file of the hung job:

gs_setup: 47118 unique labels shared
   pairwise times (avg, min, max): 0.369331 0.344326 0.39997
   crystal router                : 0.297755 0.279986 0.319981
   all reduce                    : 3.00363 2.98819 3.02431
   used all_to_all method: crystal router
   setupds time 5.5196E+01 seconds   4  6      358599        5448
 setup h1 coarse grid, nx_crs=           2
 call usrsetvert
 done :: usrsetvert

gs_setup: 1962 unique labels shared
   pairwise times (avg, min, max): 0.0988146 0.0853947 0.103889
   crystal router                : 0.200801 0.187893 0.208487
   all reduce                    : 1.1251 1.11958 1.13993
   used all_to_all method: pairwise
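
As an aside for anyone comparing such logs: below is a minimal sketch (Python, not part of nek or gslib; the function name parse_gs_setup and the dictionary layout are made up for illustration) that pulls the timings out of gs_setup blocks like the two above and shows which method had the lowest average time. For these two blocks the lowest average matches the method reported as used; that is only an observation about this log, not a statement of how the selection works internally.

```python
import re

# Hypothetical helper, not part of Nek5000 or gslib. It only reads log text
# in the format pasted above.
SETUP = re.compile(r"^\s*gs_setup:\s*(\d+) unique labels shared")
TIMING = re.compile(
    r"^\s*(pairwise times \(avg, min, max\)|crystal router|all reduce)\s*:\s*"
    r"([\d.eE+-]+)\s+([\d.eE+-]+)\s+([\d.eE+-]+)\s*$"
)
USED = re.compile(r"^\s*used all_to_all method:\s*(.+?)\s*$")

# The two gs_setup blocks from the hung job's log, copied verbatim.
SAMPLE = """\
gs_setup: 47118 unique labels shared
   pairwise times (avg, min, max): 0.369331 0.344326 0.39997
   crystal router                : 0.297755 0.279986 0.319981
   all reduce                    : 3.00363 2.98819 3.02431
   used all_to_all method: crystal router
   setupds time 5.5196E+01 seconds   4  6      358599        5448
 setup h1 coarse grid, nx_crs=           2
 call usrsetvert
 done :: usrsetvert

gs_setup: 1962 unique labels shared
   pairwise times (avg, min, max): 0.0988146 0.0853947 0.103889
   crystal router                : 0.200801 0.187893 0.208487
   all reduce                    : 1.1251 1.11958 1.13993
   used all_to_all method: pairwise
"""


def parse_gs_setup(log_text):
    """Return one dict per gs_setup block: label count, avg times, method used."""
    blocks = []
    for line in log_text.splitlines():
        m = SETUP.match(line)
        if m:
            blocks.append({"labels": int(m.group(1)), "avg": {}, "used": None})
            continue
        if not blocks:
            continue  # ignore anything before the first gs_setup line
        m = TIMING.match(line)
        if m:
            # "pairwise times (avg, min, max)" -> "pairwise"; others unchanged
            name = m.group(1).split(" times")[0]
            blocks[-1]["avg"][name] = float(m.group(2))
            continue
        m = USED.match(line)
        if m:
            blocks[-1]["used"] = m.group(1)
    return blocks


blocks = parse_gs_setup(SAMPLE)
for b in blocks:
    fastest = min(b["avg"], key=b["avg"].get)
    print(b["labels"], "labels; lowest avg:", fastest, "; used:", b["used"])
```

To run it on a real log, replace SAMPLE with the contents of the log file. A block whose "used" entry is still None would mean the log stopped before the method was reported.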

I've notified our cluster-admin group about this to see if they can isolate a local problem, but I wanted to ask the nek community (aka you guys) if anyone has seen something similar.

Thanks.
--Mike



