[Nek5000-users] nek hangup in parallel
nek5000-users at lists.mcs.anl.gov
Fri Feb 24 11:04:39 CST 2012
Hi Mike,
Sorry that it's hanging for you.
I'm not aware of a general hang-up; we routinely run at these
core counts and beyond.
If you want to send the case off-list, we can try it on
our Linux cluster.
Paul
On Fri, 24 Feb 2012, nek5000-users at lists.mcs.anl.gov wrote:
> Hello All.
>
> I have a question regarding an observation that I have made several times with nek in parallel over several nodes. In short, it's this:
>
> Say I have a nek run (that works fine, but slowly, in serial):
> -- If I compile it via mpif77/mpicc and run it on 8 cores over ONE node, it runs fine.
> -- If I run it on 8 cores over TWO nodes (4 cores per node), it runs fine.
> -- If I run it on 16 cores over TWO nodes, it hangs; here is an example of what I see at the end of the log file of the hung job:
>
> gs_setup: 47118 unique labels shared
> pairwise times (avg, min, max): 0.369331 0.344326 0.39997
> crystal router : 0.297755 0.279986 0.319981
> all reduce : 3.00363 2.98819 3.02431
> used all_to_all method: crystal router
> setupds time 5.5196E+01 seconds 4 6 358599 5448
> setup h1 coarse grid, nx_crs= 2
> call usrsetvert
> done :: usrsetvert
>
> gs_setup: 1962 unique labels shared
> pairwise times (avg, min, max): 0.0988146 0.0853947 0.103889
> crystal router : 0.200801 0.187893 0.208487
> all reduce : 1.1251 1.11958 1.13993
> used all_to_all method: pairwise
>
> I've notified our cluster-admin group about this to see if they can isolate a local problem, but I wanted to ask the nek community (aka you guys) if anyone has seen something similar.
>
> Thanks.
> --Mike