[Nek5000-users] nek hangup in parallel

Fri Feb 24 11:04:39 CST 2012

Hi Mike,

Sorry that it's hanging for you.

I don't know of any general hang-up; we routinely run at these
core counts and beyond.

If you want to send the case off-list, we can try it on
our Linux cluster.

Paul

On Fri, 24 Feb 2012, nek5000-users at lists.mcs.anl.gov wrote:

> Hello All.
>
> I have a question regarding an observation that I have made several times with nek in parallel over several nodes.  In short, it's this:
>
> Say I have a nek run (one that works fine, but slowly, in serial):
> -- If I compile it with mpif77/mpicc and run it on 8 cores on ONE node, it runs fine.
> -- If I run it on 8 cores over TWO nodes (4 cores per node), it runs fine.
> -- If I run it on 16 cores over TWO nodes, it hangs. Here is an example of what I see at the end of the log file of the hung job:
>
> gs_setup: 47118 unique labels shared
>   pairwise times (avg, min, max): 0.369331 0.344326 0.39997
>   crystal router                : 0.297755 0.279986 0.319981
>   all reduce                    : 3.00363 2.98819 3.02431
>   used all_to_all method: crystal router
>   setupds time 5.5196E+01 seconds   4  6      358599        5448
> setup h1 coarse grid, nx_crs=           2
> call usrsetvert
> done :: usrsetvert
>
> gs_setup: 1962 unique labels shared
>   pairwise times (avg, min, max): 0.0988146 0.0853947 0.103889
>   crystal router                : 0.200801 0.187893 0.208487
>   all reduce                    : 1.1251 1.11958 1.13993
>   used all_to_all method: pairwise
>
> I've notified our cluster-admin group about this to see if they can isolate a local problem, but I also wanted to ask the nek community (aka you guys) whether anyone has seen something similar.
>
> Thanks.
> --Mike