[petsc-dev] problem with hypre with '--with-openmp=1'

Junchao Zhang jczhang at mcs.anl.gov
Tue Jun 26 14:44:26 CDT 2018


I did not set OMP_NUM_THREADS in my .bashrc or job script, and the job ran
out of time.
If I did export OMP_NUM_THREADS=1 in the job script on Cori, the job still
ran very slowly: it finished in 200 seconds, compared to about 1 second
without --with-openmp.
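
A minimal sketch of the kind of job script this discussion assumes, on a
Cori-like SLURM system; the node count, rank count, and binding flags below
are illustrative assumptions, not the exact settings used in the runs above:

#!/bin/bash
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:30:00

# Pin the OpenMP runtime so a --with-openmp build does not oversubscribe cores.
export OMP_NUM_THREADS=1
export OMP_PLACES=cores
export OMP_PROC_BIND=true

# 27 MPI ranks, 1 core (and 1 OpenMP thread) per rank.
srun -n 27 -c 1 --cpu-bind=cores ./ex56 -pc_type hypre -ksp_type cg \
     -pc_hypre_type boomeramg -ksp_monitor -ksp_converged_reason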

--Junchao Zhang

On Tue, Jun 26, 2018 at 12:05 PM, Balay, Satish <balay at mcs.anl.gov> wrote:

> I wonder if these jobs are scheduled in such a way that they are not
> oversubscribed.
>
> i.e. number_of_mpi_ranks_per_node * number_of_openmp_threads_per_rank <=
> number_of_cores_per_node
>
> Satish
>
> On Tue, 26 Jun 2018, Mark Adams wrote:
>
> > Interesting, I am seeing the same thing with ksp/ex56 (elasticity) with a
> > 30^3 grid on each process. One process runs fine (1.5 sec), but 8
> > processes with 30^3 on each process took 156 sec.
> >
> > And PETSc's -log_view is running extremely slowly. I have the total time
> > (156 sec), but each event takes a minute or more to come out.
> >
> > On Tue, Jun 26, 2018 at 10:13 AM Junchao Zhang <jczhang at mcs.anl.gov>
> > wrote:
> >
> > >
> > > On Tue, Jun 26, 2018 at 8:26 AM, Mark Adams <mfadams at lbl.gov> wrote:
> > >
> > >>
> > >>
> > >> On Tue, Jun 26, 2018 at 12:19 AM Junchao Zhang <jczhang at mcs.anl.gov>
> > >> wrote:
> > >>
> > >>> Mark,
> > >>>   Your email reminded me of my recent experiments. My PETSc was
> > >>> configured with --with-openmp=1. With hypre, my job ran out of time.
> > >>> That was on an Argonne Xeon cluster.
> > >>>
> > >>
> > >> Interesting. I tested on Cori's Haswell nodes and it looked fine. I
> > >> did not time it, but it seemed OK.
> > >>
> > >>
> > >>>   I repeated the experiments on Cori's Haswell nodes. With
> > >>> --with-openmp=1, "Linear solve converged due to CONVERGED_RTOL
> > >>> iterations 5", but it took a very long time (10 minutes). Without
> > >>> --with-openmp=1, it took less than 1 second.
> > >>>
> > >>
> > >> Hmm. I seemed to run OK on Cori's Haswell nodes. Were you running a
> > >> significant-sized job? I was testing small serial runs.
> > >>
> > >
> > >  I ran with 27 processes, each with 30^3 unknowns.
> > >
> > >>
> > >>
> > >>>
> > >>> --Junchao Zhang
> > >>>
> > >>> On Fri, Jun 22, 2018 at 3:33 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > >>>
> > >>>> We are using KNL (Cori), and hypre is not working when configured
> > >>>> with '--with-openmp=1', even though we are not using threads (as
> > >>>> far as I can tell, I never use threads).
> > >>>>
> > >>>> Hypre is not converging, for instance with an optimized build:
> > >>>>
> > >>>> srun -n 1 ./ex56 -pc_type hypre -ksp_monitor -ksp_converged_reason
> > >>>> -ksp_type cg -pc_hypre_type boomeramg
> > >>>> OMP: Warning #239: KMP_AFFINITY: granularity=fine will be used.
> > >>>>   0 KSP Residual norm 7.366251922394e+22
> > >>>>   1 KSP Residual norm 3.676434682799e+22
> > >>>> Linear solve did not converge due to DIVERGED_INDEFINITE_PC
> > >>>> iterations 2
> > >>>>
> > >>>> Interestingly, in debug mode it almost looks good, but it is still
> > >>>> dying:
> > >>>>
> > >>>> 05:09 nid02516 maint *=
> > >>>> ~/petsc_install/petsc/src/ksp/ksp/examples/tutorials$ make
> > >>>> PETSC_DIR=/global/homes/m/madams/petsc_install/petsc-cori-knl-dbg64-intel-omp
> > >>>> PETSC_ARCH="" run
> > >>>> srun -n 1 ./ex56 -pc_type hypre -ksp_monitor -ksp_converged_reason
> > >>>> -ksp_type cg -pc_hypre_type boomeramg
> > >>>> OMP: Warning #239: KMP_AFFINITY: granularity=fine will be used.
> > >>>>   0 KSP Residual norm 7.882081712007e+02
> > >>>>   1 KSP Residual norm 2.500214073037e+02
> > >>>>   2 KSP Residual norm 3.371746347713e+01
> > >>>>   3 KSP Residual norm 2.918759396143e+00
> > >>>>   4 KSP Residual norm 9.006505495017e-01
> > >>>> Linear solve did not converge due to DIVERGED_INDEFINITE_PC
> > >>>> iterations 5
> > >>>>
> > >>>> This test runs fine on Xeon nodes. I assume that hypre has been
> > >>>> tested on KNL. GAMG runs fine, of course, and the initial residual
> > >>>> is similar to the one in this debug run.
> > >>>>
> > >>>> Could PETSc be messing up the matrix conversion to hypre with
> > >>>> '--with-openmp=1'?
> > >>>>
> > >>>> Any ideas?
> > >>>>
> > >>>> Thanks,
> > >>>> Mark
> > >>>>
> > >>>>
> > >>>
> > >
> >
>
>
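
A minimal sketch of the oversubscription check Satish describes above,
assuming SLURM's standard per-job environment variables are available in the
batch script; the variable names and the hard-coded core count are
assumptions for illustration:

# Rough check that ranks_per_node * threads_per_rank <= cores_per_node.
ranks_per_node=${SLURM_NTASKS_PER_NODE:-1}
threads_per_rank=${OMP_NUM_THREADS:-1}
cores_per_node=32   # physical cores on a Cori Haswell node; KNL nodes have 68

if [ $(( ranks_per_node * threads_per_rank )) -gt "$cores_per_node" ]; then
  echo "Warning: oversubscribed: ${ranks_per_node} ranks x ${threads_per_rank} threads > ${cores_per_node} cores"
fi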