[petsc-users] MPICH error in KSPSolve

John Mousel john.mousel at gmail.com
Mon Jul 9 15:31:26 CDT 2012


Mark,

I just tried the following options on Kraken on 1200 cores:

-pres_ksp_type bcgsl -pres_pc_type gamg -pres_pc_gamg_type agg
-pres_pc_gamg_agg_nsmooths 1 -pres_pc_gamg_threshold 0.05
-pres_mg_levels_ksp_type richardson -pres_mg_levels_pc_type sor
-pres_mg_coarse_ksp_type richardson -pres_mg_coarse_pc_type sor
-pres_mg_coarse_pc_sor_its 4

It hung at

[0]PCSetData_AGG bs=1 MM=10672

for nearly 15 minutes. I take it this is not normal.

John



On Mon, Jul 9, 2012 at 2:41 PM, John Mousel <john.mousel at gmail.com> wrote:

> Can you clarify what you mean by null-space cleaning? I just run SOR on
> the coarse grid.
>
>
>
>
> On Mon, Jul 9, 2012 at 11:52 AM, Mark F. Adams <mark.adams at columbia.edu> wrote:
>
>>
>> On Jul 9, 2012, at 12:39 PM, John Mousel wrote:
>>
>> Mark,
>>
>> The problem is indeed non-symmetric. We went back and forth in March
>> about this problem. I think we ended up concluding that the coarse size
>> couldn't get too small or the null-space presented problems.
>>
>>
>> Oh, it's singular. I forget what the issues were, but an iterative coarse
>> grid solver should be fine for singular problems, perhaps with null space
>> cleaning if the kernel is sneaking in. Actually, there is an SVD coarse
>> grid solver:
>>
>> -mg_coarse_pc_type svd
>>
>> That is the most robust.
>>
>> When I did get it to work, I tried to scale it up, and on my local
>> university cluster, it seemed to just hang when the core counts got above
>> something like 16 cores. I don't really trust that machine though.
>>
>>
>> That's the machine.  GAMG does have some issues but I've not seen it hang.
>>
>> It's new and has been plagued by hardware incompatibility issues since
>> day 1. I could re-examine this on Kraken. Also, what option are you talking
>> about with ML? I thought I had tried all the -pc_ml_CoarsenScheme options,
>> but I could be wrong.
>>
>>
>> This sounds like the right one.  I try to be careful in my solvers to be
>> invariant to subdomain shapes and sizes and I think Ray Tuminaro (ML
>> developer) at least has options that should be careful about this also.
>> But I don't know much about what they are deploying these days.
>>
>> Mark
>>
>>
>> John
>>
>>
>>
>> On Mon, Jul 9, 2012 at 11:30 AM, Mark F. Adams <mark.adams at columbia.edu> wrote:
>>
>>> What problems are you having again with GAMG?  Are your problems
>>> unsymmetric?
>>>
>>> ML has several coarsening strategies available and I think the default
>>> does aggregation locally and does not aggregate across processor
>>> subdomains.  If you have poorly shaped domains then you want to use a
>>> global coarsening method (these are not expensive).
>>>
>>> Mark
>>>
>>> On Jul 9, 2012, at 12:17 PM, John Mousel wrote:
>>>
>>> Mark,
>>>
>>> I still haven't had much luck getting GAMG to work consistently for my
>>> Poisson problem. ML seems to work nicely on low core counts, but I have a
>>> problem where I can get long thin portions of grid on some processors
>>> instead of nice block-like chunks at high core counts, which leads to a
>>> pretty tough time for ML.
>>>
>>> John
>>>
>>> On Mon, Jul 9, 2012 at 10:58 AM, John Mousel <john.mousel at gmail.com> wrote:
>>>
>>>> Getting rid of the Hypre option seemed to be the trick.
>>>>
>>>> On Mon, Jul 9, 2012 at 10:40 AM, Mark F. Adams <mark.adams at columbia.edu> wrote:
>>>>
>>>>> Google PTL_NO_SPACE and you will find some NERSC presentations on how
>>>>> to go about fixing this.  (I ran into these problems years ago but
>>>>> forget the details.)
>>>>>
>>>>> Also, I would try running with a Jacobi solver to see if that fixes
>>>>> the problem.  If so then you might try
>>>>>
>>>>> -pc_type gamg
>>>>> -pc_gamg_agg_nsmooths 1
>>>>> -pc_gamg_type agg
>>>>>
>>>>> This is a built-in AMG solver so perhaps it plays nicer with resources
>>>>> ...
>>>>>
>>>>> Mark
>>>>>
>>>>> On Jul 9, 2012, at 10:57 AM, John Mousel wrote:
>>>>>
>>>>> > I'm running on Kraken and am currently working with 4320 cores. I get the following error in KSPSolve.
>>>>> >
>>>>> > [2711]: (/ptmp/ulib/mpt/nightly/5.3/120211/mpich2/src/mpid/cray/src/adi/ptldev.c:2046) PtlMEInsert failed with error : PTL_NO_SPACE
>>>>> > MHV_exe: /ptmp/ulib/mpt/nightly/5.3/120211/mpich2/src/mpid/cray/src/adi/ptldev.c:2046: MPIDI_CRAY_ptldev_desc_pkt: Assertion `0' failed.
>>>>> > forrtl: error (76): Abort trap signal
>>>>> > Image              PC                Routine            Line        Source
>>>>> > MHV_exe            00000000014758CB  Unknown               Unknown  Unknown
>>>>> > MHV_exe            000000000182ED43  Unknown               Unknown  Unknown
>>>>> > MHV_exe            0000000001829460  Unknown               Unknown  Unknown
>>>>> > MHV_exe            00000000017EDE3E  Unknown               Unknown  Unknown
>>>>> > MHV_exe            00000000017B3FE6  Unknown               Unknown  Unknown
>>>>> > MHV_exe            00000000017B3738  Unknown               Unknown  Unknown
>>>>> > MHV_exe            00000000017B2B12  Unknown               Unknown  Unknown
>>>>> > MHV_exe            00000000017B428F  Unknown               Unknown  Unknown
>>>>> > MHV_exe            000000000177FCE1  Unknown               Unknown  Unknown
>>>>> > MHV_exe            0000000001590A43  Unknown               Unknown  Unknown
>>>>> > MHV_exe            00000000014F909B  Unknown               Unknown  Unknown
>>>>> > MHV_exe            00000000014FF53B  Unknown               Unknown  Unknown
>>>>> > MHV_exe            00000000014A4E25  Unknown               Unknown  Unknown
>>>>> > MHV_exe            0000000001487D57  Unknown               Unknown  Unknown
>>>>> > MHV_exe            000000000147F726  Unknown               Unknown  Unknown
>>>>> > MHV_exe            000000000137A8D3  Unknown               Unknown  Unknown
>>>>> > MHV_exe            0000000000E97BF2  Unknown               Unknown  Unknown
>>>>> > MHV_exe            000000000098EAF1  Unknown               Unknown  Unknown
>>>>> > MHV_exe            0000000000989C20  Unknown               Unknown  Unknown
>>>>> > MHV_exe            000000000097A9C2  Unknown               Unknown  Unknown
>>>>> > MHV_exe            000000000082FF2D  axbsolve_                 539  PetscObjectsOperations.F90
>>>>> >
>>>>> > This is somewhere in KSPSolve. Is there an MPICH environment variable that needs tweaking? I couldn't really find much on this particular error.
>>>>> > The solver is BiCGStab with Hypre as a preconditioner.
>>>>> >
>>>>> > -ksp_type bcgsl -pc_type hypre -pc_hypre_type boomeramg -ksp_monitor
>>>>> >
>>>>> > Thanks,
>>>>> >
>>>>> > John
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>
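
The "null space cleaning" mentioned in the thread can be illustrated with a small sketch. This is not PETSc code and not the solver configuration used above; it is a NumPy toy under stated assumptions: a 1D periodic Laplacian (singular, with the constant vector as its kernel), a damped Jacobi iteration standing in for the coarse-grid smoother, and hypothetical helper names (periodic_laplacian, remove_constant, damped_jacobi). The right-hand side is deliberately given a small constant component so that the kernel "sneaks in" unless it is projected out after every sweep.

```python
import numpy as np

def periodic_laplacian(n):
    """Dense 1D periodic Laplacian; singular, kernel = constant vectors."""
    A = 2.0 * np.eye(n)
    for i in range(n):
        A[i, (i - 1) % n] -= 1.0
        A[i, (i + 1) % n] -= 1.0
    return A

def remove_constant(v):
    """Project out the constant (null-space) component of v."""
    return v - v.mean()

def damped_jacobi(A, b, iters, omega=0.6, clean=True):
    """Damped Jacobi sweeps; optionally clean the null space after each sweep."""
    x = np.zeros_like(b)
    d = np.diag(A)
    for _ in range(iters):
        x = x + omega * (b - A @ x) / d
        if clean:
            x = remove_constant(x)
    return x

n = 16
A = periodic_laplacian(n)
rng = np.random.default_rng(0)

# Slightly incompatible right-hand side: zero-mean part plus a small constant.
b_bad = remove_constant(rng.standard_normal(n)) + 0.01

x_clean = damped_jacobi(A, b_bad, iters=2000, clean=True)
x_drift = damped_jacobi(A, b_bad, iters=2000, clean=False)

# Residual measured against the compatible (projected) right-hand side.
res_clean = np.linalg.norm(remove_constant(b_bad) - A @ x_clean)

print("mean of iterate with cleaning:   ", x_clean.mean())
print("mean of iterate without cleaning:", x_drift.mean())
print("projected residual with cleaning:", res_clean)
```

Without the projection, the constant component of the iterate grows with every sweep; with it, the iterate stays orthogonal to the kernel and converges on the compatible part of the system. In PETSc itself the analogous facility is a MatNullSpace object attached to the operator or solver; the exact calls depend on the PETSc version, so check the manual pages for the release in use.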