[petsc-users] MPICH error in KSPSolve

Mark F. Adams mark.adams at columbia.edu
Mon Jul 9 18:42:58 CDT 2012


On Jul 9, 2012, at 4:31 PM, John Mousel wrote:

> Mark,
> 
> I just tried the following options on Kraken on 1200 cores:
> 
> -pres_ksp_type bcgsl -pres_pc_type gamg -pres_pc_gamg_type agg -pres_pc_gamg_agg_nsmooths 1 -pres_pc_gamg_threshold 0.05 -pres_mg_levels_ksp_type richardson -pres_mg_levels_pc_type sor -pres_mg_coarse_ksp_type richardson -pres_mg_coarse_pc_type sor -pres_mg_coarse_pc_sor_its 4
> 
> It hung at 
> 
> [0]PCSetData_AGG bs=1 MM=10672

Hmm, I don't see that print statement in my code anymore.  What version of PETSc are you using?

This is/was at the very beginning of the code.  Are you using '-pc_gamg_sym_graph true'?  This has been the default in some versions, so if you have not set it explicitly, that does not mean you are not using it.

You should use this parameter.  In my experience the code gives an internal error and exits gracefully if this is wrong, but it could manifest itself as a hang (the parallel graph algorithms get confused if they do not have a symmetric graph).

Are your problems structurally unsymmetric?  If not, then you can use '-pc_gamg_threshold -1.' and it should work even without '-pc_gamg_sym_graph true'.
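
For reference, "structurally unsymmetric" means the nonzero pattern itself is unsymmetric, independent of the stored values.  A minimal sketch of that check in plain Python follows; the function name and the two small CSR patterns are made up for illustration and are not part of PETSc.

```python
# Hypothetical helper (not a PETSc API): test whether a sparsity pattern is
# structurally symmetric, i.e. (j, i) is stored whenever (i, j) is stored.
def is_structurally_symmetric(n, row_ptr, col_idx):
    """CSR pattern check; numerical values are deliberately ignored."""
    entries = set()
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            entries.add((i, col_idx[k]))
    return all((j, i) in entries for (i, j) in entries)


# 3x3 one-sided (upwind-style) pattern: row 1 couples to column 0, but
# row 0 has no entry in column 1, so the pattern is unsymmetric.
upwind_ptr = [0, 1, 3, 5]
upwind_idx = [0, 0, 1, 1, 2]

# 3x3 tridiagonal (central-difference) pattern: symmetric.
central_ptr = [0, 2, 5, 7]
central_idx = [0, 1, 0, 1, 2, 1, 2]

print(is_structurally_symmetric(3, upwind_ptr, upwind_idx))
print(is_structurally_symmetric(3, central_ptr, central_idx))
```

On a real parallel matrix the same idea would have to be applied to the assembled global pattern, which this serial sketch does not attempt.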

Unfortunately this uses MatTranspose which is very slow for some reason.  I saw it take about 1 minute with 8,000 vertices per processor on a Cray XE6 recently.  So if you are running with very large processor subdomains this could explain the 15 minutes.  We need to fix this soon.

Mark

> 
> for nearly 15 minutes. I take it this is not normal.
> 
> John
> 
> 
> 
> On Mon, Jul 9, 2012 at 2:41 PM, John Mousel <john.mousel at gmail.com> wrote:
> Can you clarify what you mean by null-space cleaning? I just run SOR on the coarse grid.
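
As a rough illustration of the term, and assuming the kernel is the constant vector (the usual case for a pure-Neumann Poisson operator; a non-symmetric operator may need more care), null-space cleaning amounts to projecting that component out of a vector.  The function name below is hypothetical; in PETSc itself this role is played by MatNullSpace objects rather than hand-written code.

```python
# Hypothetical sketch: remove the constant (kernel) component from a vector
# by subtracting its mean, so the result is orthogonal to the vector of ones.
def remove_constant_component(x):
    mean = sum(x) / len(x)
    return [xi - mean for xi in x]


x = [3.0, 5.0, 4.0]
cleaned = remove_constant_component(x)
print(cleaned)        # entries now sum to zero
print(sum(cleaned))
```

Applied after each coarse-grid solve, a projection of this kind keeps an iterative solver such as SOR from drifting along the kernel of a singular operator.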
> 
> 
> 
> 
> On Mon, Jul 9, 2012 at 11:52 AM, Mark F. Adams <mark.adams at columbia.edu> wrote:
> 
> On Jul 9, 2012, at 12:39 PM, John Mousel wrote:
> 
>> Mark,
>> 
>> The problem is indeed non-symmetric. We went back and forth in March about this problem. I think we ended up concluding that the coarse size couldn't get too small or the null-space presented problems.
> 
> Oh, it's singular.  I forget what the issues were, but an iterative coarse grid solver should be fine for singular problems, perhaps with null space cleaning if the kernel is sneaking in.   Actually there is an SVD coarse grid solver:
> 
> -mg_coarse_pc_type svd
> 
> That is the most robust.
> 
>> When I did get it to work, I tried to scale it up, and on my local university cluster, it seemed to just hang when the core counts got above something like 16 cores. I don't really trust that machine though.
> 
> That's the machine.  GAMG does have some issues but I've not seen it hang.
> 
>> It's new and has been plagued by hardware incompatibility issues since day 1. I could re-examine this on Kraken. Also, what option are you talking about with ML? I thought I had tried all the -pc_ml_CoarsenScheme options, but I could be wrong.
> 
> This sounds like the right one.  I try to be careful in my solvers to be invariant to subdomain shapes and sizes and I think Ray Tuminaro (ML developer) at least has options that should be careful about this also.  But I don't know much about what they are deploying these days.
> 
> Mark
> 
>> 
>> John
>> 
>>  
>> 
>> On Mon, Jul 9, 2012 at 11:30 AM, Mark F. Adams <mark.adams at columbia.edu> wrote:
>> What problems are you having again with GAMG?  Are your problems unsymmetric?
>> 
>> ML has several coarsening strategies available and I think the default does aggregation locally and does not aggregate across processor subdomains.  If you have poorly shaped domains then you want to use a global coarsening method (these are not expensive).
>> 
>> Mark
>> 
>> On Jul 9, 2012, at 12:17 PM, John Mousel wrote:
>> 
>>> Mark,
>>> 
>>> I still haven't had much luck getting GAMG to work consistently for my Poisson problem. ML seems to work nicely on low core counts, but at high core counts I can get long, thin portions of the grid on some processors instead of nice block-like chunks, which gives ML a pretty tough time.
>>> 
>>> John
>>> 
>>> On Mon, Jul 9, 2012 at 10:58 AM, John Mousel <john.mousel at gmail.com> wrote:
>>> Getting rid of the Hypre option seemed to be the trick. 
>>> 
>>> On Mon, Jul 9, 2012 at 10:40 AM, Mark F. Adams <mark.adams at columbia.edu> wrote:
>>> Google PTL_NO_SPACE and you will find some NERSC presentations on how to go about fixing this.  (I ran into these problems years ago but forget the details.)
>>> 
>>> Also, I would try running with a Jacobi solver to see if that fixes the problem.  If so then you might try
>>> 
>>> -pc_type gamg
>>> -pc_gamg_agg_nsmooths 1
>>> -pc_gamg_type agg
>>> 
>>> This is a built in AMG solver so perhaps it plays nicer with resources ...
>>> 
>>> Mark
>>> 
>>> On Jul 9, 2012, at 10:57 AM, John Mousel wrote:
>>> 
>>> > I'm running on Kraken and am currently working with 4320 cores. I get the following error in KSPSolve.
>>> >
>>> > [2711]: (/ptmp/ulib/mpt/nightly/5.3/120211/mpich2/src/mpid/cray/src/adi/ptldev.c:2046) PtlMEInsert failed with error : PTL_NO_SPACE
>>> > MHV_exe: /ptmp/ulib/mpt/nightly/5.3/120211/mpich2/src/mpid/cray/src/adi/ptldev.c:2046: MPIDI_CRAY_ptldev_desc_pkt: Assertion `0' failed.
>>> > forrtl: error (76): Abort trap signal
>>> > Image              PC                Routine            Line        Source
>>> > MHV_exe            00000000014758CB  Unknown               Unknown  Unknown
>>> > MHV_exe            000000000182ED43  Unknown               Unknown  Unknown
>>> > MHV_exe            0000000001829460  Unknown               Unknown  Unknown
>>> > MHV_exe            00000000017EDE3E  Unknown               Unknown  Unknown
>>> > MHV_exe            00000000017B3FE6  Unknown               Unknown  Unknown
>>> > MHV_exe            00000000017B3738  Unknown               Unknown  Unknown
>>> > MHV_exe            00000000017B2B12  Unknown               Unknown  Unknown
>>> > MHV_exe            00000000017B428F  Unknown               Unknown  Unknown
>>> > MHV_exe            000000000177FCE1  Unknown               Unknown  Unknown
>>> > MHV_exe            0000000001590A43  Unknown               Unknown  Unknown
>>> > MHV_exe            00000000014F909B  Unknown               Unknown  Unknown
>>> > MHV_exe            00000000014FF53B  Unknown               Unknown  Unknown
>>> > MHV_exe            00000000014A4E25  Unknown               Unknown  Unknown
>>> > MHV_exe            0000000001487D57  Unknown               Unknown  Unknown
>>> > MHV_exe            000000000147F726  Unknown               Unknown  Unknown
>>> > MHV_exe            000000000137A8D3  Unknown               Unknown  Unknown
>>> > MHV_exe            0000000000E97BF2  Unknown               Unknown  Unknown
>>> > MHV_exe            000000000098EAF1  Unknown               Unknown  Unknown
>>> > MHV_exe            0000000000989C20  Unknown               Unknown  Unknown
>>> > MHV_exe            000000000097A9C2  Unknown               Unknown  Unknown
>>> > MHV_exe            000000000082FF2D  axbsolve_                 539  PetscObjectsOperations.F90
>>> >
>>> > This is somewhere in KSPSolve. Is there an MPICH environment variable that needs tweaking? I couldn't really find much on this particular error.
>>> > The solver is BiCGStab with Hypre as a preconditioner.
>>> >
>>> > -ksp_type bcgsl -pc_type hypre -pc_hypre_type boomeramg -ksp_monitor
>>> >
>>> > Thanks,
>>> >
>>> > John
>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> 
