[petsc-users] generate entries on 'wrong' process (Barry Smith)

Barry Smith bsmith at mcs.anl.gov
Wed Jan 18 15:05:56 CST 2012


On Jan 18, 2012, at 2:22 PM, Wen Jiang wrote:

> Hi Barry,
> 
> The symptom of "just got stuck" is simply that the code just stays there and never moves on. One more thing is that all the processes are at 99% cpu utilization. I do see some network traffic between the head node and computation nodes. The quantity is very small, but the sheer number of packets is huge. The processes are sending between 550 and 620 Million packets per second across the network. 
> 
> Since my code never finishes, I cannot get the summary file by adding -log_summary. Is there any other way to get the summary file?

   My guess is that you are running a larger problem on this system and your preallocation for the matrix is wrong, while in the small run you sent the preallocation is correct. 

   Usually the thing that causes assembly to take forever is not the parallel communication but the preallocation. After you create the matrix and set its preallocation, call 
MatSetOption(mat, MAT_NEW_NONZERO_ALLOCATION_ERR, PETSC_TRUE);  then run. It will stop with an error message if the preallocation is wrong.
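
   For readers checking their own preallocation, here is a minimal sketch in plain C (no PETSc calls, so it compiles on its own) of counting the distinct nonzero columns per locally owned row from element connectivity. The function name count_row_nonzeros, the Pair struct, one unknown per node, and a contiguous ownership range [rstart, rend) are all illustrative assumptions, not taken from the original code. It also assumes the caller passes every element that touches an owned row, including elements assembled on other processes.

```c
#include <assert.h>
#include <stdlib.h>

/* One (row, column) coupling produced by an element (hypothetical helper type). */
typedef struct { int row, col; } Pair;

/* Order pairs by row, then by column, so duplicates become adjacent. */
static int cmp_pair(const void *a, const void *b)
{
  const Pair *p = (const Pair *)a;
  const Pair *q = (const Pair *)b;
  if (p->row != q->row) return (p->row < q->row) ? -1 : 1;
  if (p->col != q->col) return (p->col < q->col) ? -1 : 1;
  return 0;
}

/*
 * conn    : nelem*npe global node indices, npe nodes per element
 * rstart, rend : this process owns rows rstart .. rend-1
 * d_nnz[i]: distinct columns of row rstart+i inside [rstart, rend)
 * o_nnz[i]: distinct columns of row rstart+i outside that range
 * Returns 0 on success, -1 if memory allocation fails.
 */
int count_row_nonzeros(const int *conn, int nelem, int npe,
                       int rstart, int rend, int *d_nnz, int *o_nnz)
{
  size_t cap, n = 0, k;
  Pair *pairs;
  int e, a, b, i;

  assert(nelem >= 0 && npe > 0 && rstart <= rend);
  for (i = 0; i < rend - rstart; i++) d_nnz[i] = o_nnz[i] = 0;

  cap = (size_t)nelem * (size_t)npe * (size_t)npe;
  if (cap == 0) return 0;
  pairs = malloc(cap * sizeof *pairs);
  if (!pairs) return -1;

  /* Every node of an element couples to every node of the same element. */
  for (e = 0; e < nelem; e++) {
    for (a = 0; a < npe; a++) {
      int row = conn[e * npe + a];
      if (row < rstart || row >= rend) continue; /* row owned elsewhere */
      for (b = 0; b < npe; b++) {
        pairs[n].row = row;
        pairs[n].col = conn[e * npe + b];
        n++;
      }
    }
  }

  /* Sort, skip duplicates, and classify each distinct column. */
  qsort(pairs, n, sizeof *pairs, cmp_pair);
  for (k = 0; k < n; k++) {
    if (k > 0 && pairs[k].row == pairs[k - 1].row &&
        pairs[k].col == pairs[k - 1].col) continue;
    if (pairs[k].col >= rstart && pairs[k].col < rend)
      d_nnz[pairs[k].row - rstart]++;
    else
      o_nnz[pairs[k].row - rstart]++;
  }
  free(pairs);
  return 0;
}
```

   In a PETSc code the two arrays (converted to PetscInt) would be passed as MatMPIAIJSetPreallocation(mat, 0, d_nnz, 0, o_nnz), followed by the MatSetOption call above. The pair list needs memory proportional to nelem*npe*npe, which is large for 64-node elements; a per-row set would be leaner.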

   Barry



> 
> BTW, my code runs without any problem on a shared-memory desktop with any number of processes. 
> 
> On Wed, Jan 18, 2012 at 3:03 PM, <petsc-users-request at mcs.anl.gov> wrote:
> 
> Today's Topics:
> 
>   1. Re:  generate entries on 'wrong' process (Barry Smith)
>   2. Re:  [petsc-dev] boomerAmg scalability (Ravi Kannan)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Wed, 18 Jan 2012 12:56:10 -0600
> From: Barry Smith <bsmith at mcs.anl.gov>
> Subject: Re: [petsc-users] generate entries on 'wrong' process
> To: PETSc users list <petsc-users at mcs.anl.gov>
> Message-ID: <47754349-9741-4740-BBB4-F4B84EA07CEF at mcs.anl.gov>
> Content-Type: text/plain; charset=us-ascii
> 
> 
>   What is the symptom of "just got stuck"?   Send the results of the whole run with -log_summary to petsc-maint at mcs.anl.gov and we'll see how much time is in that communication.
> 
>    Barry
> 
> 
> On Jan 18, 2012, at 10:32 AM, Wen Jiang wrote:
> 
> > Hi,
> >
> > I am working on FEM code with spline-based elements. In the 3D case, one element has 64 nodes and every two neighboring elements share 48 nodes. Thus, regardless of how I partition the mesh, a very large number of entries still have to be written on the 'wrong' processor. My code is running on a cluster, and the processes are sending between 550 and 620 million packets per second across the network. My code seems IO-bound at the moment and just gets stuck at the matrix assembly stage. A -info file is attached. Do I have other options to make my code less IO-intensive?
> >
> > Thanks in advance.
> >
> > [0] VecAssemblyBegin_MPI(): Stash has 210720 entries, uses 12 mallocs.
> > [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
> > [5] MatAssemblyBegin_MPIAIJ(): Stash has 4806656 entries, uses 8 mallocs.
> > [6] MatAssemblyBegin_MPIAIJ(): Stash has 5727744 entries, uses 9 mallocs.
> > [4] MatAssemblyBegin_MPIAIJ(): Stash has 5964288 entries, uses 9 mallocs.
> > [7] MatAssemblyBegin_MPIAIJ(): Stash has 7408128 entries, uses 9 mallocs.
> > [3] MatAssemblyBegin_MPIAIJ(): Stash has 8123904 entries, uses 9 mallocs.
> > [2] MatAssemblyBegin_MPIAIJ(): Stash has 11544576 entries, uses 10 mallocs.
> > [0] MatStashScatterBegin_Private(): No of messages: 1
> > [0] MatStashScatterBegin_Private(): Mesg_to: 1: size: 107888648
> > [0] MatAssemblyBegin_MPIAIJ(): Stash has 13486080 entries, uses 10 mallocs.
> > [1] MatAssemblyBegin_MPIAIJ(): Stash has 16386048 entries, uses 10 mallocs.
> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 11391 X 11391; storage space: 0 unneeded,2514537 used
> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 294
> > [0] Mat_CheckInode(): Found 11391 nodes out of 11391 rows. Not using Inode routines
> > [5] MatAssemblyEnd_SeqAIJ(): Matrix size: 11390 X 11390; storage space: 0 unneeded,2525390 used
> > [5] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [5] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 294
> > [5] Mat_CheckInode(): Found 11390 nodes out of 11390 rows. Not using Inode routines
> > [3] MatAssemblyEnd_SeqAIJ(): Matrix size: 11391 X 11391; storage space: 0 unneeded,2500281 used
> > [3] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [3] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 294
> > [3] Mat_CheckInode(): Found 11391 nodes out of 11391 rows. Not using Inode routines
> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 11391 X 11391; storage space: 0 unneeded,2500281 used
> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 294
> > [1] Mat_CheckInode(): Found 11391 nodes out of 11391 rows. Not using Inode routines
> > [4] MatAssemblyEnd_SeqAIJ(): Matrix size: 11391 X 11391; storage space: 0 unneeded,2500281 used
> > [4] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [4] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 294
> > [4] Mat_CheckInode(): Found 11391 nodes out of 11391 rows. Not using Inode routines
> > [2] MatAssemblyEnd_SeqAIJ(): Matrix size: 11391 X 11391; storage space: 0 unneeded,2525733 used
> > [2] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [2] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 294
> > [2] Mat_CheckInode(): Found 11391 nodes out of 11391 rows. Not using Inode routines
> > <petsc_info>
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Wed, 18 Jan 2012 14:03:43 -0600
> From: "Ravi Kannan" <rxk at cfdrc.com>
> Subject: Re: [petsc-users] [petsc-dev] boomerAmg scalability
> To: "'Mark F. Adams'" <mark.adams at columbia.edu>
> Cc: 'PETSc users list' <petsc-users at mcs.anl.gov>
> Message-ID: <006f01ccd61c$47c0fc80$d742f580$@com>
> Content-Type: text/plain; charset="us-ascii"
> 
> Hi Mark, Hong,
> 
> 
> 
> As you might remember, the reason for this whole exercise was to obtain a
> solution for a very stiff problem.
> 
> 
> 
> We did have Hypre BoomerAMG. This did not scale, but gives the correct
> solution. So we wanted an alternative; hence we approached you for gamg.
> 
> 
> 
> However for certain cases, gamg crashes. Even for the working cases, it
> takes about 15-20 times more sweeps than the boomer-hypre. Hence it is
> cost-prohibitive.
> 
> 
> 
> Hopefully this gamg solver can be improved in the near future, for users
> like us.
> 
> 
> 
> Warm Regards,
> 
> Ravi.
> 
> 
> 
> 
> 
> From: Mark F. Adams [mailto:mark.adams at columbia.edu]
> Sent: Wednesday, January 18, 2012 9:56 AM
> To: Hong Zhang
> Cc: rxk at cfdrc.com
> Subject: Re: [petsc-dev] boomerAmg scalability
> 
> 
> 
> Hong and Ravi,
> 
> 
> 
> I fixed a bug with the 6x6 problem.  There seemed to be a bug in
> MatTransposeMatMult with funny decomposition, that was not really verified.  So
> we can wait for Ravi to continue with his tests and fix them as they arise.
> 
> 
> 
> Mark
> 
> ps, Ravi, I may not have cc'ed so I will send again.
> 
> 
> 
> On Jan 17, 2012, at 7:37 PM, Hong Zhang wrote:
> 
> 
> 
> 
> 
> Ravi,
> 
> I wrote a simple test ex163.c (attached) on MatTransposeMatMult().
> 
> Loading your 6x6 matrix gives no error from MatTransposeMatMult()
> 
> using 1,2,...7 processes.
> 
> For example,
> 
> 
> 
> petsc-dev/src/mat/examples/tests>mpiexec -n 4 ./ex163 -f /Users/hong/Downloads/repetscdevboomeramgscalability/binaryoutput
> 
> A:
> 
> Matrix Object: 1 MPI processes
> 
>  type: mpiaij
> 
> row 0: (0, 1.66668e+06)  (1, -1.35)  (3, -0.6)
> 
> row 1: (0, -1.35)  (1, 1.66667e+06)  (2, -1.35)  (4, -0.6)
> 
> row 2: (1, -1.35)  (2, 1.66667e+06)  (5, -0.6)
> 
> row 3: (0, -0.6)  (3, 1.66668e+06)  (4, -1.35)
> 
> row 4: (1, -0.6)  (3, -1.35)  (4, 1.66667e+06)  (5, -1.35)
> 
> row 5: (2, -0.6)  (4, -1.35)  (5, 1.66667e+06)
> 
> 
> 
> C = A^T * A:
> 
> Matrix Object: 1 MPI processes
> 
>  type: mpiaij
> 
> row 0: (0, 2.77781e+12)  (1, -4.50002e+06)  (2, 1.8225)  (3, -2.00001e+06)  (4, 1.62)
> 
> row 1: (0, -4.50002e+06)  (1, 2.77779e+12)  (2, -4.50001e+06)  (3, 1.62)  (4, -2.00001e+06)  (5, 1.62)
> 
> row 2: (0, 1.8225)  (1, -4.50001e+06)  (2, 2.7778e+12)  (4, 1.62)  (5, -2.00001e+06)
> 
> row 3: (0, -2.00001e+06)  (1, 1.62)  (3, 2.77781e+12)  (4, -4.50002e+06)  (5, 1.8225)
> 
> row 4: (0, 1.62)  (1, -2.00001e+06)  (2, 1.62)  (3, -4.50002e+06)  (4, 2.77779e+12)  (5, -4.50001e+06)
> 
> row 5: (1, 1.62)  (2, -2.00001e+06)  (3, 1.8225)  (4, -4.50001e+06)  (5, 2.7778e+12)
> 
> 
> 
> Am I missing something?
> 
> 
> 
> Hong
> 
> 
> 
> On Sat, Jan 14, 2012 at 3:37 PM, Mark F. Adams <mark.adams at columbia.edu>
> wrote:
> 
> Ravi, this system is highly diagonally dominant.  I've fixed the code so you
> can pull and try again.
> 
> 
> 
> I've decided to basically just do a one level method with DD systems.  I
> don't know if that is the best semantics, I think Barry will hate it,
> because it gives you a one level solver when you asked for MG.  It now picks
> up the coarse grid solver as the solver, which is wrong, so I need to fix
> this if we decide to stick with the current semantics.
> 
> 
> 
> And again thanks for helping to pound on this code.
> 
> 
> 
> Mark
> 
> 
> 
> On Jan 13, 2012, at 6:33 PM, Ravi Kannan wrote:
> 
> 
> 
> Hi Mark, Hong,
> 
> 
> 
> Let's make it simpler. I fixed my partition bug (in metis). Now there is an
> equidivision of cells.
> 
> 
> 
> To simplify even further, let's run a much smaller case: with 6 cells
> (equations) in SERIAL. This one crashes. The out and the ksp_view_binary
> files are attached.
> 
> 
> 
> Thanks,
> 
> Ravi.
> 
> 
> 
> From: petsc-dev-bounces at mcs.anl.gov [mailto:petsc-dev-bounces at mcs.anl.gov]
> On Behalf Of Mark F. Adams
> Sent: Friday, January 13, 2012 3:00 PM
> To: For users of the development version of PETSc
> Subject: Re: [petsc-dev] boomerAmg scalability
> 
> 
> 
> Well, we do have a bug here.  It should work with zero elements on a proc,
> but the code is being actively developed so you are really helping us to
> find these cracks.
> 
> 
> 
> If it's not too hard it would be nice if you could give us these matrices,
> before you fix it, so we can fix this bug.  You can just send them to Hong and
> me (cc'ed).
> 
> 
> 
> Mark
> 
> 
> 
> On Jan 13, 2012, at 12:16 PM, Ravi Kannan wrote:
> 
> 
> 
> Hi Mark,Hong
> 
> 
> 
> Thanks for the observation w.r.t the proc 0 having 2 equations. This is a
> bug from our end. We will fix it and get back to you if needed.
> 
> 
> 
> Thanks,
> 
> Ravi.
> 
> 
> 
> From: petsc-dev-bounces at mcs.anl.gov [mailto:petsc-dev-bounces at mcs.anl.gov]
> On Behalf Of Mark F. Adams
> Sent: Thursday, January 12, 2012 10:03 PM
> To: Hong Zhang
> Cc: For users of the development version of PETSc
> Subject: Re: [petsc-dev] boomerAmg scalability
> 
> 
> 
> Ravi, can you run with -ksp_view_binary? This will produce two files.
> 
> 
> 
> Hong, ex10 will read in these files and solve them.  I will probably not be
> able to get to this until Monday.
> 
> 
> 
> Also, this matrix has just two equations on proc 0 and about 11000 on
> proc 1, so it is strangely balanced, in case that helps ...
> 
> 
> 
> Mark
> 
> 
> 
> On Jan 12, 2012, at 10:35 PM, Hong Zhang wrote:
> 
> 
> 
> 
> 
> Ravi,
> 
> 
> 
> I need more info for debugging. Can you provide a simple stand alone code
> and matrices in petsc
> 
> binary format that reproduce the error?
> 
> 
> 
> MatTransposeMatMult() for mpiaij is a newly developed subroutine - less than
> one month old
> 
> and not well tested yet :-(
> 
> I used petsc-dev/src/mat/examples/tests/ex94.c for testing.
> 
> 
> 
> Thanks,
> 
> 
> 
> Hong
> 
> On Thu, Jan 12, 2012 at 9:17 PM, Mark F. Adams <mark.adams at columbia.edu>
> wrote:
> 
> It looks like the problem is in MatTransposeMatMult and Hong (cc'ed) is
> working on it.
> 
> 
> 
> I'm hoping that your output will be enough for Hong to figure this out but I
> could not reproduce this problem with any of my tests.
> 
> 
> 
> If Hong can not figure this out then we will need to get the matrix from you
> to reproduce this.
> 
> 
> 
> Mark
> 
> 
> 
> 
> 
> On Jan 12, 2012, at 6:25 PM, Ravi Kannan wrote:
> 
> 
> 
> 
> 
> Hi Mark,
> 
> 
> 
> Any luck with the gamg bug fix?
> 
> 
> 
> Thanks,
> 
> Ravi.
> 
> 
> 
> From: petsc-dev-bounces at mcs.anl.gov [mailto:petsc-dev-bounces at mcs.anl.gov]
> On Behalf Of Mark F. Adams
> Sent: Wednesday, January 11, 2012 1:54 PM
> To: For users of the development version of PETSc
> Subject: Re: [petsc-dev] boomerAmg scalability
> 
> 
> 
> This seems to be dying earlier than it was last week, so it looks like a new
> bug in MatTransposeMatMult.
> 
> 
> 
> Mark
> 
> 
> 
> On Jan 11, 2012, at 1:59 PM, Matthew Knepley wrote:
> 
> 
> 
> On Wed, Jan 11, 2012 at 12:23 PM, Ravi Kannan <rxk at cfdrc.com> wrote:
> 
> Hi Mark,
> 
> 
> 
> I downloaded the dev version again. This time, the program crashes even
> earlier. Attached is the serial and parallel info outputs.
> 
> 
> 
> Could you kindly take a look.
> 
> 
> 
> It looks like this is a problem with MatMatMult(). Can you try to reproduce
> this using KSP ex10? You put
> 
> your matrix in binary format and use -pc_type gamg. Then you can send us the
> matrix and we can track
> 
> it down. Or are you running an example there?
> 
> 
> 
>  Thanks,
> 
> 
> 
>    Matt
> 
> 
> 
> 
> 
> 
> 
> Thanks,
> 
> Ravi.
> 
> 
> 
> From: petsc-dev-bounces at mcs.anl.gov [mailto:petsc-dev-bounces at mcs.anl.gov]
> On Behalf Of Mark F. Adams
> Sent: Monday, January 09, 2012 3:08 PM
> 
> 
> To: For users of the development version of PETSc
> Subject: Re: [petsc-dev] boomerAmg scalability
> 
> 
> 
> 
> 
> Yes, it's all checked in, just pull from dev.
> 
> Mark
> 
> 
> 
> On Jan 9, 2012, at 2:54 PM, Ravi Kannan wrote:
> 
> 
> 
> Hi Mark,
> 
> 
> 
> Thanks for your efforts.
> 
> 
> 
> Do I need to do the install from scratch once again? Or some particular
> files (check out gamg.c for instance)?
> 
> 
> 
> Thanks,
> 
> Ravi.
> 
> 
> 
> From: petsc-dev-bounces at mcs.anl.gov [mailto:petsc-dev-bounces at mcs.anl.gov]
> On Behalf Of Mark F. Adams
> Sent: Friday, January 06, 2012 10:30 AM
> To: For users of the development version of PETSc
> Subject: Re: [petsc-dev] boomerAmg scalability
> 
> 
> 
> I think I found the problem.  You will need to use petsc-dev to get the fix.
> 
> 
> 
> Mark
> 
> 
> 
> On Jan 6, 2012, at 8:55 AM, Mark F. Adams wrote:
> 
> 
> 
> Ravi, I forgot but you can just use -ksp_view_binary to output the matrix
> data (two files).  You could run it with two procs and a Jacobi solver to
> get it past the solve, where it writes the matrix (I believe).
> 
> Mark
> 
> 
> 
> On Jan 5, 2012, at 6:19 PM, Ravi Kannan wrote:
> 
> 
> 
> Just sent another email with the attachment.
> 
> 
> 
> From: petsc-dev-bounces at mcs.anl.gov [mailto:petsc-dev-bounces at mcs.anl.gov]
> On Behalf Of Jed Brown
> Sent: Thursday, January 05, 2012 5:15 PM
> To: For users of the development version of PETSc
> Subject: Re: [petsc-dev] boomerAmg scalability
> 
> 
> 
> On Thu, Jan 5, 2012 at 17:12, Ravi Kannan <rxk at cfdrc.com> wrote:
> 
> I have attached the verbose+info outputs for both the serial and the
> parallel (2 partitions). NOTE: the serial output at some location says
> PC=Jacobi! Is it implicitly converting the PC to a Jacobi?
> 
> 
> 
> Looks like you forgot the attachment.
> 
> 
> --
> What most experimenters take for granted before they begin their experiments
> is infinitely more interesting than any results to which their experiments
> lead.
> -- Norbert Wiener
> 
> 
> <out><binaryoutput><binaryoutput.info>
> 
> 
> 
> 
> 
> <ex163.c>
> 
> 
> 
> 
> End of petsc-users Digest, Vol 37, Issue 41
> *******************************************
> 


