From Stephen.R.Ball at awe.co.uk Thu Apr 3 06:34:13 2008 From: Stephen.R.Ball at awe.co.uk (Stephen R Ball) Date: Thu, 3 Apr 2008 12:34:13 +0100 Subject: Default convergence test used in PETSc Message-ID: <843CZB023912@awe.co.uk> Hi I am trying to determine the default convergence test used by PETSc. Your pdf user's guide states the default test is based on the decrease of the residual norm relative to the right-hand-side while your html documentation for KSPDefaultConverged() seems to imply that the default test is based on the decrease of the residual norm relative to the initial residual norm. Can you tell me which is actually the default? Regards Stephen From knepley at gmail.com Thu Apr 3 08:53:23 2008 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 3 Apr 2008 08:53:23 -0500 Subject: Default convergence test used in PETSc In-Reply-To: <843CZB023912@awe.co.uk> References: <843CZB023912@awe.co.uk> Message-ID: There are special cases, the explanation can be rather lengthy. I instead point you to the actual code: http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/src/ksp/ksp/interface/iterativ.c.html#KSPDefaultConverged Matt On Thu, Apr 3, 2008 at 6:34 AM, Stephen R Ball wrote: > Hi > > I am trying to determine the default convergence test used by PETSc. > Your pdf user's guide states the default test is based on the decrease > of the residual norm relative to the right-hand-side while your html > documentation for KSPDefaultConverged() seems to imply that the default > test is based on the decrease of the residual norm relative to the > initial residual norm. Can you tell me which is actually the default? > > Regards > > Stephen > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From bsmith at mcs.anl.gov Wed Apr 2 01:58:42 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 1 Apr 2008 23:58:42 -0700 Subject: Non repeatability issue In-Reply-To: <47F0FF08.9000800@unibas.it> References: <4715D89B.8070005@unibas.it> <47D6B3E1.5070606@unibas.it> <9D646B5A-2D5C-4492-82AF-8E732AF848BD@mcs.anl.gov> <47DFA037.7030707@unibas.it> <81BF6B4F-9199-41AA-B571-20186E2DE43E@mcs.anl.gov> <47F0FF08.9000800@unibas.it> Message-ID: <4DC60834-9639-47A1-A0E9-17145DF972B8@mcs.anl.gov> Try a ksp_tol of 1.e-14 instead of 1.e-12? Barry On Mar 31, 2008, at 8:11 AM, Aldo Bonfiglioli wrote: > Barry, Matt > I am back on the Non repeatability issue with answers > to your questions. > >> 2) did you do the -ksp_rtol 1.e-12 at the same time as the - >> vecscatter_reproduce? They >> must be done together. > > The enclosed plot (res_vs_step) shows the mass residual > history versus the Newton step counter. > > For these same runs, the continuation parameter (CFL) shows similar > jumps > being based upon the SER approach, see plot cfl_vs_its.pdf > >> When you just fix the CFL and run Newton runs to completion >> is it stable? > > I have restarted the code from an almost fully converged solution > using infinite CFL and let it run for 30 Newton steps. > The behaviour is much more "reasonable" and the solution > remains within the steady state (see plot restarted....) > >> Then if you ramp up the CFL much more slowly is it stable and Newton >> convergence much smoother? > > I have not tried yet. 
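For readers following the thread: SER (switched evolution relaxation) grows the CFL / pseudo-timestep in proportion to the drop in the nonlinear residual between Newton steps. A minimal sketch of that update is below; the cap and floor are illustrative values, not taken from Aldo's code.

    #include "petsc.h"

    /* SER-style CFL update: grow the CFL as the nonlinear residual drops.
       The limits below are illustrative, not Aldo's settings.            */
    PetscReal SERUpdateCFL(PetscReal cfl,PetscReal rnorm_old,PetscReal rnorm_new)
    {
      PetscReal cflnew = cfl*(rnorm_old/rnorm_new);
      if (cflnew > 1.e12)   cflnew = 1.e12;    /* effectively "infinite" CFL  */
      if (cflnew < 0.1*cfl) cflnew = 0.1*cfl;  /* limit how fast CFL may drop */
      return cflnew;
    }

Because the update is tied directly to the residual ratio, jumps in the residual history translate into jumps of the continuation parameter, which is what the cfl_vs_its plot shows.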
> I know there exist smoother strategies > than SER to rump the continuation parameter > (I know this > http://www.cs.kuleuven.ac.be/publicaties/rapporten/tw/TW304.ps.gz > for instance) > > > Aldo > > -- > Dr. Aldo Bonfiglioli > Dip.to di Ingegneria e Fisica dell'Ambiente (DIFA) > Universita' della Basilicata > V.le dell'Ateneo lucano, 10 85100 Potenza ITALY > tel:+39.0971.205203 fax:+39.0971.205160 > > > < > res_vs_step > .pdf> From amjad11 at gmail.com Sun Apr 6 23:49:58 2008 From: amjad11 at gmail.com (amjad ali) Date: Mon, 7 Apr 2008 09:49:58 +0500 Subject: Installation with Intel or PGI compilers Message-ID: <428810f20804062149o6429f771l4a35903b2ff9dbd@mail.gmail.com> *Hello all, I installed PETSc with intel compilers. Please comment on that what is the difference between the PETSc installed with gnu compilers and the PETSC installed with intel compilers. Any difference in efficiency? or what so ever? What you say if we intall PETSc with PGI compilers and also we use MPI-profiler/debugger (available in PGI Cluster Toolkit) for PETSc applications? Is it possible? and beneficial? with best regards, Amjad Ali.* -------------- next part -------------- An HTML attachment was scrubbed... URL: From petsc-maint at mcs.anl.gov Sun Apr 6 23:59:07 2008 From: petsc-maint at mcs.anl.gov (Satish Balay) Date: Sun, 6 Apr 2008 23:59:07 -0500 (CDT) Subject: Installation with Intel or PGI compilers In-Reply-To: <428810f20804062149o6429f771l4a35903b2ff9dbd@mail.gmail.com> References: <428810f20804062149o6429f771l4a35903b2ff9dbd@mail.gmail.com> Message-ID: On Mon, 7 Apr 2008, amjad ali wrote: > *Hello all, > > I installed PETSc with intel compilers. Please comment on that what is the > difference between the PETSc installed with gnu compilers and the PETSC > installed with intel compilers. Any difference in efficiency? or what so > ever? > > What you say if we intall PETSc with PGI compilers and also we use > MPI-profiler/debugger (available in PGI Cluster Toolkit) for PETSc > applications? Is it possible? and beneficial? We expect Intel compilers to be able to optimize code better than GNU compilers. Perhaps PGI might do the same. Its best to run your code [with PETSc compiled with the desired compilers - and optimization options] and run with -log_summary to compare performance differences. And we have no experience with PGI Cluster Toolkit or its usefulness. Note: The choice of compilers [and their cost/benifit] varies depending on the usage. - For code development - debuggability matters. gnu compilers work reasonably well for this usage [with gdb, valgrind, etc..] - For production runs, a 10% performance difference might not matter for code that runs for less than an hour. But it might be significant for long-running jobs. etc.. Satish From dave.knez at gmail.com Mon Apr 7 04:52:30 2008 From: dave.knez at gmail.com (David Knezevic) Date: Mon, 07 Apr 2008 10:52:30 +0100 Subject: Stalling once linear system becomes a certain size Message-ID: <47F9EEDE.1050604@gmail.com> Hello, I am trying to run a PETSc code on a parallel machine (it may be relevant that each node contains four AMD Opteron Quad-Core 64-bit processors (16 cores in all) as an SMP unit with 32GB of memory) and I'm observing some behaviour I don't understand. 
I'm using PETSC_COMM_SELF in order to construct the same matrix on each processor (and solve the system with a different right-hand side vector on each processor), and when each linear system is around 315x315 (block-sparse), then each linear system is solved very quickly on each processor (approx 7x10^{-4} seconds), but when I increase the size of the linear system to 350x350 (or larger), the linear solves completely stall. I've tried a number of different solvers and preconditioners, but nothing seems to help. Also, this code has worked very well on other machines, although the machines I have used it on before have not had this architecture in which each node is an SMP unit. I was wondering if you have observed this kind of issue before? I'm using PETSc 2.3.3, compiled with the Intel 10.1 compiler. Thanks very much, David From niko.karin at gmail.com Mon Apr 7 08:16:44 2008 From: niko.karin at gmail.com (Nicolas Tardieu) Date: Mon, 7 Apr 2008 09:16:44 -0400 Subject: Installation with Intel or PGI compilers In-Reply-To: <428810f20804062149o6429f771l4a35903b2ff9dbd@mail.gmail.com> References: <428810f20804062149o6429f771l4a35903b2ff9dbd@mail.gmail.com> Message-ID: Hi, I have some troubles using PETSc compiled with Intel compilers (version 10.1) in Fortran language in parallel on a 64 bits machine. The PetscInitialize always fails. In order to make it work, I have to make the following changes in petscconf.h. 320,321c320,321 < #ifndef PETSC_HAVE_IPXFARGC_ < #define PETSC_HAVE_IPXFARGC_ 1 --- > #ifndef PETSC_HAVE_IARGC_ > #define PETSC_HAVE_IARGC_ 1 488,489c488,489 < #ifndef PETSC_HAVE_PXFGETARG_NEW < #define PETSC_HAVE_PXFGETARG_NEW 1 --- > #ifndef PETSC_HAVE_BGL_IARGC > #define PETSC_HAVE_BGL_IARGC 1 Once this is done, PetscInitialize and the rest of the code works fine. Strange, isn't it..... Nicolas 2008/4/7, amjad ali : > > *Hello all, > > I installed PETSc with intel compilers. Please comment on that what is the > difference between the PETSc installed with gnu compilers and the PETSC > installed with intel compilers. Any difference in efficiency? or what so > ever? > > What you say if we intall PETSc with PGI compilers and also we use > MPI-profiler/debugger (available in PGI Cluster Toolkit) for PETSc > applications? Is it possible? and beneficial? > > with best regards, > Amjad Ali.* -------------- next part -------------- An HTML attachment was scrubbed... URL: From balay at mcs.anl.gov Mon Apr 7 08:28:18 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 7 Apr 2008 08:28:18 -0500 (CDT) Subject: Stalling once linear system becomes a certain size In-Reply-To: <47F9EEDE.1050604@gmail.com> References: <47F9EEDE.1050604@gmail.com> Message-ID: On Mon, 7 Apr 2008, David Knezevic wrote: > Hello, > > I am trying to run a PETSc code on a parallel machine (it may be relevant that > each node contains four AMD Opteron Quad-Core 64-bit processors (16 cores in > all) as an SMP unit with 32GB of memory) and I'm observing some behaviour I > don't understand. > > I'm using PETSC_COMM_SELF in order to construct the same matrix on each > processor (and solve the system with a different right-hand side vector on > each processor), and when each linear system is around 315x315 (block-sparse), > then each linear system is solved very quickly on each processor (approx > 7x10^{-4} seconds), but when I increase the size of the linear system to > 350x350 (or larger), the linear solves completely stall. I've tried a number > of different solvers and preconditioners, but nothing seems to help. 
Also, > this code has worked very well on other machines, although the machines I have > used it on before have not had this architecture in which each node is an SMP > unit. I was wondering if you have observed this kind of issue before? > > I'm using PETSc 2.3.3, compiled with the Intel 10.1 compiler. I would sugest running the code in a debugger to determine the exact location where the stall happens [with the minimum number of procs] mpiexec -n 4 ./exe -start_in_debugger By default the above tries to open xterms on the localhost - so to get this working on the cluster - you might need proper ssh-x11-portforwarding setup to the node, and then use the extra command line option '-display' [when the job kinda hangs - I would do ctrl-c in gdb and look at the stack trace on each mpi-thread] Satish From balay at mcs.anl.gov Mon Apr 7 08:29:31 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 7 Apr 2008 08:29:31 -0500 (CDT) Subject: Installation with Intel or PGI compilers In-Reply-To: References: <428810f20804062149o6429f771l4a35903b2ff9dbd@mail.gmail.com> Message-ID: Please send the corresponding confiure.log to petsc-maint at mcs.anl.gov Satish On Mon, 7 Apr 2008, Nicolas Tardieu wrote: > Hi, > > I have some troubles using PETSc compiled with Intel compilers (version > 10.1) in Fortran language in parallel on a 64 bits machine. The > PetscInitialize always fails. In order to make it work, I have to make the > following changes in petscconf.h. > > 320,321c320,321 > < #ifndef PETSC_HAVE_IPXFARGC_ > < #define PETSC_HAVE_IPXFARGC_ 1 > --- > > #ifndef PETSC_HAVE_IARGC_ > > #define PETSC_HAVE_IARGC_ 1 > 488,489c488,489 > < #ifndef PETSC_HAVE_PXFGETARG_NEW > < #define PETSC_HAVE_PXFGETARG_NEW 1 > --- > > #ifndef PETSC_HAVE_BGL_IARGC > > #define PETSC_HAVE_BGL_IARGC 1 > > Once this is done, PetscInitialize and the rest of the code works fine. > Strange, isn't it..... > > Nicolas From knepley at gmail.com Mon Apr 7 08:34:19 2008 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 7 Apr 2008 08:34:19 -0500 Subject: Stalling once linear system becomes a certain size In-Reply-To: References: <47F9EEDE.1050604@gmail.com> Message-ID: It sounds like he is saying that the iterative solvers fail to converge. It could be that the systems become much more ill-conditioned. When solving anything, first use LU -ksp_type preonly -pc_type lu to determine if the system is consistent. Then use something simple, like GMRES by itself -ksp_type gmres -pc_type none -ksp_monitor_singular_value -ksp_gmres_restart 500 to get an idea of the condition number. Then start trying other solvers and PCs. Matt On Mon, Apr 7, 2008 at 8:28 AM, Satish Balay wrote: > > On Mon, 7 Apr 2008, David Knezevic wrote: > > > Hello, > > > > I am trying to run a PETSc code on a parallel machine (it may be relevant that > > each node contains four AMD Opteron Quad-Core 64-bit processors (16 cores in > > all) as an SMP unit with 32GB of memory) and I'm observing some behaviour I > > don't understand. > > > > I'm using PETSC_COMM_SELF in order to construct the same matrix on each > > processor (and solve the system with a different right-hand side vector on > > each processor), and when each linear system is around 315x315 (block-sparse), > > then each linear system is solved very quickly on each processor (approx > > 7x10^{-4} seconds), but when I increase the size of the linear system to > > 350x350 (or larger), the linear solves completely stall. 
I've tried a number > > of different solvers and preconditioners, but nothing seems to help. Also, > > this code has worked very well on other machines, although the machines I have > > used it on before have not had this architecture in which each node is an SMP > > unit. I was wondering if you have observed this kind of issue before? > > > > I'm using PETSc 2.3.3, compiled with the Intel 10.1 compiler. > > I would sugest running the code in a debugger to determine the exact > location where the stall happens [with the minimum number of procs] > > mpiexec -n 4 ./exe -start_in_debugger > > By default the above tries to open xterms on the localhost - so to get > this working on the cluster - you might need proper > ssh-x11-portforwarding setup to the node, and then use the extra > command line option '-display' > > [when the job kinda hangs - I would do ctrl-c in gdb and look at the > stack trace on each mpi-thread] > > Satish > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From balay at mcs.anl.gov Mon Apr 7 09:42:27 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 7 Apr 2008 09:42:27 -0500 (CDT) Subject: Stalling once linear system becomes a certain size In-Reply-To: References: <47F9EEDE.1050604@gmail.com> Message-ID: Matt, > > I'm using PETSC_COMM_SELF in order to construct the same matrix > > on each processor (and solve the system with a different > > right-hand side vector on each processor), So its a bunch of similar sequential solves - over PETSC_COMM_SELF. So a seq solve on a given mpi-thread should not affect another seq solve on another thread.. Satish On Mon, 7 Apr 2008, Matthew Knepley wrote: > It sounds like he is saying that the iterative solvers fail to > converge. It could be > that the systems become much more ill-conditioned. When solving anything, > first use LU > > -ksp_type preonly -pc_type lu > > to determine if the system is consistent. Then use something simple, like > GMRES by itself > > -ksp_type gmres -pc_type none -ksp_monitor_singular_value > -ksp_gmres_restart 500 > > to get an idea of the condition number. Then start trying other solvers and PCs. > > Matt > > On Mon, Apr 7, 2008 at 8:28 AM, Satish Balay wrote: > > > > On Mon, 7 Apr 2008, David Knezevic wrote: > > > > > Hello, > > > > > > I am trying to run a PETSc code on a parallel machine (it may be relevant that > > > each node contains four AMD Opteron Quad-Core 64-bit processors (16 cores in > > > all) as an SMP unit with 32GB of memory) and I'm observing some behaviour I > > > don't understand. > > > > > > I'm using PETSC_COMM_SELF in order to construct the same matrix on each > > > processor (and solve the system with a different right-hand side vector on > > > each processor), and when each linear system is around 315x315 (block-sparse), > > > then each linear system is solved very quickly on each processor (approx > > > 7x10^{-4} seconds), but when I increase the size of the linear system to > > > 350x350 (or larger), the linear solves completely stall. I've tried a number > > > of different solvers and preconditioners, but nothing seems to help. Also, > > > this code has worked very well on other machines, although the machines I have > > > used it on before have not had this architecture in which each node is an SMP > > > unit. I was wondering if you have observed this kind of issue before? 
> > > > > > I'm using PETSc 2.3.3, compiled with the Intel 10.1 compiler. > > > > I would sugest running the code in a debugger to determine the exact > > location where the stall happens [with the minimum number of procs] > > > > mpiexec -n 4 ./exe -start_in_debugger > > > > By default the above tries to open xterms on the localhost - so to get > > this working on the cluster - you might need proper > > ssh-x11-portforwarding setup to the node, and then use the extra > > command line option '-display' > > > > [when the job kinda hangs - I would do ctrl-c in gdb and look at the > > stack trace on each mpi-thread] > > > > Satish > > > > > > > > From rlmackie862 at gmail.com Mon Apr 7 14:27:01 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Mon, 07 Apr 2008 12:27:01 -0700 Subject: error in creating 3d DA Message-ID: <47FA7585.3060700@gmail.com> I've run into a problem with my code where, for a smaller problem, it bombs out in creating a 3D DA (with error message about the partition being too fine in the z direction) for the case where np=121, but works fine for the case np=484. I would think that the creation of the DA should work fine for the smaller number of processors as well, but maybe there is a bug in the logic? Randy From knepley at gmail.com Mon Apr 7 15:20:06 2008 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 7 Apr 2008 15:20:06 -0500 Subject: error in creating 3d DA In-Reply-To: <47FA7585.3060700@gmail.com> References: <47FA7585.3060700@gmail.com> Message-ID: On Mon, Apr 7, 2008 at 2:27 PM, Randall Mackie wrote: > I've run into a problem with my code where, for a smaller problem, it > bombs out in creating a 3D DA (with error message about the partition being > too fine in the z direction) for the case where np=121, but works fine > for the case np=484. > > I would think that the creation of the DA should work fine for the smaller > number of processors as well, but maybe there is a bug in the logic? DA does only the very simplest partitioning. Thus, the number of processors must factor into np = np_x * np_y * np_z. However, things will be clearer if you send the actual error message to petsc-maint at mcs.anl.gov. Matt > Randy -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From tyoung at ippt.gov.pl Tue Apr 8 05:40:10 2008 From: tyoung at ippt.gov.pl (Toby D. Young) Date: Tue, 8 Apr 2008 12:40:10 +0200 Subject: MatTranspose Message-ID: <20080408124010.7d183e23@rav.ippt.gov.pl> Hello all. I confused about the statement on MatTranspose() on the manual pages at http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatTranspose.html where for #include "petscmat.h" PetscErrorCode MatTranspose(Mat mat,Mat *B) is the statement: Notes If you pass in PETSC_NULL for B an in-place transpose in mat will be done Does this mean that if I pass PETSC_NULL then the matrix "A" will be returned as its own transpose? Does this save memory if I do not need the original matrix and only its transpose? If not, is there an efficient way to destroy the original matrix, thus keeping the transpose only? Can anyone please clarify for me what this statement means? ...and finally thanks to all for answering my previous confused questions. :-) Best, Toby -- Toby D. Young - Adiunkt (Assistant Professor) Department of Computational Science Institute of Fundamental Technological Research Polish Academy of Sciences Room 206, ul. 
Swietokrzyska 21 00-049 Warszawa, Polska +48 22 826 12 81 ext. 184 http://rav.ippt.gov.pl/~tyoung From knepley at gmail.com Tue Apr 8 07:34:27 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 8 Apr 2008 07:34:27 -0500 Subject: MatTranspose In-Reply-To: <20080408124010.7d183e23@rav.ippt.gov.pl> References: <20080408124010.7d183e23@rav.ippt.gov.pl> Message-ID: On Tue, Apr 8, 2008 at 5:40 AM, Toby D. Young wrote: > > > Hello all. > > I confused about the statement on MatTranspose() on the manual pages at > > http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatTranspose.html > > where for > > #include "petscmat.h" > PetscErrorCode MatTranspose(Mat mat,Mat *B) > > is the statement: > > Notes > If you pass in PETSC_NULL for B an in-place transpose in mat will be > done > > Does this mean that if I pass PETSC_NULL then the matrix "A" will be > returned as its own transpose? Does this save memory if I do not need Yes. > the original matrix and only its transpose? If not, is there an Yes. Matt > efficient way to destroy the original matrix, thus keeping the > transpose only? > > Can anyone please clarify for me what this statement means? > > ...and finally thanks to all for answering my previous confused > questions. :-) > > Best, > Toby > > > > -- > > Toby D. Young - Adiunkt (Assistant Professor) > Department of Computational Science > Institute of Fundamental Technological Research > Polish Academy of Sciences > Room 206, ul. Swietokrzyska 21 > 00-049 Warszawa, Polska > > +48 22 826 12 81 ext. 184 > http://rav.ippt.gov.pl/~tyoung > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From balay at mcs.anl.gov Tue Apr 8 08:52:06 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Tue, 8 Apr 2008 08:52:06 -0500 (CDT) Subject: MatTranspose In-Reply-To: References: <20080408124010.7d183e23@rav.ippt.gov.pl> Message-ID: On Tue, 8 Apr 2008, Matthew Knepley wrote: > On Tue, Apr 8, 2008 at 5:40 AM, Toby D. Young wrote: > > > > > > Hello all. > > > > I confused about the statement on MatTranspose() on the manual pages at > > > > http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatTranspose.html > > > > where for > > > > #include "petscmat.h" > > PetscErrorCode MatTranspose(Mat mat,Mat *B) > > > > is the statement: > > > > Notes > > If you pass in PETSC_NULL for B an in-place transpose in mat will be > > done > > > > Does this mean that if I pass PETSC_NULL then the matrix "A" will be > > returned as its own transpose? Does this save memory if I do not need > > Yes. > > > the original matrix and only its transpose? If not, is there an > > Yes. > > Matt > > > efficient way to destroy the original matrix, thus keeping the > > transpose only? Jut a note: MatTranspose(A,PETSC_NULL) is *almost* equivalent to: MatTranspose(A,&B) MatDestroy(A) A=B So there is temporary increase in memory usage - until the original matrix is deallocated. Satish From tyoung at ippt.gov.pl Tue Apr 8 09:15:37 2008 From: tyoung at ippt.gov.pl (Toby D. Young) Date: Tue, 8 Apr 2008 16:15:37 +0200 (CEST) Subject: MatTranspose In-Reply-To: References: <20080408124010.7d183e23@rav.ippt.gov.pl> Message-ID: Problem cleared up. :-) Thank you Matt and Satish! Best, Toby ----- Toby D. 
Young - Adiunkt (Assistant Professor) Department of Computational Science Institute of Fundamental Technological Research Polish Academy of Sciences Room 206, ul. Swietokrzyska 21 00-049 Warszawa, Polska +48 22 826 12 81 ext. 184 http://rav.ippt.gov.pl/~tyoung From jinzishuai at yahoo.com Tue Apr 8 19:16:41 2008 From: jinzishuai at yahoo.com (Shi Jin) Date: Tue, 8 Apr 2008 17:16:41 -0700 (PDT) Subject: Further question about PC with Jaocbi Row Sum Message-ID: <158168.69319.qm@web36208.mail.mud.yahoo.com> Hi there, First of all, I want to thank Matt for his previous suggestion on the use of -pc_jacobi_rowsum 1 option. Now I have a more theoretical question I hope you may help although it does not necessarily connects to PETsc directly. Basically, I am trying to speed up the solution of a finite element mass matrix, which is constructed on a second order tetrahedral element. My idea is to use the lumped mass matrix to precondition it. However, it does not seem to work originally but I guessed it may has something to do with the fact that it is second order since the theory for second order lumped mass matrix is not so clear, at least to me. So I decided to work out the linear element first, since everything there is mathematically well established. OK. I first constructed a mass matrix based on linear elements. I can construct its lumped mass matrix by three methods: 1. sum of each row 2. scale the diagonal entry 3. use a nodal quadrature rule to construct a diagonal matrix These three methods turned out to produce identical diagonal matrices as the lumped mass matrix, just as the theory predicts. So perfect! I can further test that the lumped mass matrix has similar eigenvalues to the original consistent mass matrix, although different. And the solutions of a linear system is quite similar too. So I understand the reason lots of people replace solving the consistent mass matrix with solving the lumped one to achieve much improved efficiency but without losing much of the accuracy. So far, it is all making sense. I naturally think it could be very helpful if I can use this lumped mass matrix as the prediction matrix for the consistent mass matrix solver, where the consistent matrix is kept to have the better accuracy. However, my tests show that it does not help at all. I tried several ways to do it within PETSc, such as setting KSPSetOperators( solM, M, lumpedM SAME_PRECONDITIONER); or directly use the -pc_type jacobi -pc_jacobi_rowsum 1 to the built-in method. These two methods turned out to be equivalent but they both produce less efficient solutions: it actually took twice more steps to converge than without these options. This is quite puzzling to me. Although I have to admit that I have seen a lot of replacing the consistent mass matrix with the lumped one in the literature but have not seen much of using the lumped mass matrix as a preconditioner. Maybe using the lumped matrix for preconditioning simply does not work? I would love to hear some comment here. If that's all, I don't feel too bad. Then I came back to the second order elements since that's what I want to use. Accidentally I decided to try to solve the second order consistent mass matrix with the -pc_type jacobi -pc_jacobi_rowsum 1 option. Bang! It converges almost three times faster. For a particular system, it usually converges in 9-10 iterations and now it converges in 2-3 iterations. This is amazing! But I don't know why it is so. If that's all, I would be just happy. 
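For concreteness, the first approach mentioned above (consistent matrix as the operator, lumped matrix as the preconditioning matrix) looks roughly like this with the PETSc 2.3.3-era calling sequence; the variable names are illustrative:

    #include "petscksp.h"

    /* Solve with the consistent mass matrix M, but let the preconditioner
       be built from the lumped (diagonal) matrix Mlumped.                */
    PetscErrorCode SolveMass(Mat M,Mat Mlumped,Vec b,Vec x)
    {
      KSP solM;
      KSPCreate(PETSC_COMM_WORLD,&solM);
      KSPSetOperators(solM,M,Mlumped,SAME_PRECONDITIONER);
      KSPSetType(solM,KSPCG);
      KSPSetFromOptions(solM);
      KSPSolve(solM,b,x);
      KSPDestroy(solM);
      return 0;
    }

With -pc_type jacobi this uses the diagonal of Mlumped, i.e. the row sums of M, which is why it behaves the same as -pc_type jacobi -pc_jacobi_rowsum 1 applied to M itself.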
Then I ran my single particle sedimentation code with -pc_type jacobi -pc_jacobi_rowsum 1 and it does run a lot faster. However, the results I got are slightly different from what I used to, which is weird since the only thing changed is the preconditioner while the same linear system was solved. I tried several -pc_type options, and they are all consistent with the old one. So I am a little bit hesitant adapting this new speed up method. What troubles me most is that the new simulation results are actually even closer to the experiments we are comparing, which may suggest that the row sum PC is even better. But this is just one test case I would rather believe it happens to cause errors in the direction to compensate other simulation errors. If I had other well established test case which unfortunately I don't, I would imagine it may work differently. So my strongest puzzle is that how could a change in the pre-conditioner make such an observable change in the solutions. I understand different PCs produce different solutions but they should be numerically very close and non-detectable on a physical quantity level plot, right? Is there something particular about this rowsum method? I apologize about this lengthy email but I do hope to have some in depth scientific discussion. Thank you very much. Shi Jin, PhD ____________________________________________________________________________________ You rock. That's why Blockbuster's offering you one month of Blockbuster Total Access, No Cost. http://tc.deals.yahoo.com/tc/blockbuster/text5.com From bsmith at mcs.anl.gov Tue Apr 8 21:38:00 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 8 Apr 2008 21:38:00 -0500 Subject: Further question about PC with Jaocbi Row Sum In-Reply-To: <158168.69319.qm@web36208.mail.mud.yahoo.com> References: <158168.69319.qm@web36208.mail.mud.yahoo.com> Message-ID: On Apr 8, 2008, at 7:16 PM, Shi Jin wrote: > Hi there, > > First of all, I want to thank Matt for his previous suggestion on > the use of -pc_jacobi_rowsum 1 option. Now I have a more theoretical > question I hope you may help although it does not necessarily > connects to PETsc directly. > > Basically, I am trying to speed up the solution of a finite element > mass matrix, which is constructed on a second order tetrahedral > element. My idea is to use the lumped mass matrix to precondition > it. However, it does not seem to work originally but I guessed it > may has something to do with the fact that it is second order since > the theory for second order lumped mass matrix is not so clear, at > least to me. So I decided to work out the linear element first, > since everything there is mathematically well established. > > OK. I first constructed a mass matrix based on linear elements. I > can construct its lumped mass matrix by three methods: > 1. sum of each row > 2. scale the diagonal entry > 3. use a nodal quadrature rule to construct a diagonal matrix > These three methods turned out to produce identical diagonal > matrices as the lumped mass matrix, just as the theory predicts. So > perfect! > I can further test that the lumped mass matrix has similar > eigenvalues to the original consistent mass matrix, although > different. And the solutions of a linear system is quite similar > too. So I understand the reason lots of people replace solving the > consistent mass matrix with solving the lumped one to achieve much > improved efficiency but without losing much of the accuracy. So far, > it is all making sense. 
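A quick 1D sanity check of the three lumpings quoted above: for a linear element of length h the consistent element mass matrix is (h/6)*[2 1; 1 2]. Row summing gives h/2 per node; keeping the diagonal (h/3 per node) and rescaling so the total mass h is preserved again gives h/2; and trapezoidal (nodal) quadrature produces diag(h/2, h/2) directly. All three recipes coincide for linear elements, exactly as reported; for quadratic elements they no longer agree in general, which is part of why the second-order case is murkier.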
> > I naturally think it could be very helpful if I can use this lumped > mass matrix as the prediction matrix for the consistent mass > matrix solver, where the consistent matrix is kept to have > the better accuracy. However, my tests show that it does not help > at all. I tried several ways to do it within PETSc, such as setting > KSPSetOperators( solM, M, lumpedM SAME_PRECONDITIONER); > or directly use the -pc_type jacobi -pc_jacobi_rowsum 1 to the > built-in method. > These two methods turned out to be equivalent but they both produce > less efficient solutions: it actually took twice more steps to > converge than without these options. > This is quite puzzling to me. Although I have to admit that I have > seen a lot of replacing the consistent mass matrix with the lumped > one in the literature but have not seen much of using the lumped > mass matrix as a preconditioner. Maybe using the lumped matrix for > preconditioning simply does not work? I would love to hear some > comment here. The lumped mass matrix being a good replacement for the mass matrix is a question about approximation. How good is each to the continuous L_2 norm? It isn't really a question about how close each is to the other. Being a good preconditioner is a linear algebra question, what is the (complex) relationship between the eigenvalues and eigenvectors of the two matrices? (what happens to the eigenvalues of B^{-1} M?) I think these two questions are distinct, intuitively they seem to be related, but mathematically I don't think there is a direct relationship so I am not surprised by your observations. By the way, I have seen cases where using the lumped mass matrix resulted in BETTER approximation to the continuous solution then using the "true" mass matrix; again this is counter intuitive but there is nothing mathematically that says it shouldn't be. > > > If that's all, I don't feel too bad. Then I came back to the second > order elements since that's what I want to use. Accidentally I > decided to try to solve the second order consistent mass matrix > with the -pc_type jacobi -pc_jacobi_rowsum 1 option. Bang! It > converges almost three times faster. For a particular system, it > usually converges in 9-10 iterations and now it converges in 2-3 > iterations. This is amazing! But I don't know why it is so. > > If that's all, I would be just happy. Then I ran my single particle > sedimentation code with -pc_type jacobi -pc_jacobi_rowsum 1 and it > does run a lot faster. However, the results I got are slightly > different from what I used to, which is weird since the only thing > changed is the preconditioner while the same linear system was > solved. I tried several -pc_type options, and they are all > consistent with the old one. So I am a little bit hesitant adapting > this new speed up method. What troubles me most is that the new > simulation results are actually even closer to the experiments we > are comparing, which may suggest that the row sum PC is even better. > But this is just one test case I would rather believe it happens to > cause errors in the direction to compensate other simulation > errors. If I had other well established test case which > unfortunately I don't, I would imagine it may work differently. > > So my strongest puzzle is that how could a change in the pre- > conditioner make such an observable change in the solutions. I > understand different PCs produce different solutions but they should > be numerically very close and non-detectable on a physical quantity > level plot, right? 
Yes, so long as you use a tight enough convergence tolerance with KSPSetTolerances(). Also by default with most KSP solvers the "preconditioned" residual norm is used to determine convergence, thus in some way the preconditioner helps determines when the KSP stops. > Is there something particular about this rowsum method? No. If you use a -ksp_rtol of 1.e-12 and still get different answers, this needs to be investigated. Barry > > > I apologize about this lengthy email but I do hope to have some in > depth scientific discussion. > Thank you very much. > > Shi Jin, PhD > > > > > > ____________________________________________________________________________________ > You rock. That's why Blockbuster's offering you one month of > Blockbuster Total Access, No Cost. > http://tc.deals.yahoo.com/tc/blockbuster/text5.com > From recrusader at gmail.com Wed Apr 9 12:02:41 2008 From: recrusader at gmail.com (Yujie) Date: Wed, 9 Apr 2008 10:02:41 -0700 Subject: about MatMatMultTranspose_seqdense_seqdense() Message-ID: <7ff0ee010804091002u69766386gfb1b31c1ede7b755@mail.gmail.com> hi, everyone My codes are as follows: ierr=MatGetSubMatrices(tempM_mat,1,&is_row,&is_col,MAT_INITIAL_MATRIX,&tempA_mat); CHKERRQ(ierr); A_mat=*tempA_mat; ierr=MatDestroy(tempM_mat);CHKERRQ(ierr); ierr=MatGetSize(A_mat,&M,&N);CHKERRQ(ierr); //AtA ierr=MatMatMultTranspose(A_mat,A_mat,MAT_INITIAL_MATRIX,fill,&AtA_mat); I get a seqdense submatrix "A_mat" by MatGetSubMatrices(). I further get At*A by MatMatMultTranspose(). However, I meet an error: " ** On entry to DGEMM parameter number 8 had an illegal value" I debug my codes. In MatMatMultTranspose_seqdense_seqdense(), the codes call "BLASgemm_("T","N",&m,&n,&k,&_DOne,a->v,&a->lda,b->v,&b->lda,&_DZero,c->v,&c->lda);" I don't know the meaning of the 8th parameters"&a->lda". In my codes, its value is "0". Are there any problems in my codes? could you give me some advice? thanks a lot. Regards, Yujie -------------- next part -------------- An HTML attachment was scrubbed... URL: From Amit.Itagi at seagate.com Wed Apr 9 13:35:44 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 9 Apr 2008 14:35:44 -0400 Subject: DA question Message-ID: Hi, Is it possible to use DA to perform finite differences on two staggered regular grids (as in the electromagnetic finite difference time domain method) ? Surrounding nodes from one grid are used to update the value in the dual grid. In addition, local manipulations need to be done on the nodal values. Thanks Rgds, Amit From berend at chalmers.se Wed Apr 9 13:59:49 2008 From: berend at chalmers.se (Berend van Wachem) Date: Wed, 09 Apr 2008 20:59:49 +0200 Subject: DA question In-Reply-To: References: Message-ID: <47FD1225.4020704@chalmers.se> Dear Amit, Could you explain how the two grids are attached? I am using multiple DA's for multiple structured grids glued together. I've done the gluing with setting up various IS objects. From the multiple DA's, one global variable vector is formed. Is that what you are looking for? Best regards, Berend. Amit.Itagi at seagate.com wrote: > Hi, > > Is it possible to use DA to perform finite differences on two staggered > regular grids (as in the electromagnetic finite difference time domain > method) ? Surrounding nodes from one grid are used to update the value in > the dual grid. In addition, local manipulations need to be done on the > nodal values. 
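For readers who want to try the gluing Berend describes, one common way to combine the global vectors of two DAs into a single MPI vector uses an IS plus a VecScatter per grid. This is generic PETSc usage rather than Berend's exact construction; the index arrays giving each DA's target positions in the combined vector must be built from the application's own grid layout, and the argument orders follow the 2.3.3-era API:

    #include "petscda.h"

    /* Sketch only: wiring shown, cleanup and error checking omitted.
       idx1/idx2 map each DA's entries to positions in the combined vector.
       In a real code v1,v2 are filled first and the scatters are reused. */
    PetscErrorCode GlueTwoDAs(DA da1,DA da2,PetscInt nloc,
                              PetscInt n1,PetscInt idx1[],
                              PetscInt n2,PetscInt idx2[],Vec *vglob)
    {
      Vec        v1,v2;
      IS         is1,is2;
      VecScatter sc1,sc2;

      DACreateGlobalVector(da1,&v1);
      DACreateGlobalVector(da2,&v2);
      VecCreateMPI(PETSC_COMM_WORLD,nloc,PETSC_DETERMINE,vglob);
      ISCreateGeneral(PETSC_COMM_WORLD,n1,idx1,&is1);
      ISCreateGeneral(PETSC_COMM_WORLD,n2,idx2,&is2);
      VecScatterCreate(v1,PETSC_NULL,*vglob,is1,&sc1);  /* all of v1 -> is1 */
      VecScatterCreate(v2,PETSC_NULL,*vglob,is2,&sc2);  /* all of v2 -> is2 */
      VecScatterBegin(v1,*vglob,INSERT_VALUES,SCATTER_FORWARD,sc1);
      VecScatterEnd(v1,*vglob,INSERT_VALUES,SCATTER_FORWARD,sc1);
      VecScatterBegin(v2,*vglob,INSERT_VALUES,SCATTER_FORWARD,sc2);
      VecScatterEnd(v2,*vglob,INSERT_VALUES,SCATTER_FORWARD,sc2);
      return 0;
    }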
> > Thanks > > Rgds, > Amit > From knepley at gmail.com Wed Apr 9 14:10:19 2008 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 9 Apr 2008 14:10:19 -0500 Subject: DA question In-Reply-To: References: Message-ID: DAs only know about vertex values. You can simulate staggered grids by storing one grid on top of the other, so each vertex has two values. Matt On Wed, Apr 9, 2008 at 1:35 PM, wrote: > > Hi, > > Is it possible to use DA to perform finite differences on two staggered > regular grids (as in the electromagnetic finite difference time domain > method) ? Surrounding nodes from one grid are used to update the value in > the dual grid. In addition, local manipulations need to be done on the > nodal values. > > Thanks > > Rgds, > Amit > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From jed at 59A2.org Wed Apr 9 14:13:37 2008 From: jed at 59A2.org (Jed Brown) Date: Wed, 9 Apr 2008 21:13:37 +0200 Subject: Forming a sparse approximation of a MatShell Message-ID: <20080409191337.GA6137@brakk.ethz.ch> I'm trying to improve the preconditioning of my spectral collocation method for non-Newtonian incompressible Stokes flow. My current algorithm uses MatShell for the full Jacobian as well as each of its blocks [A B1'; B2 0] and the Schur complement S = -B2*A*B1'. I needed a preconditioner for A so I thought I'd solve the same problem using finite differences on the Chebyshev nodes. In reality, the stencil is really ugly in 3D so I just used a simpler elliptic operator. This works okay, but it's performance decays significantly as I increase the continuation parameter. Also, dealing with general boundary conditions is rather tricky and it seems to be a much weaker preconditioner when I have mixed boundary conditions. To rectify this, I tried a finite element discretization on the Chebyshev nodes (using Q1 elements). This must be scaled by the inverse (lumped) mass matrix due to the collocation nature of the spectral method. Strangely, even though it captures all the terms in the Jacobian, it is slightly weaker than the finite difference version. At least it is less error-prone and boundary conditions are easier to get right. Regardless, forming the explicit matrix separately from the spectral matrix causes a duplication of concepts that have to be kept in sync. So I started thinking, the spectral matrix is pretty cheap to apply a few times, so perhaps I can use a coloring to compute a sparse approximation. However, the documentation I found is using the function from the SNES context to form the matrix. In my case, the entire Jacobian doesn't help, I just want an approximation of A. (A itself is full, but implemented via FFT.) What is the correct way to do this? Should I just stick with finite differences or finite elements? Also, any ideas for preconditioning S? It's condition number also grows significantly with the continuation parameter. Thanks, Jed -------------- next part -------------- A non-text attachment was scrubbed... 
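For readers unfamiliar with the setup Jed describes: the Schur complement is usually exposed to KSP as a shell matrix whose multiply applies B1', an inner solve with A, and B2 in sequence (A^{-1} is written out explicitly here; Jed's shorthand -B2*A*B1' presumably denotes the same thing, since the Schur complement of [A B1'; B2 0] involves the inverse of A). A rough sketch with the 2.3.3-era API; the context struct and names are illustrative:

    #include "petscksp.h"

    typedef struct { Mat B1,B2; KSP innerA; Vec t1,t2; } SchurCtx;

    /* y = S x = -B2 A^{-1} B1' x   (sketch; error checking omitted) */
    PetscErrorCode SchurMult(Mat S,Vec x,Vec y)
    {
      SchurCtx *ctx;
      MatShellGetContext(S,(void**)&ctx);
      MatMultTranspose(ctx->B1,x,ctx->t1);    /* t1 = B1' x     */
      KSPSolve(ctx->innerA,ctx->t1,ctx->t2);  /* t2 = A^{-1} t1 */
      MatMult(ctx->B2,ctx->t2,y);             /* y  = B2 t2     */
      VecScale(y,-1.0);
      return 0;
    }

    PetscErrorCode CreateSchurShell(PetscInt m,PetscInt n,PetscInt M,PetscInt N,
                                    SchurCtx *ctx,Mat *S)
    {
      MatCreateShell(PETSC_COMM_WORLD,m,n,M,N,(void*)ctx,S);
      MatShellSetOperation(*S,MATOP_MULT,(void(*)(void))SchurMult);
      return 0;
    }

Since KSP only ever sees MatMult for such an S, whatever preconditions it has to be supplied separately, which is exactly the difficulty Jed raises.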
Name: not available Type: application/pgp-signature Size: 197 bytes Desc: not available URL: From Amit.Itagi at seagate.com Wed Apr 9 14:38:56 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 9 Apr 2008 15:38:56 -0400 Subject: DA question In-Reply-To: <47FD1225.4020704@chalmers.se> Message-ID: Hi Berend, A detailed explanation of the finite difference scheme is given here : http://en.wikipedia.org/wiki/Finite-difference_time-domain_method Thanks Rgds, Amit Berend van Wachem To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: DA question 04/09/2008 02:59 PM Please respond to petsc-users at mcs.a nl.gov Dear Amit, Could you explain how the two grids are attached? I am using multiple DA's for multiple structured grids glued together. I've done the gluing with setting up various IS objects. From the multiple DA's, one global variable vector is formed. Is that what you are looking for? Best regards, Berend. Amit.Itagi at seagate.com wrote: > Hi, > > Is it possible to use DA to perform finite differences on two staggered > regular grids (as in the electromagnetic finite difference time domain > method) ? Surrounding nodes from one grid are used to update the value in > the dual grid. In addition, local manipulations need to be done on the > nodal values. > > Thanks > > Rgds, > Amit > From schuang at ats.ucla.edu Wed Apr 9 14:18:42 2008 From: schuang at ats.ucla.edu (Shao-Ching Huang) Date: Wed, 9 Apr 2008 12:18:42 -0700 Subject: about MatMatMultTranspose_seqdense_seqdense() In-Reply-To: <7ff0ee010804091002u69766386gfb1b31c1ede7b755@mail.gmail.com> References: <7ff0ee010804091002u69766386gfb1b31c1ede7b755@mail.gmail.com> Message-ID: <20080409191842.GA24448@ats.ucla.edu> In BLAS API, the eight parameter for DGEMM is the physically-allocated leading (in Fortran sense) dimension of matrix B, as in C=(alpha)*A*B + (beta)*C. See the comments in http://www.netlib.org/blas/dgemm.f Shao-Ching On Wed, Apr 09, 2008 at 10:02:41AM -0700, Yujie wrote: > hi, everyone > > My codes are as follows: > > ierr=MatGetSubMatrices(tempM_mat,1,&is_row,&is_col,MAT_INITIAL_MATRIX,&tempA_mat); > CHKERRQ(ierr); > A_mat=*tempA_mat; > ierr=MatDestroy(tempM_mat);CHKERRQ(ierr); > ierr=MatGetSize(A_mat,&M,&N);CHKERRQ(ierr); > //AtA > ierr=MatMatMultTranspose(A_mat,A_mat,MAT_INITIAL_MATRIX,fill,&AtA_mat); > > I get a seqdense submatrix "A_mat" by > MatGetSubMatrices(). I further get At*A by MatMatMultTranspose(). > However, I meet an error: > " ** On entry to DGEMM parameter number 8 had an illegal value" > > I debug my codes. > In MatMatMultTranspose_seqdense_seqdense(), the codes call > "BLASgemm_("T","N",&m,&n,&k,&_DOne,a->v,&a->lda,b->v,&b->lda,&_DZero,c->v,&c->lda);" > I don't know the meaning of the 8th parameters"&a->lda". In my codes, its > value is "0". > > Are there any problems in my codes? could you give me some advice? thanks a lot. > > Regards, > Yujie From bsmith at mcs.anl.gov Wed Apr 9 15:10:23 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 9 Apr 2008 15:10:23 -0500 Subject: Forming a sparse approximation of a MatShell In-Reply-To: <20080409191337.GA6137@brakk.ethz.ch> References: <20080409191337.GA6137@brakk.ethz.ch> Message-ID: Jed, The Mat coloring code can also be used directly, not through SNES. Once you have the coloring for the matrix (you can get that with MatGetColoring(), of course, this assumes you have already set a nonzero pattern for your matrix)). 
Call MatFDColoringCreate() then MatFDColoringSetFunction(), MatFDColoringSetFromOptions() and then MatFDColoringApply(). Good luck, Barry On Apr 9, 2008, at 2:13 PM, Jed Brown wrote: > I'm trying to improve the preconditioning of my spectral collocation > method for > non-Newtonian incompressible Stokes flow. My current algorithm uses > MatShell > for the full Jacobian as well as each of its blocks [A B1'; B2 0] > and the Schur > complement S = -B2*A*B1'. I needed a preconditioner for A so I > thought I'd > solve the same problem using finite differences on the Chebyshev > nodes. In > reality, the stencil is really ugly in 3D so I just used a simpler > elliptic > operator. This works okay, but it's performance decays > significantly as I > increase the continuation parameter. Also, dealing with general > boundary > conditions is rather tricky and it seems to be a much weaker > preconditioner when > I have mixed boundary conditions. To rectify this, I tried a finite > element > discretization on the Chebyshev nodes (using Q1 elements). This > must be scaled > by the inverse (lumped) mass matrix due to the collocation nature of > the > spectral method. Strangely, even though it captures all the terms > in the > Jacobian, it is slightly weaker than the finite difference version. > At least it > is less error-prone and boundary conditions are easier to get right. > Regardless, forming the explicit matrix separately from the spectral > matrix > causes a duplication of concepts that have to be kept in sync. So I > started > thinking, the spectral matrix is pretty cheap to apply a few times, > so perhaps I > can use a coloring to compute a sparse approximation. However, the > documentation I found is using the function from the SNES context to > form the > matrix. In my case, the entire Jacobian doesn't help, I just want an > approximation of A. (A itself is full, but implemented via FFT.) > What is the > correct way to do this? Should I just stick with finite differences > or finite > elements? > > Also, any ideas for preconditioning S? It's condition number also > grows > significantly with the continuation parameter. > > Thanks, > > Jed From rlmackie862 at gmail.com Wed Apr 9 15:09:59 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Wed, 09 Apr 2008 13:09:59 -0700 Subject: DA question In-Reply-To: References: Message-ID: <47FD2297.1010602@gmail.com> Hi Amit, Why do you need two staggered grids? I do EM finite difference frequency domain modeling on a staggered grid using just one DA. Works perfectly fine. There are some grid points that are not used, but you just set them to zero and put a 1 on the diagonal of the coefficient matrix. Randy Amit.Itagi at seagate.com wrote: > Hi Berend, > > A detailed explanation of the finite difference scheme is given here : > > http://en.wikipedia.org/wiki/Finite-difference_time-domain_method > > > Thanks > > Rgds, > Amit > > > > > Berend van Wachem > se> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: DA question > > > 04/09/2008 02:59 > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > Dear Amit, > > Could you explain how the two grids are attached? > I am using multiple DA's for multiple structured grids glued together. > I've done the gluing with setting up various IS objects. From the > multiple DA's, one global variable vector is formed. Is that what you > are looking for? > > Best regards, > > Berend. 
> > > Amit.Itagi at seagate.com wrote: >> Hi, >> >> Is it possible to use DA to perform finite differences on two staggered >> regular grids (as in the electromagnetic finite difference time domain >> method) ? Surrounding nodes from one grid are used to update the value in >> the dual grid. In addition, local manipulations need to be done on the >> nodal values. >> >> Thanks >> >> Rgds, >> Amit >> > > > From jinzishuai at yahoo.com Wed Apr 9 15:25:30 2008 From: jinzishuai at yahoo.com (Shi Jin) Date: Wed, 9 Apr 2008 13:25:30 -0700 (PDT) Subject: Further question about PC with Jaocbi Row Sum Message-ID: <780632.38038.qm@web36201.mail.mud.yahoo.com> Thank you very much. > > Is there something particular about this rowsum method? > > No. If you use a -ksp_rtol of 1.e-12 and still get different > answers, this needs to be investigated. > > I have tried even with -ksp_rtol 1.e-20 but still got different results. Here is what I got when solving the mass matrix with -pc_type jacobi -pc_jacobi_rowsum 1 -ksp_type cg -sub_pc_type icc -ksp_rtol 1.e-20 -ksp_monitor -ksp_view 0 KSP Residual norm 2.975203858623e+00 1 KSP Residual norm 2.674371671721e-01 2 KSP Residual norm 1.841074927355e-01 KSP Object: type: cg maximum iterations=10000, initial guess is zero tolerances: relative=1e-20, absolute=1e-50, divergence=10000 left preconditioning PC Object: type: jacobi linear system matrix = precond matrix: Matrix Object: type=seqaij, rows=8775, cols=8775 total: nonzeros=214591, allocated nonzeros=214591 not using I-node routines I realize that the iteration ended when the residual norm is quite large. Do you think this indicates something wrong here? Thank you again. Shi __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From knepley at gmail.com Wed Apr 9 15:50:29 2008 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 9 Apr 2008 15:50:29 -0500 Subject: Further question about PC with Jaocbi Row Sum In-Reply-To: <780632.38038.qm@web36201.mail.mud.yahoo.com> References: <780632.38038.qm@web36201.mail.mud.yahoo.com> Message-ID: On Wed, Apr 9, 2008 at 3:25 PM, Shi Jin wrote: > Thank you very much. > > > > > > Is there something particular about this rowsum method? > > > > No. If you use a -ksp_rtol of 1.e-12 and still get different > > answers, this needs to be investigated. > > > > > > I have tried even with -ksp_rtol 1.e-20 but still got different results. > > Here is what I got when solving the mass matrix with > > -pc_type jacobi > -pc_jacobi_rowsum 1 > -ksp_type cg > -sub_pc_type icc > -ksp_rtol 1.e-20 > -ksp_monitor > -ksp_view > > 0 KSP Residual norm 2.975203858623e+00 > 1 KSP Residual norm 2.674371671721e-01 > 2 KSP Residual norm 1.841074927355e-01 > KSP Object: > type: cg > maximum iterations=10000, initial guess is zero > tolerances: relative=1e-20, absolute=1e-50, divergence=10000 > left preconditioning > PC Object: > type: jacobi > linear system matrix = precond matrix: > Matrix Object: > type=seqaij, rows=8775, cols=8775 > total: nonzeros=214591, allocated nonzeros=214591 > not using I-node routines > > I realize that the iteration ended when the residual norm is quite large. > Do you think this indicates something wrong here? Can you run with -ksp_converged_reason It appears that the solve fails rather than terminates with an answer. Is it possible that your matrix is not SPD? Matt > Thank you again. > > Shi > > > > __________________________________________________ > Do You Yahoo!? 
> Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From Amit.Itagi at seagate.com Wed Apr 9 16:06:18 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 9 Apr 2008 17:06:18 -0400 Subject: DA question In-Reply-To: <47FD2297.1010602@gmail.com> Message-ID: Randy, I guess, since you are doing a frequency domain calculation, you eventually end up with a single matrix equation. I am planning to work in the time domain. Will that change things ? Thanks Rgds, Amit Randall Mackie To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: DA question 04/09/2008 04:09 PM Please respond to petsc-users at mcs.a nl.gov Hi Amit, Why do you need two staggered grids? I do EM finite difference frequency domain modeling on a staggered grid using just one DA. Works perfectly fine. There are some grid points that are not used, but you just set them to zero and put a 1 on the diagonal of the coefficient matrix. Randy Amit.Itagi at seagate.com wrote: > Hi Berend, > > A detailed explanation of the finite difference scheme is given here : > > http://en.wikipedia.org/wiki/Finite-difference_time-domain_method > > > Thanks > > Rgds, > Amit > > > > > Berend van Wachem > se> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: DA question > > > 04/09/2008 02:59 > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > Dear Amit, > > Could you explain how the two grids are attached? > I am using multiple DA's for multiple structured grids glued together. > I've done the gluing with setting up various IS objects. From the > multiple DA's, one global variable vector is formed. Is that what you > are looking for? > > Best regards, > > Berend. > > > Amit.Itagi at seagate.com wrote: >> Hi, >> >> Is it possible to use DA to perform finite differences on two staggered >> regular grids (as in the electromagnetic finite difference time domain >> method) ? Surrounding nodes from one grid are used to update the value in >> the dual grid. In addition, local manipulations need to be done on the >> nodal values. >> >> Thanks >> >> Rgds, >> Amit >> > > > From bsmith at mcs.anl.gov Wed Apr 9 16:35:34 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 9 Apr 2008 16:35:34 -0500 Subject: about MatMatMultTranspose_seqdense_seqdense() In-Reply-To: <7ff0ee010804091002u69766386gfb1b31c1ede7b755@mail.gmail.com> References: <7ff0ee010804091002u69766386gfb1b31c1ede7b755@mail.gmail.com> Message-ID: Please send us a compilable code that reproduces this problem to petsc-maint at mcs.anl.gov Barry I cannot reproduce it. On Apr 9, 2008, at 12:02 PM, Yujie wrote: > hi, everyone > > My codes are as follows: > ierr=MatGetSubMatrices(tempM_mat, > 1,&is_row,&is_col,MAT_INITIAL_MATRIX,&tempA_mat); CHKERRQ(ierr); > A_mat=*tempA_mat; > ierr=MatDestroy(tempM_mat);CHKERRQ(ierr); > ierr=MatGetSize(A_mat,&M,&N);CHKERRQ(ierr); > //AtA > > ierr > =MatMatMultTranspose(A_mat,A_mat,MAT_INITIAL_MATRIX,fill,&AtA_mat); > > I get a seqdense submatrix "A_mat" by MatGetSubMatrices(). I further > get At*A by MatMatMultTranspose(). However, I meet an error: > " ** On entry to DGEMM parameter number 8 had an illegal value" > > I debug my codes. 
In MatMatMultTranspose_seqdense_seqdense(), the > codes call > "BLASgemm_("T","N",&m,&n,&k,&_DOne,a->v,&a->lda,b->v,&b- > >lda,&_DZero,c->v,&c->lda);" > I don't know the meaning of the 8th parameters"&a->lda". In my > codes, its value is "0". > > Are there any problems in my codes? could you give me some advice? > thanks a lot. > > Regards, > Yujie From sdettrick at gmail.com Wed Apr 9 16:36:05 2008 From: sdettrick at gmail.com (Sean Dettrick) Date: Wed, 9 Apr 2008 17:36:05 -0400 Subject: DA question In-Reply-To: References: <47FD2297.1010602@gmail.com> Message-ID: <44114ec40804091436o25657b1eua89cba52848d5717@mail.gmail.com> To elaborate on Matt's suggestion, a staggered grid/Yee mesh code could use a single DA with one degree-of-freedom per component of H and E. The extra overlap required for staggered guard cells at the domain boundaries could be dealt with by having a bigger-than-usual stencil width. For the 2nd order 3D case, this suggests the DACreate3d routine would have arguments dof=6, s=2, and stencil_type=DA_STENCIL_STAR. It is just a suggestion - I have not tried it. Sean On Wed, Apr 9, 2008 at 5:06 PM, wrote: > Randy, > > I guess, since you are doing a frequency domain calculation, you eventually > end up with a single matrix equation. > > I am planning to work in the time domain. Will that change things ? > > Thanks > > Rgds, > Amit > > > > > Randall Mackie > l.com> To > > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: DA question > > > 04/09/2008 04:09 > > > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > Hi Amit, > > Why do you need two staggered grids? I do EM finite difference frequency > domain modeling on a staggered grid using just one DA. Works perfectly > fine. > There are some grid points that are not used, but you just set them to zero > and put a 1 on the diagonal of the coefficient matrix. > > > Randy > > > Amit.Itagi at seagate.com wrote: > > Hi Berend, > > > > A detailed explanation of the finite difference scheme is given here : > > > > http://en.wikipedia.org/wiki/Finite-difference_time-domain_method > > > > > > Thanks > > > > Rgds, > > Amit > > > > > > > > > > > Berend van Wachem > > > > > se> > To > > Sent by: petsc-users at mcs.anl.gov > > > owner-petsc-users > cc > > @mcs.anl.gov > > > No Phone Info > Subject > > Available Re: DA question > > > > > > > > > 04/09/2008 02:59 > > > PM > > > > > > > > > Please respond to > > > petsc-users at mcs.a > > > nl.gov > > > > > > > > > > > > > > > > > Dear Amit, > > > > Could you explain how the two grids are attached? > > I am using multiple DA's for multiple structured grids glued together. > > I've done the gluing with setting up various IS objects. From the > > multiple DA's, one global variable vector is formed. Is that what you > > are looking for? > > > > Best regards, > > > > Berend. > > > > > > Amit.Itagi at seagate.com wrote: > >> Hi, > >> > >> Is it possible to use DA to perform finite differences on two staggered > >> regular grids (as in the electromagnetic finite difference time domain > >> method) ? Surrounding nodes from one grid are used to update the value > in > >> the dual grid. In addition, local manipulations need to be done on the > >> nodal values. 
> >> > >> Thanks > >> > >> Rgds, > >> Amit > >> > > > > > > > > > > From jed at 59A2.org Wed Apr 9 16:40:06 2008 From: jed at 59A2.org (Jed Brown) Date: Wed, 9 Apr 2008 23:40:06 +0200 Subject: Forming a sparse approximation of a MatShell In-Reply-To: References: <20080409191337.GA6137@brakk.ethz.ch> Message-ID: <20080409214006.GB6137@brakk.ethz.ch> On Wed 2008-04-09 15:10, Barry Smith wrote: > > Jed, > > The Mat coloring code can also be used directly, not through SNES. Once > you have > the coloring for the matrix (you can get that with MatGetColoring(), of > course, this assumes > you have already set a nonzero pattern for your matrix)). Call > MatFDColoringCreate() > then MatFDColoringSetFunction(), MatFDColoringSetFromOptions() and then > MatFDColoringApply(). Cool, I tried this and I can confirm that it is generating the correct matrix (by comparing entries with the output of -snes_fd) but unfortunately the matrix entries of the spectral operator corresponding to neighbors are actually not a very good approximation of the full operator. Bummer. It looks like I'm stuck with formulating the problem twice, once for the spectral operators and once for the FD/FE preconditioner. Thanks for the help. Jed > On Apr 9, 2008, at 2:13 PM, Jed Brown wrote: >> I'm trying to improve the preconditioning of my spectral collocation >> method for >> non-Newtonian incompressible Stokes flow. My current algorithm uses >> MatShell >> for the full Jacobian as well as each of its blocks [A B1'; B2 0] and the >> Schur >> complement S = -B2*A*B1'. I needed a preconditioner for A so I thought >> I'd >> solve the same problem using finite differences on the Chebyshev nodes. >> In >> reality, the stencil is really ugly in 3D so I just used a simpler >> elliptic >> operator. This works okay, but it's performance decays significantly as I >> increase the continuation parameter. Also, dealing with general boundary >> conditions is rather tricky and it seems to be a much weaker >> preconditioner when >> I have mixed boundary conditions. To rectify this, I tried a finite >> element >> discretization on the Chebyshev nodes (using Q1 elements). This must be >> scaled >> by the inverse (lumped) mass matrix due to the collocation nature of the >> spectral method. Strangely, even though it captures all the terms in the >> Jacobian, it is slightly weaker than the finite difference version. At >> least it >> is less error-prone and boundary conditions are easier to get right. >> Regardless, forming the explicit matrix separately from the spectral >> matrix >> causes a duplication of concepts that have to be kept in sync. So I >> started >> thinking, the spectral matrix is pretty cheap to apply a few times, so >> perhaps I >> can use a coloring to compute a sparse approximation. However, the >> documentation I found is using the function from the SNES context to form >> the >> matrix. In my case, the entire Jacobian doesn't help, I just want an >> approximation of A. (A itself is full, but implemented via FFT.) What is >> the >> correct way to do this? Should I just stick with finite differences or >> finite >> elements? >> >> Also, any ideas for preconditioning S? It's condition number also grows >> significantly with the continuation parameter. >> >> Thanks, >> >> Jed > -------------- next part -------------- A non-text attachment was scrubbed... 
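For reference, a minimal sketch of the coloring sequence Barry describes above, written against the PETSc C interface of that era (the matrix J is assumed to already have its nonzero pattern set; MyResidual, its contexts, and the coloring type are placeholders, and the exact calling sequences of MatFDColoringSetFunction()/MatFDColoringApply() vary between PETSc versions, so they should be checked against the man pages of the installed release):

  #include "petscmat.h"

  /* placeholder residual evaluation; the first and last arguments are the contexts
     handed to MatFDColoringApply() and MatFDColoringSetFunction() respectively */
  extern PetscErrorCode MyResidual(void*,Vec,Vec,void*);

  PetscErrorCode BuildFDJacobian(Mat J, Vec x, void *ctx)
  {
    ISColoring     iscoloring;
    MatFDColoring  fdcoloring;
    MatStructure   flag;
    PetscErrorCode ierr;

    /* color the already preallocated sparse matrix J (any coloring type will do) */
    ierr = MatGetColoring(J, MATCOLORING_SL, &iscoloring); CHKERRQ(ierr);
    ierr = MatFDColoringCreate(J, iscoloring, &fdcoloring); CHKERRQ(ierr);
    ierr = ISColoringDestroy(iscoloring); CHKERRQ(ierr);

    /* register the function to be differenced and pick up any -mat_fd_* options */
    ierr = MatFDColoringSetFunction(fdcoloring,
             (PetscErrorCode (*)(void)) MyResidual, ctx); CHKERRQ(ierr);
    ierr = MatFDColoringSetFromOptions(fdcoloring); CHKERRQ(ierr);

    /* fill J with the finite-difference approximation of the Jacobian at x */
    ierr = MatFDColoringApply(J, fdcoloring, x, &flag, ctx); CHKERRQ(ierr);

    ierr = MatFDColoringDestroy(fdcoloring); CHKERRQ(ierr);
    return 0;
  }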
Name: not available Type: application/pgp-signature Size: 197 bytes Desc: not available URL: From rlmackie862 at gmail.com Wed Apr 9 17:44:15 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Wed, 09 Apr 2008 15:44:15 -0700 Subject: DA question In-Reply-To: <44114ec40804091436o25657b1eua89cba52848d5717@mail.gmail.com> References: <47FD2297.1010602@gmail.com> <44114ec40804091436o25657b1eua89cba52848d5717@mail.gmail.com> Message-ID: <47FD46BF.3050901@gmail.com> Amit, I have a staggered grid with H defined along the edges and E as normals across the block faces. So if you have l x m x n blocks, then you need to define your DA as l+1, m+1, n+1, to handle the extra grid point you need for the staggered grid. I use 3 degrees of freedom (for Hx, Hy, and Hz), and all my local calculations just need the box stencil. Randy Sean Dettrick wrote: > To elaborate on Matt's suggestion, a staggered grid/Yee mesh code > could use a single DA with one degree-of-freedom per component of H > and E. The extra overlap required for staggered guard cells at the > domain boundaries could be dealt with by having a bigger-than-usual > stencil width. For the 2nd order 3D case, this suggests the > DACreate3d routine would have arguments dof=6, s=2, and > stencil_type=DA_STENCIL_STAR. > > It is just a suggestion - I have not tried it. > > Sean > > On Wed, Apr 9, 2008 at 5:06 PM, wrote: >> Randy, >> >> I guess, since you are doing a frequency domain calculation, you eventually >> end up with a single matrix equation. >> >> I am planning to work in the time domain. Will that change things ? >> >> Thanks >> >> Rgds, >> Amit >> >> >> >> >> Randall Mackie >> > l.com> To >> >> Sent by: petsc-users at mcs.anl.gov >> owner-petsc-users cc >> @mcs.anl.gov >> No Phone Info Subject >> Available Re: DA question >> >> >> 04/09/2008 04:09 >> >> >> PM >> >> >> Please respond to >> petsc-users at mcs.a >> nl.gov >> >> >> >> >> >> >> Hi Amit, >> >> Why do you need two staggered grids? I do EM finite difference frequency >> domain modeling on a staggered grid using just one DA. Works perfectly >> fine. >> There are some grid points that are not used, but you just set them to zero >> and put a 1 on the diagonal of the coefficient matrix. >> >> >> Randy >> >> >> Amit.Itagi at seagate.com wrote: >> > Hi Berend, >> > >> > A detailed explanation of the finite difference scheme is given here : >> > >> > http://en.wikipedia.org/wiki/Finite-difference_time-domain_method >> > >> > >> > Thanks >> > >> > Rgds, >> > Amit >> > >> > >> > >> > >> >> > Berend van Wachem >> >> > > >> > se> >> To >> > Sent by: petsc-users at mcs.anl.gov >> >> > owner-petsc-users >> cc >> > @mcs.anl.gov >> >> > No Phone Info >> Subject >> > Available Re: DA question >> >> > >> >> > >> >> > 04/09/2008 02:59 >> >> > PM >> >> > >> >> > >> >> > Please respond to >> >> > petsc-users at mcs.a >> >> > nl.gov >> >> > >> >> > >> >> > >> > >> > >> > >> > Dear Amit, >> > >> > Could you explain how the two grids are attached? >> > I am using multiple DA's for multiple structured grids glued together. >> > I've done the gluing with setting up various IS objects. From the >> > multiple DA's, one global variable vector is formed. Is that what you >> > are looking for? >> > >> > Best regards, >> > >> > Berend. >> > >> > >> > Amit.Itagi at seagate.com wrote: >> >> Hi, >> >> >> >> Is it possible to use DA to perform finite differences on two staggered >> >> regular grids (as in the electromagnetic finite difference time domain >> >> method) ? 
Surrounding nodes from one grid are used to update the value >> in >> >> the dual grid. In addition, local manipulations need to be done on the >> >> nodal values. >> >> >> >> Thanks >> >> >> >> Rgds, >> >> Amit >> >> >> > >> > >> > >> >> >> >> > From jinzishuai at yahoo.com Thu Apr 10 00:04:03 2008 From: jinzishuai at yahoo.com (Shi Jin) Date: Wed, 9 Apr 2008 22:04:03 -0700 (PDT) Subject: Further question about PC with Jaocbi Row Sum Message-ID: <149510.10833.qm@web36208.mail.mud.yahoo.com> Thank you. I have used the -ksp_converged_reason option. The result says: Linear solve did not converge due to DIVERGED_INDEFINITE_PC iterations 2 I then further checked the row sum matrix, it has negative eigenvalues. So I guess it does not work at all. Thank you all for your help. -- Shi Jin, PhD ----- Original Message ---- > From: Matthew Knepley > To: petsc-users at mcs.anl.gov > Sent: Wednesday, April 9, 2008 2:50:29 PM > Subject: Re: Further question about PC with Jaocbi Row Sum > > On Wed, Apr 9, 2008 at 3:25 PM, Shi Jin wrote: > > Thank you very much. > > > > > > > > > > Is there something particular about this rowsum method? > > > > > > No. If you use a -ksp_rtol of 1.e-12 and still get different > > > answers, this needs to be investigated. > > > > > > > > > > I have tried even with -ksp_rtol 1.e-20 but still got different results. > > > > Here is what I got when solving the mass matrix with > > > > -pc_type jacobi > > -pc_jacobi_rowsum 1 > > -ksp_type cg > > -sub_pc_type icc > > -ksp_rtol 1.e-20 > > -ksp_monitor > > -ksp_view > > > > 0 KSP Residual norm 2.975203858623e+00 > > 1 KSP Residual norm 2.674371671721e-01 > > 2 KSP Residual norm 1.841074927355e-01 > > KSP Object: > > type: cg > > maximum iterations=10000, initial guess is zero > > tolerances: relative=1e-20, absolute=1e-50, divergence=10000 > > left preconditioning > > PC Object: > > type: jacobi > > linear system matrix = precond matrix: > > Matrix Object: > > type=seqaij, rows=8775, cols=8775 > > total: nonzeros=214591, allocated nonzeros=214591 > > not using I-node routines > > > > I realize that the iteration ended when the residual norm is quite large. > > Do you think this indicates something wrong here? > > Can you run with > > -ksp_converged_reason > > It appears that the solve fails rather than terminates with an answer. Is it > possible that your matrix is not SPD? > > Matt > > > Thank you again. > > > > Shi > > > > > > > > __________________________________________________ > > Do You Yahoo!? > > Tired of spam? Yahoo! Mail has the best spam protection around > > http://mail.yahoo.com > > > > > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From Amit.Itagi at seagate.com Thu Apr 10 08:10:34 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Thu, 10 Apr 2008 09:10:34 -0400 Subject: DA question In-Reply-To: <47FD46BF.3050901@gmail.com> Message-ID: Randy/Sean/Matt, Thanks for the suggestions. I will try to implement the algorithm on the suggested lines. 
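For reference, a rough sketch of the single-DA layout Randy describes above (one extra point in each direction, 3 degrees of freedom for Hx/Hy/Hz, box stencil), assuming the DACreate3d() calling sequence of that PETSc generation; l, m, n are the block counts and all names are placeholders. Sean's variant would instead pass dof=6 and a width-2 DA_STENCIL_STAR stencil:

  #include "petscda.h"

  /* one distributed array holds the whole staggered (Yee-type) grid */
  PetscErrorCode CreateStaggeredDA(PetscInt l, PetscInt m, PetscInt n, DA *da)
  {
    PetscErrorCode ierr;

    /* l x m x n blocks -> (l+1) x (m+1) x (n+1) grid points, 3 dof, stencil width 1 */
    ierr = DACreate3d(PETSC_COMM_WORLD, DA_NONPERIODIC, DA_STENCIL_BOX,
                      l+1, m+1, n+1,
                      PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                      3, 1, PETSC_NULL, PETSC_NULL, PETSC_NULL, da); CHKERRQ(ierr);
    return 0;
  }

Ghost values for the local stencil updates would then come from DACreateGlobalVector() followed by DAGlobalToLocalBegin()/DAGlobalToLocalEnd(), and unused staggered points can be handled the way Randy suggests, by setting them to zero and putting a 1 on the diagonal of the coefficient matrix.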
Rgds, Amit Randall Mackie To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: DA question 04/09/2008 06:44 PM Please respond to petsc-users at mcs.a nl.gov Amit, I have a staggered grid with H defined along the edges and E as normals across the block faces. So if you have l x m x n blocks, then you need to define your DA as l+1, m+1, n+1, to handle the extra grid point you need for the staggered grid. I use 3 degrees of freedom (for Hx, Hy, and Hz), and all my local calculations just need the box stencil. Randy Sean Dettrick wrote: > To elaborate on Matt's suggestion, a staggered grid/Yee mesh code > could use a single DA with one degree-of-freedom per component of H > and E. The extra overlap required for staggered guard cells at the > domain boundaries could be dealt with by having a bigger-than-usual > stencil width. For the 2nd order 3D case, this suggests the > DACreate3d routine would have arguments dof=6, s=2, and > stencil_type=DA_STENCIL_STAR. > > It is just a suggestion - I have not tried it. > > Sean > > On Wed, Apr 9, 2008 at 5:06 PM, wrote: >> Randy, >> >> I guess, since you are doing a frequency domain calculation, you eventually >> end up with a single matrix equation. >> >> I am planning to work in the time domain. Will that change things ? >> >> Thanks >> >> Rgds, >> Amit >> >> >> >> >> Randall Mackie >> > l.com> To >> >> Sent by: petsc-users at mcs.anl.gov >> owner-petsc-users cc >> @mcs.anl.gov >> No Phone Info Subject >> Available Re: DA question >> >> >> 04/09/2008 04:09 >> >> >> PM >> >> >> Please respond to >> petsc-users at mcs.a >> nl.gov >> >> >> >> >> >> >> Hi Amit, >> >> Why do you need two staggered grids? I do EM finite difference frequency >> domain modeling on a staggered grid using just one DA. Works perfectly >> fine. >> There are some grid points that are not used, but you just set them to zero >> and put a 1 on the diagonal of the coefficient matrix. >> >> >> Randy >> >> >> Amit.Itagi at seagate.com wrote: >> > Hi Berend, >> > >> > A detailed explanation of the finite difference scheme is given here : >> > >> > http://en.wikipedia.org/wiki/Finite-difference_time-domain_method >> > >> > >> > Thanks >> > >> > Rgds, >> > Amit >> > >> > >> > >> > >> >> > Berend van Wachem >> >> > > >> > se> >> To >> > Sent by: petsc-users at mcs.anl.gov >> >> > owner-petsc-users >> cc >> > @mcs.anl.gov >> >> > No Phone Info >> Subject >> > Available Re: DA question >> >> > >> >> > >> >> > 04/09/2008 02:59 >> >> > PM >> >> > >> >> > >> >> > Please respond to >> >> > petsc-users at mcs.a >> >> > nl.gov >> >> > >> >> > >> >> > >> > >> > >> > >> > Dear Amit, >> > >> > Could you explain how the two grids are attached? >> > I am using multiple DA's for multiple structured grids glued together. >> > I've done the gluing with setting up various IS objects. From the >> > multiple DA's, one global variable vector is formed. Is that what you >> > are looking for? >> > >> > Best regards, >> > >> > Berend. >> > >> > >> > Amit.Itagi at seagate.com wrote: >> >> Hi, >> >> >> >> Is it possible to use DA to perform finite differences on two staggered >> >> regular grids (as in the electromagnetic finite difference time domain >> >> method) ? Surrounding nodes from one grid are used to update the value >> in >> >> the dual grid. In addition, local manipulations need to be done on the >> >> nodal values. 
>> >> >> >> Thanks >> >> >> >> Rgds, >> >> Amit >> >> >> > >> > >> > >> >> >> >> > From hzhang at mcs.anl.gov Thu Apr 10 09:01:13 2008 From: hzhang at mcs.anl.gov (Hong Zhang) Date: Thu, 10 Apr 2008 09:01:13 -0500 (CDT) Subject: Further question about PC with Jaocbi Row Sum In-Reply-To: <149510.10833.qm@web36208.mail.mud.yahoo.com> References: <149510.10833.qm@web36208.mail.mud.yahoo.com> Message-ID: Then you may try direct sparse linear solver, sequential run: -ksp_type preonly -pc_type cholesky parallel run (install external packages superlu_dist or mumps): -ksp_type preonly -pc_type lu -mat_type superlu_dist or -ksp_type preonly -pc_type cholesky -mat_type sbaijmumps Hong On Wed, 9 Apr 2008, Shi Jin wrote: > > Thank you. I have used the -ksp_converged_reason option. > The result says: > Linear solve did not converge due to DIVERGED_INDEFINITE_PC iterations 2 > I then further checked the row sum matrix, it has negative eigenvalues. > So I guess it does not work at all. > Thank you all for your help. > > -- > Shi Jin, PhD > > ----- Original Message ---- > > From: Matthew Knepley > > To: petsc-users at mcs.anl.gov > > Sent: Wednesday, April 9, 2008 2:50:29 PM > > Subject: Re: Further question about PC with Jaocbi Row Sum > > > > On Wed, Apr 9, 2008 at 3:25 PM, Shi Jin wrote: > > > Thank you very much. > > > > > > > > > > > > > > Is there something particular about this rowsum method? > > > > > > > > No. If you use a -ksp_rtol of 1.e-12 and still get different > > > > answers, this needs to be investigated. > > > > > > > > > > > > > > I have tried even with -ksp_rtol 1.e-20 but still got different results. > > > > > > Here is what I got when solving the mass matrix with > > > > > > -pc_type jacobi > > > -pc_jacobi_rowsum 1 > > > -ksp_type cg > > > -sub_pc_type icc > > > -ksp_rtol 1.e-20 > > > -ksp_monitor > > > -ksp_view > > > > > > 0 KSP Residual norm 2.975203858623e+00 > > > 1 KSP Residual norm 2.674371671721e-01 > > > 2 KSP Residual norm 1.841074927355e-01 > > > KSP Object: > > > type: cg > > > maximum iterations=10000, initial guess is zero > > > tolerances: relative=1e-20, absolute=1e-50, divergence=10000 > > > left preconditioning > > > PC Object: > > > type: jacobi > > > linear system matrix = precond matrix: > > > Matrix Object: > > > type=seqaij, rows=8775, cols=8775 > > > total: nonzeros=214591, allocated nonzeros=214591 > > > not using I-node routines > > > > > > I realize that the iteration ended when the residual norm is quite large. > > > Do you think this indicates something wrong here? > > > > Can you run with > > > > -ksp_converged_reason > > > > It appears that the solve fails rather than terminates with an answer. Is it > > possible that your matrix is not SPD? > > > > Matt > > > > > Thank you again. > > > > > > Shi > > > > > > > > > > > > __________________________________________________ > > > Do You Yahoo!? > > > Tired of spam? Yahoo! Mail has the best spam protection around > > > http://mail.yahoo.com > > > > > > > > > > > > > > -- > > What most experimenters take for granted before they begin their > > experiments is infinitely more interesting than any results to which > > their experiments lead. > > -- Norbert Wiener > > > > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! 
Mail has the best spam protection around > http://mail.yahoo.com > > From bsmith at mcs.anl.gov Thu Apr 10 11:39:21 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Thu, 10 Apr 2008 11:39:21 -0500 Subject: Further question about PC with Jaocbi Row Sum In-Reply-To: <149510.10833.qm@web36208.mail.mud.yahoo.com> References: <149510.10833.qm@web36208.mail.mud.yahoo.com> Message-ID: the row sum option assumes that all the entries of the matrix are positive; this is true to linear elements and mass matrices. If you have negative entries in your mass matrix then I would not trust any kind of mass lumping as a preconditioner. Barry On Apr 10, 2008, at 12:04 AM, Shi Jin wrote: > > Thank you. I have used the -ksp_converged_reason option. > The result says: > Linear solve did not converge due to DIVERGED_INDEFINITE_PC > iterations 2 > I then further checked the row sum matrix, it has negative > eigenvalues. > So I guess it does not work at all. > Thank you all for your help. > > -- > Shi Jin, PhD > > ----- Original Message ---- >> From: Matthew Knepley >> To: petsc-users at mcs.anl.gov >> Sent: Wednesday, April 9, 2008 2:50:29 PM >> Subject: Re: Further question about PC with Jaocbi Row Sum >> >> On Wed, Apr 9, 2008 at 3:25 PM, Shi Jin wrote: >>> Thank you very much. >>> >>> >>> >>>>> Is there something particular about this rowsum method? >>>> >>>> No. If you use a -ksp_rtol of 1.e-12 and still get different >>>> answers, this needs to be investigated. >>>> >>>> >>> >>> I have tried even with -ksp_rtol 1.e-20 but still got different >>> results. >>> >>> Here is what I got when solving the mass matrix with >>> >>> -pc_type jacobi >>> -pc_jacobi_rowsum 1 >>> -ksp_type cg >>> -sub_pc_type icc >>> -ksp_rtol 1.e-20 >>> -ksp_monitor >>> -ksp_view >>> >>> 0 KSP Residual norm 2.975203858623e+00 >>> 1 KSP Residual norm 2.674371671721e-01 >>> 2 KSP Residual norm 1.841074927355e-01 >>> KSP Object: >>> type: cg >>> maximum iterations=10000, initial guess is zero >>> tolerances: relative=1e-20, absolute=1e-50, divergence=10000 >>> left preconditioning >>> PC Object: >>> type: jacobi >>> linear system matrix = precond matrix: >>> Matrix Object: >>> type=seqaij, rows=8775, cols=8775 >>> total: nonzeros=214591, allocated nonzeros=214591 >>> not using I-node routines >>> >>> I realize that the iteration ended when the residual norm is quite >>> large. >>> Do you think this indicates something wrong here? >> >> Can you run with >> >> -ksp_converged_reason >> >> It appears that the solve fails rather than terminates with an >> answer. Is it >> possible that your matrix is not SPD? >> >> Matt >> >>> Thank you again. >>> >>> Shi >>> >>> >>> >>> __________________________________________________ >>> Do You Yahoo!? >>> Tired of spam? Yahoo! Mail has the best spam protection around >>> http://mail.yahoo.com >>> >>> >> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which >> their experiments lead. >> -- Norbert Wiener >> >> > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > From nliu at fit.edu Thu Apr 10 23:28:13 2008 From: nliu at fit.edu (Ningyu Liu) Date: Fri, 11 Apr 2008 00:28:13 -0400 (EDT) Subject: Question on DA Message-ID: <51333.68.202.24.62.1207888093.squirrel@webaccess.fit.edu> Hello, I have a question on DA. 
If I create two DAs using DACreate2D() with the same input except different degrees of freedom, will they share the same communication information. If not, how can I create two DAs corresponding to the same structured grid and communication information but different degrees of freedom. Thank you very much! Ningyu From bsmith at mcs.anl.gov Fri Apr 11 08:24:36 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 11 Apr 2008 08:24:36 -0500 Subject: Question on DA In-Reply-To: <51333.68.202.24.62.1207888093.squirrel@webaccess.fit.edu> References: <51333.68.202.24.62.1207888093.squirrel@webaccess.fit.edu> Message-ID: <3041331F-0906-4076-916A-B120ED60F506@mcs.anl.gov> The default layouts of "grid points" is independent of the number of degree's of freedom per point so each process will get the same "patch" for both DA's. If you are worried about the two DA's having some duplicate information that wastes memory (information that could be shared between the two), don't; the amount of excess data is very small relative to everything else in the code and is not worth worrying about. Barry On Apr 10, 2008, at 11:28 PM, Ningyu Liu wrote: > Hello, > > I have a question on DA. If I create two DAs using DACreate2D() with > the > same input except different degrees of freedom, will they share the > same > communication information. If not, how can I create two DAs > corresponding > to the same structured grid and communication information but > different > degrees of freedom. Thank you very much! > > Ningyu > From jinzishuai at yahoo.com Fri Apr 11 15:56:56 2008 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 11 Apr 2008 13:56:56 -0700 (PDT) Subject: Further question about PC with Jaocbi Row Sum Message-ID: <548693.2032.qm@web36202.mail.mud.yahoo.com> Thank you. Suppose I have a diagonal matrix, what is the best way to invert it in PETSc? Do I have to install the external packages superlu_dist or mumps? I realized that LU or Cholesky decomposition does not work with MPIAIJ matrices. I also know the best way is probably to directly call Vector operations directly. However, I want to keep the same KSPSolve structure so that the same code can be used for non-diagonal MPIAIJ matrices without changing each call to KSPSolve. Thank you very much. Shi > Then you may try direct sparse linear solver, > sequential run: > -ksp_type preonly -pc_type cholesky > parallel run (install external packages superlu_dist or mumps): > -ksp_type preonly -pc_type lu -mat_type superlu_dist > or > -ksp_type preonly -pc_type cholesky -mat_type sbaijmumps > > Hong > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From hzhang at mcs.anl.gov Fri Apr 11 16:19:20 2008 From: hzhang at mcs.anl.gov (Hong Zhang) Date: Fri, 11 Apr 2008 16:19:20 -0500 (CDT) Subject: Further question about PC with Jaocbi Row Sum In-Reply-To: <548693.2032.qm@web36202.mail.mud.yahoo.com> References: <548693.2032.qm@web36202.mail.mud.yahoo.com> Message-ID: Shi, > Suppose I have a diagonal matrix, what is the best way to invert it in PETSc? > Do I have to install the external packages superlu_dist or mumps? > I realized that LU or Cholesky decomposition does not work with MPIAIJ matrices. > I also know the best way is probably to directly call Vector operations directly. > However, I want to keep the same KSPSolve structure so that the same code can be used for non-diagonal MPIAIJ matrices without changing each call to KSPSolve. > Thank you very much. 
Without changing your application code, i.e., keep the same KSPSolve structure, running it with the option '-pc_type jacobi' actually inverts the diagonal matrix, in both sequential and parallel cases. Install external packages superlu_dist or mumps, then run your code in sequential or parallel with -ksp_type preonly -pc_type lu -mat_type superlu_dist (work with mpiaij matrix) >> or >> -ksp_type preonly -pc_type cholesky -mat_type sbaijmumps (work with mpisbaij matrix format). Hong > > Shi >> Then you may try direct sparse linear solver, >> sequential run: >> -ksp_type preonly -pc_type cholesky >> parallel run (install external packages superlu_dist or mumps): >> -ksp_type preonly -pc_type lu -mat_type superlu_dist >> or >> -ksp_type preonly -pc_type cholesky -mat_type sbaijmumps >> >> Hong >> > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > > From bsmith at mcs.anl.gov Fri Apr 11 16:04:12 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 11 Apr 2008 16:04:12 -0500 Subject: Further question about PC with Jaocbi Row Sum In-Reply-To: <548693.2032.qm@web36202.mail.mud.yahoo.com> References: <548693.2032.qm@web36202.mail.mud.yahoo.com> Message-ID: <6BB577AC-01F8-4D52-B13C-82863778450E@mcs.anl.gov> There is no super easy way to do this that I can think of. The diagonal cases you can run with -pc_type jacobi and the nondiagonal with -pc_type lu (or Cholesky) I realize this is not exactly what you want. Barry On Apr 11, 2008, at 3:56 PM, Shi Jin wrote: > Thank you. > Suppose I have a diagonal matrix, what is the best way to invert it > in PETSc? > Do I have to install the external packages superlu_dist or mumps? > I realized that LU or Cholesky decomposition does not work with > MPIAIJ matrices. > I also know the best way is probably to directly call Vector > operations directly. > However, I want to keep the same KSPSolve structure so that the same > code can be used for non-diagonal MPIAIJ matrices without changing > each call to KSPSolve. > Thank you very much. > > Shi >> Then you may try direct sparse linear solver, >> sequential run: >> -ksp_type preonly -pc_type cholesky >> parallel run (install external packages superlu_dist or mumps): >> -ksp_type preonly -pc_type lu -mat_type superlu_dist >> or >> -ksp_type preonly -pc_type cholesky -mat_type sbaijmumps >> >> Hong >> > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > From jinzishuai at yahoo.com Fri Apr 11 16:40:10 2008 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 11 Apr 2008 14:40:10 -0700 (PDT) Subject: Further question about PC with Jaocbi Row Sum Message-ID: <932872.39873.qm@web36207.mail.mud.yahoo.com> Thank you very much. -pc_type jacobi -ksp_type preonly does exactly what I want, even in parallel. Shi ----- Original Message ---- > From: Hong Zhang > >Suppose I have a diagonal matrix, what is the best way to invert it in PETSc? > > Do I have to install the external packages superlu_dist or mumps? > > I realized that LU or Cholesky decomposition does not work with MPIAIJ > matrices. > > I also know the best way is probably to directly call Vector operations > directly. > > However, I want to keep the same KSPSolve structure so that the same code can > be used for non-diagonal MPIAIJ matrices without changing each call to > KSPSolve. > > Thank you very much. 
> > Without changing your application code, i.e., keep the same KSPSolve > structure, > running it with the option > '-pc_type jacobi' > actually inverts the diagonal matrix, in both sequential and parallel > cases. > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From knepley at gmail.com Fri Apr 11 16:04:54 2008 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 11 Apr 2008 16:04:54 -0500 Subject: Further question about PC with Jaocbi Row Sum In-Reply-To: <548693.2032.qm@web36202.mail.mud.yahoo.com> References: <548693.2032.qm@web36202.mail.mud.yahoo.com> Message-ID: On Fri, Apr 11, 2008 at 3:56 PM, Shi Jin wrote: > Thank you. > Suppose I have a diagonal matrix, what is the best way to invert it in PETSc? If you have a diagonal matrix, you just use -ksp_type preonly -pc_type jacobi Matt > Do I have to install the external packages superlu_dist or mumps? > I realized that LU or Cholesky decomposition does not work with MPIAIJ matrices. > I also know the best way is probably to directly call Vector operations directly. > However, I want to keep the same KSPSolve structure so that the same code can be used for non-diagonal MPIAIJ matrices without changing each call to KSPSolve. > Thank you very much. > > Shi > > Then you may try direct sparse linear solver, > > sequential run: > > -ksp_type preonly -pc_type cholesky > > parallel run (install external packages superlu_dist or mumps): > > -ksp_type preonly -pc_type lu -mat_type superlu_dist > > or > > -ksp_type preonly -pc_type cholesky -mat_type sbaijmumps > > > > Hong > > > > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From recrusader at gmail.com Sat Apr 12 12:52:50 2008 From: recrusader at gmail.com (Yujie) Date: Sat, 12 Apr 2008 10:52:50 -0700 Subject: how to create sequential Vec or Mat in parallel mode. Message-ID: <7ff0ee010804121052j1d4d517foae52149ea7b79ac8@mail.gmail.com> Now, I use several processor to run my codes. However, I need to create sequential Vec and Mat. I use VecCreateSeq() to create Vec. I get error information. How to get it? thanks a lot. Regards, Yujie -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Sat Apr 12 13:16:15 2008 From: knepley at gmail.com (Matthew Knepley) Date: Sat, 12 Apr 2008 13:16:15 -0500 Subject: how to create sequential Vec or Mat in parallel mode. In-Reply-To: <7ff0ee010804121052j1d4d517foae52149ea7b79ac8@mail.gmail.com> References: <7ff0ee010804121052j1d4d517foae52149ea7b79ac8@mail.gmail.com> Message-ID: You cannot create a sequential Vec with a parallel communicator. If you truly want a VecSeq, use PETSC_COMM_SELF. Matt On Sat, Apr 12, 2008 at 12:52 PM, Yujie wrote: > Now, I use several processor to run my codes. However, I need to create > sequential Vec and Mat. I use VecCreateSeq() to create Vec. I get error > information. How to get it? thanks a lot. > > Regards, > Yujie > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener From recrusader at gmail.com Sat Apr 12 13:21:27 2008 From: recrusader at gmail.com (Yujie) Date: Sat, 12 Apr 2008 11:21:27 -0700 Subject: how to create sequential Vec or Mat in parallel mode. In-Reply-To: References: <7ff0ee010804121052j1d4d517foae52149ea7b79ac8@mail.gmail.com> Message-ID: <7ff0ee010804121121v7267bc69v421a5f5b1fe495b8@mail.gmail.com> I got it. thanks a lot:) Regards, Yujie On 4/12/08, Matthew Knepley wrote: > > You cannot create a sequential Vec with a parallel communicator. If > you truly want > a VecSeq, use PETSC_COMM_SELF. > > Matt > > > On Sat, Apr 12, 2008 at 12:52 PM, Yujie wrote: > > Now, I use several processor to run my codes. However, I need to create > > sequential Vec and Mat. I use VecCreateSeq() to create Vec. I get error > > information. How to get it? thanks a lot. > > > > Regards, > > Yujie > > > > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zonexo at gmail.com Sun Apr 13 04:12:41 2008 From: zonexo at gmail.com (Ben Tay) Date: Sun, 13 Apr 2008 17:12:41 +0800 Subject: Slow speed after changing from serial to parallel Message-ID: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> Hi, I've a serial 2D CFD code. As my grid size requirement increases, the simulation takes longer. Also, memory requirement becomes a problem. Grid size 've reached 1200x1200. Going higher is not possible due to memory problem. I tried to convert my code to a parallel one, following the examples given. I also need to restructure parts of my code to enable parallel looping. I 1st changed the PETSc solver to be parallel enabled and then I restructured parts of my code. I proceed on as longer as the answer for a simple test case is correct. I thought it's not really possible to do any speed testing since the code is not fully parallelized yet. When I finished during most of the conversion, I found that in the actual run that it is much slower, although the answer is correct. So what is the remedy now? I wonder what I should do to check what's wrong. Must I restart everything again? Btw, my grid size is 1200x1200. I believed it should be suitable for parallel run of 4 processors? Is that so? Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Sun Apr 13 12:47:41 2008 From: knepley at gmail.com (Matthew Knepley) Date: Sun, 13 Apr 2008 12:47:41 -0500 Subject: Slow speed after changing from serial to parallel In-Reply-To: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> Message-ID: 1) There is no way to have any idea what is going on in your code without -log_summary output 2) Looking at that output, look at the percentage taken by the solver KSPSolve event. I suspect it is not the biggest component, because it is very scalable. Matt On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: > > > Hi, > > I've a serial 2D CFD code. As my grid size requirement increases, the > simulation takes longer. Also, memory requirement becomes a problem. Grid > size 've reached 1200x1200. Going higher is not possible due to memory > problem. > > I tried to convert my code to a parallel one, following the examples given. > I also need to restructure parts of my code to enable parallel looping. 
I > 1st changed the PETSc solver to be parallel enabled and then I restructured > parts of my code. I proceed on as longer as the answer for a simple test > case is correct. I thought it's not really possible to do any speed testing > since the code is not fully parallelized yet. When I finished during most of > the conversion, I found that in the actual run that it is much slower, > although the answer is correct. > > So what is the remedy now? I wonder what I should do to check what's wrong. > Must I restart everything again? Btw, my grid size is 1200x1200. I believed > it should be suitable for parallel run of 4 processors? Is that so? > > Thank you. -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From zonexo at gmail.com Mon Apr 14 05:49:34 2008 From: zonexo at gmail.com (Ben Tay) Date: Mon, 14 Apr 2008 18:49:34 +0800 Subject: Slow speed after changing from serial to parallel In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> Message-ID: <480336BE.3070507@gmail.com> Thank you Matthew. Sorry to trouble you again. I tried to run it with -log_summary output and I found that there's some errors in the execution. Well, I was busy with other things and I just came back to this problem. Some of my files on the server has also been deleted. It has been a while and I remember that it worked before, only much slower. Anyway, most of the serial code has been updated and maybe it's easier to convert the new serial code instead of debugging on the old parallel code now. I believe I can still reuse part of the old parallel code. However, I hope I can approach it better this time. So supposed I need to start converting my new serial code to parallel. There's 2 eqns to be solved using PETSc, the momentum and poisson. I also need to parallelize other parts of my code. I wonder which route is the best: 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, modify other parts of my code to parallel e.g. looping, updating of values etc. Once the execution is fine and speedup is reasonable, then modify the PETSc part - poisson eqn 1st followed by the momentum eqn. 2. Reverse the above order ie modify the PETSc part - poisson eqn 1st followed by the momentum eqn. Then do other parts of my code. I'm not sure if the above 2 mtds can work or if there will be conflicts. Of course, an alternative will be: 3. Do the poisson, momentum eqns and other parts of the code separately. That is, code a standalone parallel poisson eqn and use samples values to test it. Same for the momentum and other parts of the code. When each of them is working, combine them to form the full parallel code. However, this will be much more troublesome. I hope someone can give me some recommendations. Thank you once again. Matthew Knepley wrote: > 1) There is no way to have any idea what is going on in your code > without -log_summary output > > 2) Looking at that output, look at the percentage taken by the solver > KSPSolve event. I suspect it is not the biggest component, because > it is very scalable. > > Matt > > On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: > >> Hi, >> >> I've a serial 2D CFD code. As my grid size requirement increases, the >> simulation takes longer. Also, memory requirement becomes a problem. Grid >> size 've reached 1200x1200. Going higher is not possible due to memory >> problem. 
>> >> I tried to convert my code to a parallel one, following the examples given. >> I also need to restructure parts of my code to enable parallel looping. I >> 1st changed the PETSc solver to be parallel enabled and then I restructured >> parts of my code. I proceed on as longer as the answer for a simple test >> case is correct. I thought it's not really possible to do any speed testing >> since the code is not fully parallelized yet. When I finished during most of >> the conversion, I found that in the actual run that it is much slower, >> although the answer is correct. >> >> So what is the remedy now? I wonder what I should do to check what's wrong. >> Must I restart everything again? Btw, my grid size is 1200x1200. I believed >> it should be suitable for parallel run of 4 processors? Is that so? >> >> Thank you. >> > > > > From knepley at gmail.com Mon Apr 14 08:23:48 2008 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 14 Apr 2008 08:23:48 -0500 Subject: Slow speed after changing from serial to parallel In-Reply-To: <480336BE.3070507@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> Message-ID: I am not sure why you would ever have two codes. I never do this. PETSc is designed to write one code to run in serial and parallel. The PETSc part should look identical. To test, run the code yo uhave verified in serial and output PETSc data structures (like Mat and Vec) using a binary viewer. Then run in parallel with the same code, which will output the same structures. Take the two files and write a small verification code that loads both versions and calls MatEqual and VecEqual. Matt On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote: > Thank you Matthew. Sorry to trouble you again. > > I tried to run it with -log_summary output and I found that there's some > errors in the execution. Well, I was busy with other things and I just came > back to this problem. Some of my files on the server has also been deleted. > It has been a while and I remember that it worked before, only much > slower. > > Anyway, most of the serial code has been updated and maybe it's easier to > convert the new serial code instead of debugging on the old parallel code > now. I believe I can still reuse part of the old parallel code. However, I > hope I can approach it better this time. > > So supposed I need to start converting my new serial code to parallel. > There's 2 eqns to be solved using PETSc, the momentum and poisson. I also > need to parallelize other parts of my code. I wonder which route is the > best: > > 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, modify > other parts of my code to parallel e.g. looping, updating of values etc. > Once the execution is fine and speedup is reasonable, then modify the PETSc > part - poisson eqn 1st followed by the momentum eqn. > > 2. Reverse the above order ie modify the PETSc part - poisson eqn 1st > followed by the momentum eqn. Then do other parts of my code. > > I'm not sure if the above 2 mtds can work or if there will be conflicts. Of > course, an alternative will be: > > 3. Do the poisson, momentum eqns and other parts of the code separately. > That is, code a standalone parallel poisson eqn and use samples values to > test it. Same for the momentum and other parts of the code. When each of > them is working, combine them to form the full parallel code. However, this > will be much more troublesome. > > I hope someone can give me some recommendations. 
> > Thank you once again. > > > > Matthew Knepley wrote: > > > 1) There is no way to have any idea what is going on in your code > > without -log_summary output > > > > 2) Looking at that output, look at the percentage taken by the solver > > KSPSolve event. I suspect it is not the biggest component, because > > it is very scalable. > > > > Matt > > > > On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: > > > > > > > Hi, > > > > > > I've a serial 2D CFD code. As my grid size requirement increases, the > > > simulation takes longer. Also, memory requirement becomes a problem. > Grid > > > size 've reached 1200x1200. Going higher is not possible due to memory > > > problem. > > > > > > I tried to convert my code to a parallel one, following the examples > given. > > > I also need to restructure parts of my code to enable parallel looping. > I > > > 1st changed the PETSc solver to be parallel enabled and then I > restructured > > > parts of my code. I proceed on as longer as the answer for a simple test > > > case is correct. I thought it's not really possible to do any speed > testing > > > since the code is not fully parallelized yet. When I finished during > most of > > > the conversion, I found that in the actual run that it is much slower, > > > although the answer is correct. > > > > > > So what is the remedy now? I wonder what I should do to check what's > wrong. > > > Must I restart everything again? Btw, my grid size is 1200x1200. I > believed > > > it should be suitable for parallel run of 4 processors? Is that so? > > > > > > Thank you. > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From zonexo at gmail.com Mon Apr 14 08:43:36 2008 From: zonexo at gmail.com (Ben Tay) Date: Mon, 14 Apr 2008 21:43:36 +0800 Subject: Slow speed after changing from serial to parallel In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> Message-ID: <48035F88.2080003@gmail.com> Hi Matthew, I think you've misunderstood what I meant. What I'm trying to say is initially I've got a serial code. I tried to convert to a parallel one. Then I tested it and it was pretty slow. Due to some work requirement, I need to go back to make some changes to my code. Since the parallel is not working well, I updated and changed the serial one. Well, that was a while ago and now, due to the updates and changes, the serial code is different from the old converted parallel code. Some files were also deleted and I can't seem to get it working now. So I thought I might as well convert the new serial code to parallel. But I'm not very sure what I should do 1st. Maybe I should rephrase my question in that if I just convert my poisson equation subroutine from a serial PETSc to a parallel PETSc version, will it work? Should I expect a speedup? The rest of my code is still serial. Thank you very much. Matthew Knepley wrote: > I am not sure why you would ever have two codes. I never do this. PETSc > is designed to write one code to run in serial and parallel. The PETSc part > should look identical. To test, run the code yo uhave verified in serial and > output PETSc data structures (like Mat and Vec) using a binary viewer. > Then run in parallel with the same code, which will output the same > structures. 
Take the two files and write a small verification code that > loads both versions and calls MatEqual and VecEqual. > > Matt > > On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote: > >> Thank you Matthew. Sorry to trouble you again. >> >> I tried to run it with -log_summary output and I found that there's some >> errors in the execution. Well, I was busy with other things and I just came >> back to this problem. Some of my files on the server has also been deleted. >> It has been a while and I remember that it worked before, only much >> slower. >> >> Anyway, most of the serial code has been updated and maybe it's easier to >> convert the new serial code instead of debugging on the old parallel code >> now. I believe I can still reuse part of the old parallel code. However, I >> hope I can approach it better this time. >> >> So supposed I need to start converting my new serial code to parallel. >> There's 2 eqns to be solved using PETSc, the momentum and poisson. I also >> need to parallelize other parts of my code. I wonder which route is the >> best: >> >> 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, modify >> other parts of my code to parallel e.g. looping, updating of values etc. >> Once the execution is fine and speedup is reasonable, then modify the PETSc >> part - poisson eqn 1st followed by the momentum eqn. >> >> 2. Reverse the above order ie modify the PETSc part - poisson eqn 1st >> followed by the momentum eqn. Then do other parts of my code. >> >> I'm not sure if the above 2 mtds can work or if there will be conflicts. Of >> course, an alternative will be: >> >> 3. Do the poisson, momentum eqns and other parts of the code separately. >> That is, code a standalone parallel poisson eqn and use samples values to >> test it. Same for the momentum and other parts of the code. When each of >> them is working, combine them to form the full parallel code. However, this >> will be much more troublesome. >> >> I hope someone can give me some recommendations. >> >> Thank you once again. >> >> >> >> Matthew Knepley wrote: >> >> >>> 1) There is no way to have any idea what is going on in your code >>> without -log_summary output >>> >>> 2) Looking at that output, look at the percentage taken by the solver >>> KSPSolve event. I suspect it is not the biggest component, because >>> it is very scalable. >>> >>> Matt >>> >>> On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: >>> >>> >>> >>>> Hi, >>>> >>>> I've a serial 2D CFD code. As my grid size requirement increases, the >>>> simulation takes longer. Also, memory requirement becomes a problem. >>>> >> Grid >> >>>> size 've reached 1200x1200. Going higher is not possible due to memory >>>> problem. >>>> >>>> I tried to convert my code to a parallel one, following the examples >>>> >> given. >> >>>> I also need to restructure parts of my code to enable parallel looping. >>>> >> I >> >>>> 1st changed the PETSc solver to be parallel enabled and then I >>>> >> restructured >> >>>> parts of my code. I proceed on as longer as the answer for a simple test >>>> case is correct. I thought it's not really possible to do any speed >>>> >> testing >> >>>> since the code is not fully parallelized yet. When I finished during >>>> >> most of >> >>>> the conversion, I found that in the actual run that it is much slower, >>>> although the answer is correct. >>>> >>>> So what is the remedy now? I wonder what I should do to check what's >>>> >> wrong. >> >>>> Must I restart everything again? Btw, my grid size is 1200x1200. 
I >>>> >> believed >> >>>> it should be suitable for parallel run of 4 processors? Is that so? >>>> >>>> Thank you. >>>> >>>> >>>> >>> >>> >>> >>> >> > > > > From knepley at gmail.com Mon Apr 14 08:58:20 2008 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 14 Apr 2008 08:58:20 -0500 Subject: Slow speed after changing from serial to parallel In-Reply-To: <48035F88.2080003@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> Message-ID: On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay wrote: > Hi Matthew, > > I think you've misunderstood what I meant. What I'm trying to say is > initially I've got a serial code. I tried to convert to a parallel one. Then > I tested it and it was pretty slow. Due to some work requirement, I need to > go back to make some changes to my code. Since the parallel is not working > well, I updated and changed the serial one. > > Well, that was a while ago and now, due to the updates and changes, the > serial code is different from the old converted parallel code. Some files > were also deleted and I can't seem to get it working now. So I thought I > might as well convert the new serial code to parallel. But I'm not very sure > what I should do 1st. > > Maybe I should rephrase my question in that if I just convert my poisson > equation subroutine from a serial PETSc to a parallel PETSc version, will it > work? Should I expect a speedup? The rest of my code is still serial. You should, of course, only expect speedup in the parallel parts Matt > Thank you very much. > > > > Matthew Knepley wrote: > > > I am not sure why you would ever have two codes. I never do this. PETSc > > is designed to write one code to run in serial and parallel. The PETSc > part > > should look identical. To test, run the code yo uhave verified in serial > and > > output PETSc data structures (like Mat and Vec) using a binary viewer. > > Then run in parallel with the same code, which will output the same > > structures. Take the two files and write a small verification code that > > loads both versions and calls MatEqual and VecEqual. > > > > Matt > > > > On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote: > > > > > > > Thank you Matthew. Sorry to trouble you again. > > > > > > I tried to run it with -log_summary output and I found that there's > some > > > errors in the execution. Well, I was busy with other things and I just > came > > > back to this problem. Some of my files on the server has also been > deleted. > > > It has been a while and I remember that it worked before, only much > > > slower. > > > > > > Anyway, most of the serial code has been updated and maybe it's easier > to > > > convert the new serial code instead of debugging on the old parallel > code > > > now. I believe I can still reuse part of the old parallel code. However, > I > > > hope I can approach it better this time. > > > > > > So supposed I need to start converting my new serial code to parallel. > > > There's 2 eqns to be solved using PETSc, the momentum and poisson. I > also > > > need to parallelize other parts of my code. I wonder which route is the > > > best: > > > > > > 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, > modify > > > other parts of my code to parallel e.g. looping, updating of values etc. > > > Once the execution is fine and speedup is reasonable, then modify the > PETSc > > > part - poisson eqn 1st followed by the momentum eqn. > > > > > > 2. 
Reverse the above order ie modify the PETSc part - poisson eqn 1st > > > followed by the momentum eqn. Then do other parts of my code. > > > > > > I'm not sure if the above 2 mtds can work or if there will be > conflicts. Of > > > course, an alternative will be: > > > > > > 3. Do the poisson, momentum eqns and other parts of the code > separately. > > > That is, code a standalone parallel poisson eqn and use samples values > to > > > test it. Same for the momentum and other parts of the code. When each of > > > them is working, combine them to form the full parallel code. However, > this > > > will be much more troublesome. > > > > > > I hope someone can give me some recommendations. > > > > > > Thank you once again. > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > > > > 1) There is no way to have any idea what is going on in your code > > > > without -log_summary output > > > > > > > > 2) Looking at that output, look at the percentage taken by the solver > > > > KSPSolve event. I suspect it is not the biggest component, because > > > > it is very scalable. > > > > > > > > Matt > > > > > > > > On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > I've a serial 2D CFD code. As my grid size requirement increases, > the > > > > > simulation takes longer. Also, memory requirement becomes a problem. > > > > > > > > > > > > > > > > > Grid > > > > > > > > > > > > > > > size 've reached 1200x1200. Going higher is not possible due to > memory > > > > > problem. > > > > > > > > > > I tried to convert my code to a parallel one, following the examples > > > > > > > > > > > > > > > > > given. > > > > > > > > > > > > > > > I also need to restructure parts of my code to enable parallel > looping. > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > 1st changed the PETSc solver to be parallel enabled and then I > > > > > > > > > > > > > > > > > restructured > > > > > > > > > > > > > > > parts of my code. I proceed on as longer as the answer for a simple > test > > > > > case is correct. I thought it's not really possible to do any speed > > > > > > > > > > > > > > > > > testing > > > > > > > > > > > > > > > since the code is not fully parallelized yet. When I finished during > > > > > > > > > > > > > > > > > most of > > > > > > > > > > > > > > > the conversion, I found that in the actual run that it is much > slower, > > > > > although the answer is correct. > > > > > > > > > > So what is the remedy now? I wonder what I should do to check what's > > > > > > > > > > > > > > > > > wrong. > > > > > > > > > > > > > > > Must I restart everything again? Btw, my grid size is 1200x1200. I > > > > > > > > > > > > > > > > > believed > > > > > > > > > > > > > > > it should be suitable for parallel run of 4 processors? Is that so? > > > > > > > > > > Thank you. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From pivello at gmail.com Tue Apr 15 09:22:54 2008 From: pivello at gmail.com (=?ISO-8859-1?Q?M=E1rcio_Ricardo_Pivello?=) Date: Tue, 15 Apr 2008 11:22:54 -0300 Subject: PETSc + HYPRE Message-ID: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> Hi, I want to use hypre preconditioners coupled with PETSc, but so far I have not succeeded. 
Here's what I've done: Firstly I create the preconditioner: Mat A_Par(NSubSteps) Vec Unk_Par(NSubSteps) Vec B_Load_Par(NSubSteps) KSP KspSolv ---> PC precond ****************************** Later in the code I set the preconditioner type and create the Krylov solver: ----> call PCSetType(precond,'hypre',iError) ----> call PCHYPRESetType(precond,'boomeramg',iError) ----> call KSPCreate (PETSC_COMM_WORLD, KspSolv, iError) ----> call KSPSetFromOptions (KspSolv, iError) call KSPSetOperators (KspSolv, A_Par(nstp), A_Par(nstp), SAME_NONZERO_PATTERN, iError) call KSPSolve (KspSolv, B_Load_Par(nstp), Unk_Par(nstp), iError) *************************** Then, when I run the program I put the following options in the command line: mpirun -np 2 /home/mpivello/bin/SolverGP.x -pc_type hypre -pc_hypre_type boomeramg -pc_hypre_boomeramg_sweep_all true -pc_hypre_boomeramg_grid_sweeps 1 -pc_hypre_boomeramg_strong_threshold 0.9 -pc_hypre_boomeramg_max_iter 5 -pc_hypre_boomeramg_coarsen_type modifiedRuge-Stueben -f0 dummy.tmp 2>&1 -ksp_gmres_restart 200 -ksp_max_it 3000 -ksp_rtol 1.0e-10 -ksp_atol 1.0e-15 -ksp_monitor -log_summary < /dev/null > run.parallel.log & But this proceeding is not working. What am I doing wrong? Thanks in advance M?rcio Ricardo -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Tue Apr 15 09:36:23 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 09:36:23 -0500 Subject: PETSc + HYPRE In-Reply-To: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> References: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> Message-ID: On Tue, Apr 15, 2008 at 9:22 AM, M?rcio Ricardo Pivello wrote: > Hi, I want to use hypre preconditioners coupled with PETSc, but so far I > have not succeeded. Here's what I've done: > > Firstly I create the preconditioner: > > > Mat A_Par(NSubSteps) > Vec Unk_Par(NSubSteps) > Vec B_Load_Par(NSubSteps) > KSP KspSolv > ---> PC precond > > ****************************** > > Later in the code I set the preconditioner type and create the Krylov > solver: > > ----> call PCSetType(precond,'hypre',iError) > ----> call PCHYPRESetType(precond,'boomeramg',iError) > ----> call KSPCreate (PETSC_COMM_WORLD, KspSolv, iError) > ----> call KSPSetFromOptions (KspSolv, iError) > call KSPSetOperators (KspSolv, A_Par(nstp), A_Par(nstp), > SAME_NONZERO_PATTERN, iError) > call KSPSolve (KspSolv, B_Load_Par(nstp), Unk_Par(nstp), iError) > > > *************************** > > Then, when I run the program I put the following options in the command > line: > > mpirun -np 2 /home/mpivello/bin/SolverGP.x -pc_type hypre -pc_hypre_type > boomeramg -pc_hypre_boomeramg_sweep_all true -pc_hypre_boomeramg_grid_sweeps > 1 -pc_hypre_boomeramg_strong_threshold 0.9 -pc_hypre_boomeramg_max_iter 5 > -pc_hypre_boomeramg_coarsen_type modifiedRuge-Stueben -f0 dummy.tmp 2>&1 > -ksp_gmres_restart 200 -ksp_max_it 3000 -ksp_rtol 1.0e-10 -ksp_atol 1.0e-15 > -ksp_monitor -log_summary < /dev/null > run.parallel.log & > > But this proceeding is not working. What am I doing wrong? What does "not working" mean? 1) What is actually being run? Use -ksp_view to find out (always). 2) Above you set the PC type before creating the KSP. How does the KSP know about the PC? You should retrieve the PC from the KSP using KSPGetPC() and then customize it. Better yet, do everything from the command line -pc_type hypre -pc_hypre_type boomeramg ... 3) Did you configure with HYPRE? 
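For reference, a minimal sketch of the ordering suggested in point 2) — create the KSP first, pull out its PC with KSPGetPC(), then set the hypre/boomeramg type — shown with the C interface of that PETSc generation (the Fortran calls mirror it with the trailing error argument); A, b, and x are placeholders for the assembled operator and vectors:

  #include "petscksp.h"

  PetscErrorCode SolveWithBoomerAMG(Mat A, Vec b, Vec x)
  {
    KSP            ksp;
    PC             pc;
    PetscErrorCode ierr;

    ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN); CHKERRQ(ierr);

    /* configure the PC that belongs to this KSP, rather than a separately created PC object */
    ierr = KSPGetPC(ksp, &pc); CHKERRQ(ierr);
    ierr = PCSetType(pc, "hypre"); CHKERRQ(ierr);
    ierr = PCHYPRESetType(pc, "boomeramg"); CHKERRQ(ierr);

    /* let -ksp_* and -pc_hypre_boomeramg_* command-line options override these defaults */
    ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);

    ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
    ierr = KSPDestroy(ksp); CHKERRQ(ierr);
    return 0;
  }

With this ordering the command-line options listed earlier are picked up by KSPSetFromOptions(), and -ksp_view will show whether hypre/boomeramg is actually being used.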
Matt

> Thanks in advance
>
> Márcio Ricardo

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. - Norbert Wiener

From dalcinl at gmail.com Tue Apr 15 09:52:03 2008 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Tue, 15 Apr 2008 11:52:03 -0300 Subject: PETSc + HYPRE In-Reply-To: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> References: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> Message-ID: Do not create the PC !! Create first the KSP, next do KSPGetPC, and then configure the PC

On 4/15/08, Márcio Ricardo Pivello wrote:
> Hi, I want to use hypre preconditioners coupled with PETSc, but so far I
> have not succeeded. Here's what I've done:
>
> Firstly I create the preconditioner:
>
> Mat A_Par(NSubSteps)
> Vec Unk_Par(NSubSteps)
> Vec B_Load_Par(NSubSteps)
> KSP KspSolv
> ---> PC precond
>
> ******************************
>
> Later in the code I set the preconditioner type and create the Krylov
> solver:
>
> ----> call PCSetType(precond,'hypre',iError)
> ----> call PCHYPRESetType(precond,'boomeramg',iError)
> ----> call KSPCreate (PETSC_COMM_WORLD, KspSolv, iError)
> ----> call KSPSetFromOptions (KspSolv, iError)
> call KSPSetOperators (KspSolv, A_Par(nstp), A_Par(nstp),
> SAME_NONZERO_PATTERN, iError)
> call KSPSolve (KspSolv, B_Load_Par(nstp), Unk_Par(nstp), iError)
>
> ***************************
>
> Then, when I run the program I put the following options in the command
> line:
>
> mpirun -np 2 /home/mpivello/bin/SolverGP.x -pc_type hypre -pc_hypre_type
> boomeramg -pc_hypre_boomeramg_sweep_all true -pc_hypre_boomeramg_grid_sweeps
> 1 -pc_hypre_boomeramg_strong_threshold 0.9 -pc_hypre_boomeramg_max_iter 5
> -pc_hypre_boomeramg_coarsen_type modifiedRuge-Stueben -f0 dummy.tmp 2>&1
> -ksp_gmres_restart 200 -ksp_max_it 3000 -ksp_rtol 1.0e-10 -ksp_atol 1.0e-15
> -ksp_monitor -log_summary < /dev/null > run.parallel.log &
>
> But this procedure is not working. What am I doing wrong?
>
> Thanks in advance
>
> Márcio Ricardo

--
Lisandro Dalcín
---------------
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594

From zonexo at gmail.com Tue Apr 15 10:33:20 2008 From: zonexo at gmail.com (Ben Tay) Date: Tue, 15 Apr 2008 23:33:20 +0800 Subject: Slow speed after changing from serial to parallel In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> Message-ID: <4804CAC0.6060201@gmail.com> Hi,

I have converted the poisson eqn part of the CFD code to parallel. The grid size tested is 600x720. For the momentum eqn, I used another serial linear solver (nspcg) to prevent mixing of results.
Here's the output summary: --- Event Stage 0: Main Stage MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11100100 0 10 11100100 0 217 MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0* MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 1.7e+04 89100100100100 89100100100100 317 PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 8.8e+03 9 2 0 0 51 9 2 0 0 51 42* *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 636* VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 0.0e+00 0 0100100 0 0 0100100 0 0* *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0* *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 8.8e+03 9 4 0 0 51 9 4 0 0 51 62* ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. --- Event Stage 0: Main Stage Matrix 4 4 49227380 0 Krylov Solver 2 2 17216 0 Preconditioner 2 2 256 0 Index Set 5 5 2596120 0 Vec 40 40 62243224 0 Vec Scatter 1 1 0 0 ======================================================================================================================== Average time to get PetscTime(): 4.05312e-07 Average time for MPI_Barrier(): 7.62939e-07 Average time for zero size MPI_Send(): 2.02656e-06 OptionTable: -log_summary The PETSc manual states that ratio should be close to 1. There's quite a few *(in bold)* which are >1 and MatAssemblyBegin seems to be very big. So what could be the cause? 
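A large max/min ratio on MatAssemblyBegin generally reflects one of two things: values inserted into rows owned by the other process (such entries are stashed locally and only exchanged inside MatAssemblyBegin/End), or one process simply reaching the assembly call much later than the other. A minimal way to check for the first case is sketched here; Istart and Iend are placeholder names, while A_mat and the row index II are the ones used in the insertion loop described below:

      ! Sketch only: rows [Istart, Iend) of A_mat are owned by this process
      call MatGetOwnershipRange(A_mat, Istart, Iend, ierr)

      ! Inside the insertion loop, for each global row index II about to be set:
      if (II .lt. Istart .or. II .ge. Iend) then
         ! Off-process entry: it is stashed locally and only sent to its owner
         ! during MatAssemblyBegin/MatAssemblyEnd
         print *, 'off-process insertion into row ', II
      end if

If nothing is reported, the big ratio points to the second cause, i.e. one process doing much more work than the other before the assembly call.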
I wonder if it has to do the way I insert the matrix. My steps are: (cartesian grids, i loop faster than j, fortran) For matrix A and rhs Insert left extreme cells values belonging to myid if (myid==0) then insert corner cells values insert south cells values insert internal cells values else if (myid==num_procs-1) then insert corner cells values insert north cells values insert internal cells values else insert internal cells values end if Insert right extreme cells values belonging to myid All these values are entered into a big_A(size_x*size_y,5) matrix. int_A stores the position of the values. I then do call MatZeroEntries(A_mat,ierr) do k=ksta_p+1,kend_p !for cells belonging to myid do kk=1,5 II=k-1 JJ=int_A(k,kk)-1 call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr) end do end do call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr) call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr) I wonder if the problem lies here.I used the big_A matrix because I was migrating from an old linear solver. Lastly, I was told to widen my window to 120 characters. May I know how do I do it? Thank you very much. Matthew Knepley wrote: > On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay wrote: > >> Hi Matthew, >> >> I think you've misunderstood what I meant. What I'm trying to say is >> initially I've got a serial code. I tried to convert to a parallel one. Then >> I tested it and it was pretty slow. Due to some work requirement, I need to >> go back to make some changes to my code. Since the parallel is not working >> well, I updated and changed the serial one. >> >> Well, that was a while ago and now, due to the updates and changes, the >> serial code is different from the old converted parallel code. Some files >> were also deleted and I can't seem to get it working now. So I thought I >> might as well convert the new serial code to parallel. But I'm not very sure >> what I should do 1st. >> >> Maybe I should rephrase my question in that if I just convert my poisson >> equation subroutine from a serial PETSc to a parallel PETSc version, will it >> work? Should I expect a speedup? The rest of my code is still serial. >> > > You should, of course, only expect speedup in the parallel parts > > Matt > > >> Thank you very much. >> >> >> >> Matthew Knepley wrote: >> >> >>> I am not sure why you would ever have two codes. I never do this. PETSc >>> is designed to write one code to run in serial and parallel. The PETSc >>> >> part >> >>> should look identical. To test, run the code yo uhave verified in serial >>> >> and >> >>> output PETSc data structures (like Mat and Vec) using a binary viewer. >>> Then run in parallel with the same code, which will output the same >>> structures. Take the two files and write a small verification code that >>> loads both versions and calls MatEqual and VecEqual. >>> >>> Matt >>> >>> On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote: >>> >>> >>> >>>> Thank you Matthew. Sorry to trouble you again. >>>> >>>> I tried to run it with -log_summary output and I found that there's >>>> >> some >> >>>> errors in the execution. Well, I was busy with other things and I just >>>> >> came >> >>>> back to this problem. Some of my files on the server has also been >>>> >> deleted. >> >>>> It has been a while and I remember that it worked before, only much >>>> slower. >>>> >>>> Anyway, most of the serial code has been updated and maybe it's easier >>>> >> to >> >>>> convert the new serial code instead of debugging on the old parallel >>>> >> code >> >>>> now. 
I believe I can still reuse part of the old parallel code. However, >>>> >> I >> >>>> hope I can approach it better this time. >>>> >>>> So supposed I need to start converting my new serial code to parallel. >>>> There's 2 eqns to be solved using PETSc, the momentum and poisson. I >>>> >> also >> >>>> need to parallelize other parts of my code. I wonder which route is the >>>> best: >>>> >>>> 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, >>>> >> modify >> >>>> other parts of my code to parallel e.g. looping, updating of values etc. >>>> Once the execution is fine and speedup is reasonable, then modify the >>>> >> PETSc >> >>>> part - poisson eqn 1st followed by the momentum eqn. >>>> >>>> 2. Reverse the above order ie modify the PETSc part - poisson eqn 1st >>>> followed by the momentum eqn. Then do other parts of my code. >>>> >>>> I'm not sure if the above 2 mtds can work or if there will be >>>> >> conflicts. Of >> >>>> course, an alternative will be: >>>> >>>> 3. Do the poisson, momentum eqns and other parts of the code >>>> >> separately. >> >>>> That is, code a standalone parallel poisson eqn and use samples values >>>> >> to >> >>>> test it. Same for the momentum and other parts of the code. When each of >>>> them is working, combine them to form the full parallel code. However, >>>> >> this >> >>>> will be much more troublesome. >>>> >>>> I hope someone can give me some recommendations. >>>> >>>> Thank you once again. >>>> >>>> >>>> >>>> Matthew Knepley wrote: >>>> >>>> >>>> >>>> >>>>> 1) There is no way to have any idea what is going on in your code >>>>> without -log_summary output >>>>> >>>>> 2) Looking at that output, look at the percentage taken by the solver >>>>> KSPSolve event. I suspect it is not the biggest component, because >>>>> it is very scalable. >>>>> >>>>> Matt >>>>> >>>>> On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>> Hi, >>>>>> >>>>>> I've a serial 2D CFD code. As my grid size requirement increases, >>>>>> >> the >> >>>>>> simulation takes longer. Also, memory requirement becomes a problem. >>>>>> >>>>>> >>>>>> >>>> Grid >>>> >>>> >>>> >>>>>> size 've reached 1200x1200. Going higher is not possible due to >>>>>> >> memory >> >>>>>> problem. >>>>>> >>>>>> I tried to convert my code to a parallel one, following the examples >>>>>> >>>>>> >>>>>> >>>> given. >>>> >>>> >>>> >>>>>> I also need to restructure parts of my code to enable parallel >>>>>> >> looping. >> >>>>>> >>>> I >>>> >>>> >>>> >>>>>> 1st changed the PETSc solver to be parallel enabled and then I >>>>>> >>>>>> >>>>>> >>>> restructured >>>> >>>> >>>> >>>>>> parts of my code. I proceed on as longer as the answer for a simple >>>>>> >> test >> >>>>>> case is correct. I thought it's not really possible to do any speed >>>>>> >>>>>> >>>>>> >>>> testing >>>> >>>> >>>> >>>>>> since the code is not fully parallelized yet. When I finished during >>>>>> >>>>>> >>>>>> >>>> most of >>>> >>>> >>>> >>>>>> the conversion, I found that in the actual run that it is much >>>>>> >> slower, >> >>>>>> although the answer is correct. >>>>>> >>>>>> So what is the remedy now? I wonder what I should do to check what's >>>>>> >>>>>> >>>>>> >>>> wrong. >>>> >>>> >>>> >>>>>> Must I restart everything again? Btw, my grid size is 1200x1200. I >>>>>> >>>>>> >>>>>> >>>> believed >>>> >>>> >>>> >>>>>> it should be suitable for parallel run of 4 processors? Is that so? >>>>>> >>>>>> Thank you. 
>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >>> >>> >> > > > > From knepley at gmail.com Tue Apr 15 10:46:17 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 10:46:17 -0500 Subject: Slow speed after changing from serial to parallel In-Reply-To: <4804CAC0.6060201@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> Message-ID: 1) Please never cut out parts of the summary. All the information is valuable, and most times, necessary 2) You seem to have huge load imbalance (look at VecNorm). Do you partition the system yourself. How many processes is this? 3) You seem to be setting a huge number of off-process values in the matrix (see MatAssemblyBegin). Is this true? I would reorganize this part. Matt On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay wrote: > Hi, > > I have converted the poisson eqn part of the CFD code to parallel. The grid > size tested is 600x720. For the momentum eqn, I used another serial linear > solver (nspcg) to prevent mixing of results. Here's the output summary: > > --- Event Stage 0: Main Stage > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 > 0.0e+00 10 11100100 0 10 11100100 0 217 > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 > MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 > 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0* > MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 > 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 > KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 > 1.7e+04 89100100100100 89100100100100 317 > PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 > VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 > 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 > *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 > 8.8e+03 9 2 0 0 51 9 2 0 0 51 42* > *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 > 0.0e+00 0 1 0 0 0 0 1 0 0 0 636* > VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 > VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 > 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 > VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 
0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0* > *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0* > *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 > 8.8e+03 9 4 0 0 51 9 4 0 0 51 62* > > ------------------------------------------------------------------------------------------------------------------------ > Memory usage is given in bytes: > Object Type Creations Destructions Memory Descendants' Mem. > --- Event Stage 0: Main Stage > Matrix 4 4 49227380 0 > Krylov Solver 2 2 17216 0 > Preconditioner 2 2 256 0 > Index Set 5 5 2596120 0 > Vec 40 40 62243224 0 > Vec Scatter 1 1 0 0 > ======================================================================================================================== > Average time to get PetscTime(): 4.05312e-07 Average time > for MPI_Barrier(): 7.62939e-07 > Average time for zero size MPI_Send(): 2.02656e-06 > OptionTable: -log_summary > > > The PETSc manual states that ratio should be close to 1. There's quite a > few *(in bold)* which are >1 and MatAssemblyBegin seems to be very big. So > what could be the cause? > > I wonder if it has to do the way I insert the matrix. My steps are: > (cartesian grids, i loop faster than j, fortran) > > For matrix A and rhs > > Insert left extreme cells values belonging to myid > > if (myid==0) then > > insert corner cells values > > insert south cells values > > insert internal cells values > > else if (myid==num_procs-1) then > > insert corner cells values > > insert north cells values > > insert internal cells values > > else > > insert internal cells values > > end if > > Insert right extreme cells values belonging to myid > > All these values are entered into a big_A(size_x*size_y,5) matrix. int_A > stores the position of the values. I then do > > call MatZeroEntries(A_mat,ierr) > > do k=ksta_p+1,kend_p !for cells belonging to myid > > do kk=1,5 > > II=k-1 > > JJ=int_A(k,kk)-1 > > call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr) > end do > > end do > > call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr) > > call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr) > > > I wonder if the problem lies here.I used the big_A matrix because I was > migrating from an old linear solver. Lastly, I was told to widen my window > to 120 characters. May I know how do I do it? > > > > Thank you very much. > > Matthew Knepley wrote: > > > On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay wrote: > > > > > > > Hi Matthew, > > > > > > I think you've misunderstood what I meant. What I'm trying to say is > > > initially I've got a serial code. I tried to convert to a parallel one. > Then > > > I tested it and it was pretty slow. Due to some work requirement, I need > to > > > go back to make some changes to my code. Since the parallel is not > working > > > well, I updated and changed the serial one. > > > > > > Well, that was a while ago and now, due to the updates and changes, the > > > serial code is different from the old converted parallel code. Some > files > > > were also deleted and I can't seem to get it working now. So I thought I > > > might as well convert the new serial code to parallel. But I'm not very > sure > > > what I should do 1st. > > > > > > Maybe I should rephrase my question in that if I just convert my > poisson > > > equation subroutine from a serial PETSc to a parallel PETSc version, > will it > > > work? 
Should I expect a speedup? The rest of my code is still serial. > > > > > > > > > > You should, of course, only expect speedup in the parallel parts > > > > Matt > > > > > > > > > Thank you very much. > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > > > > I am not sure why you would ever have two codes. I never do this. > PETSc > > > > is designed to write one code to run in serial and parallel. The PETSc > > > > > > > > > > > part > > > > > > > > > > should look identical. To test, run the code yo uhave verified in > serial > > > > > > > > > > > and > > > > > > > > > > output PETSc data structures (like Mat and Vec) using a binary viewer. > > > > Then run in parallel with the same code, which will output the same > > > > structures. Take the two files and write a small verification code > that > > > > loads both versions and calls MatEqual and VecEqual. > > > > > > > > Matt > > > > > > > > On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote: > > > > > > > > > > > > > > > > > > > > > Thank you Matthew. Sorry to trouble you again. > > > > > > > > > > I tried to run it with -log_summary output and I found that there's > > > > > > > > > > > > > > > > > some > > > > > > > > > > > > > > > errors in the execution. Well, I was busy with other things and I > just > > > > > > > > > > > > > > > > > came > > > > > > > > > > > > > > > back to this problem. Some of my files on the server has also been > > > > > > > > > > > > > > > > > deleted. > > > > > > > > > > > > > > > It has been a while and I remember that it worked before, only > much > > > > > slower. > > > > > > > > > > Anyway, most of the serial code has been updated and maybe it's > easier > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > convert the new serial code instead of debugging on the old parallel > > > > > > > > > > > > > > > > > code > > > > > > > > > > > > > > > now. I believe I can still reuse part of the old parallel code. > However, > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > hope I can approach it better this time. > > > > > > > > > > So supposed I need to start converting my new serial code to > parallel. > > > > > There's 2 eqns to be solved using PETSc, the momentum and poisson. I > > > > > > > > > > > > > > > > > also > > > > > > > > > > > > > > > need to parallelize other parts of my code. I wonder which route is > the > > > > > best: > > > > > > > > > > 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, > > > > > > > > > > > > > > > > > modify > > > > > > > > > > > > > > > other parts of my code to parallel e.g. looping, updating of values > etc. > > > > > Once the execution is fine and speedup is reasonable, then modify > the > > > > > > > > > > > > > > > > > PETSc > > > > > > > > > > > > > > > part - poisson eqn 1st followed by the momentum eqn. > > > > > > > > > > 2. Reverse the above order ie modify the PETSc part - poisson eqn > 1st > > > > > followed by the momentum eqn. Then do other parts of my code. > > > > > > > > > > I'm not sure if the above 2 mtds can work or if there will be > > > > > > > > > > > > > > > > > conflicts. Of > > > > > > > > > > > > > > > course, an alternative will be: > > > > > > > > > > 3. Do the poisson, momentum eqns and other parts of the code > > > > > > > > > > > > > > > > > separately. > > > > > > > > > > > > > > > That is, code a standalone parallel poisson eqn and use samples > values > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > test it. Same for the momentum and other parts of the code. 
When > each of > > > > > them is working, combine them to form the full parallel code. > However, > > > > > > > > > > > > > > > > > this > > > > > > > > > > > > > > > will be much more troublesome. > > > > > > > > > > I hope someone can give me some recommendations. > > > > > > > > > > Thank you once again. > > > > > > > > > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1) There is no way to have any idea what is going on in your code > > > > > > without -log_summary output > > > > > > > > > > > > 2) Looking at that output, look at the percentage taken by the > solver > > > > > > KSPSolve event. I suspect it is not the biggest component, > because > > > > > > it is very scalable. > > > > > > > > > > > > Matt > > > > > > > > > > > > On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > I've a serial 2D CFD code. As my grid size requirement > increases, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > > > > > > > > > > simulation takes longer. Also, memory requirement becomes a > problem. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Grid > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > size 've reached 1200x1200. Going higher is not possible due to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > memory > > > > > > > > > > > > > > > > > > > > > > > > > > > > problem. > > > > > > > > > > > > > > I tried to convert my code to a parallel one, following the > examples > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > given. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I also need to restructure parts of my code to enable parallel > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > looping. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1st changed the PETSc solver to be parallel enabled and then I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > restructured > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > parts of my code. I proceed on as longer as the answer for a > simple > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > test > > > > > > > > > > > > > > > > > > > > > > > > > > > > case is correct. I thought it's not really possible to do any > speed > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > testing > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > since the code is not fully parallelized yet. When I finished > during > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > most of > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the conversion, I found that in the actual run that it is much > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > slower, > > > > > > > > > > > > > > > > > > > > > > > > > > > > although the answer is correct. > > > > > > > > > > > > > > So what is the remedy now? I wonder what I should do to check > what's > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrong. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Must I restart everything again? Btw, my grid size is 1200x1200. 
> I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > believed > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > it should be suitable for parallel run of 4 processors? Is that > so? > > > > > > > > > > > > > > Thank you. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From zonexo at gmail.com Tue Apr 15 10:56:52 2008 From: zonexo at gmail.com (Ben Tay) Date: Tue, 15 Apr 2008 23:56:52 +0800 Subject: Slow speed after changing from serial to parallel In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> Message-ID: <4804D044.2060502@gmail.com> Oh sorry here's the whole information. I'm using 2 processors currently: ************************************************************************************************************************ *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document *** ************************************************************************************************************************ ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by g0306332 Tue Apr 15 23:03:09 2008 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b Max Max/Min Avg Total Time (sec): 1.114e+03 1.00054 1.114e+03 Objects: 5.400e+01 1.00000 5.400e+01 Flops: 1.574e+11 1.00000 1.574e+11 3.147e+11 Flops/sec: 1.414e+08 1.00054 1.413e+08 2.826e+08 MPI Messages: 8.777e+03 1.00000 8.777e+03 1.755e+04 MPI Message Lengths: 4.213e+07 1.00000 4.800e+03 8.425e+07 MPI Reductions: 8.644e+03 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 1.1136e+03 100.0% 3.1475e+11 100.0% 1.755e+04 100.0% 4.800e+03 100.0% 1.729e+04 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops/sec: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). 
%T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ ########################################################## # # # WARNING!!! # # # # This code was run without the PreLoadBegin() # # macros. To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. # ########################################################## Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11100100 0 10 11100100 0 217 MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0 MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 1.7e+04 89100100100100 89100100100100 317 PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 8.8e+03 9 2 0 0 51 9 2 0 0 51 42 VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 636 VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 0.0e+00 0 0100100 0 0 0100100 0 0 VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 
7.7 0.0e+00 0.0e+00 8.8e+03 9 4 0 0 51 9 4 0 0 51 62 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. --- Event Stage 0: Main Stage Matrix 4 4 49227380 0 Krylov Solver 2 2 17216 0 Preconditioner 2 2 256 0 Index Set 5 5 2596120 0 Vec 40 40 62243224 0 Vec Scatter 1 1 0 0 ======================================================================================================================== Average time to get PetscTime(): 4.05312e-07 Average time for MPI_Barrier(): 7.62939e-07 Average time for zero size MPI_Send(): 2.02656e-06 OptionTable: -log_summary Compiled without FORTRAN kernels Compiled with full precision matrices (default) Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 Configure run at: Tue Jan 8 22:22:08 2008 Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi-shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 ----------------------------------------- Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 Using PETSc arch: atlas3-mpi ----------------------------------------- Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. -fPIC -O ----------------------------------------- Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include - I/home/enduser/g0306332/lib/hypre/include -I/usr/local/topspin/mpi/mpich/include ------------------------------------------ Using C linker: mpicc -fPIC -O Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib -L/usr/local/topspin/mpi/mpich/lib -lmpich -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc ------------------------------------------ 1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (28major+153248minor)pagefaults 0swaps 387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (18major+158175minor)pagefaults 0swaps Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME ===== ========== ================ ======================= =================== 00000 atlas3-c05 time ./a.out -lo Done 04/15/2008 23:03:10 
00001 atlas3-c05 time ./a.out -lo Done 04/15/2008 23:03:10 I have a cartesian grid 600x720. Since there's 2 processors, it is partitioned to 600x360. I just use: call MatCreateMPIAIJ(MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k,5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr) call MatSetFromOptions(A_mat,ierr) call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr) call KSPCreate(MPI_COMM_WORLD,ksp,ierr) call VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr) total_k is actually size_x*size_y. Since it's 2d, the maximum values per row is 5. When you says setting off-process values, do you mean I insert values from 1 processor into another? I thought I insert the values into the correct processor... Thank you very much! Matthew Knepley wrote: > 1) Please never cut out parts of the summary. All the information is valuable, > and most times, necessary > > 2) You seem to have huge load imbalance (look at VecNorm). Do you partition > the system yourself. How many processes is this? > > 3) You seem to be setting a huge number of off-process values in the matrix > (see MatAssemblyBegin). Is this true? I would reorganize this part. > > Matt > > On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay wrote: > >> Hi, >> >> I have converted the poisson eqn part of the CFD code to parallel. The grid >> size tested is 600x720. For the momentum eqn, I used another serial linear >> solver (nspcg) to prevent mixing of results. Here's the output summary: >> >> --- Event Stage 0: Main Stage >> >> MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 >> 0.0e+00 10 11100100 0 10 11100100 0 217 >> MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 >> 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 >> MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 >> MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 >> 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0* >> MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 >> 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 >> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 >> 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 >> KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 >> 1.7e+04 89100100100100 89100100100100 317 >> PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 >> 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >> PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 >> 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >> PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 >> 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 >> VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 >> 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 >> *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 >> 8.8e+03 9 2 0 0 51 9 2 0 0 51 42* >> *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 >> 0.0e+00 0 1 0 0 0 0 1 0 0 0 636* >> VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 
0 0 >> VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 >> VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 >> 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 >> VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 >> 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 >> 0.0e+00 0 0100100 0 0 0100100 0 0* >> *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0* >> *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 >> 8.8e+03 9 4 0 0 51 9 4 0 0 51 62* >> >> ------------------------------------------------------------------------------------------------------------------------ >> Memory usage is given in bytes: >> Object Type Creations Destructions Memory Descendants' Mem. >> --- Event Stage 0: Main Stage >> Matrix 4 4 49227380 0 >> Krylov Solver 2 2 17216 0 >> Preconditioner 2 2 256 0 >> Index Set 5 5 2596120 0 >> Vec 40 40 62243224 0 >> Vec Scatter 1 1 0 0 >> ======================================================================================================================== >> Average time to get PetscTime(): 4.05312e-07 Average time >> for MPI_Barrier(): 7.62939e-07 >> Average time for zero size MPI_Send(): 2.02656e-06 >> OptionTable: -log_summary >> >> >> The PETSc manual states that ratio should be close to 1. There's quite a >> few *(in bold)* which are >1 and MatAssemblyBegin seems to be very big. So >> what could be the cause? >> >> I wonder if it has to do the way I insert the matrix. My steps are: >> (cartesian grids, i loop faster than j, fortran) >> >> For matrix A and rhs >> >> Insert left extreme cells values belonging to myid >> >> if (myid==0) then >> >> insert corner cells values >> >> insert south cells values >> >> insert internal cells values >> >> else if (myid==num_procs-1) then >> >> insert corner cells values >> >> insert north cells values >> >> insert internal cells values >> >> else >> >> insert internal cells values >> >> end if >> >> Insert right extreme cells values belonging to myid >> >> All these values are entered into a big_A(size_x*size_y,5) matrix. int_A >> stores the position of the values. I then do >> >> call MatZeroEntries(A_mat,ierr) >> >> do k=ksta_p+1,kend_p !for cells belonging to myid >> >> do kk=1,5 >> >> II=k-1 >> >> JJ=int_A(k,kk)-1 >> >> call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr) >> end do >> >> end do >> >> call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr) >> >> call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr) >> >> >> I wonder if the problem lies here.I used the big_A matrix because I was >> migrating from an old linear solver. Lastly, I was told to widen my window >> to 120 characters. May I know how do I do it? >> >> >> >> Thank you very much. >> >> Matthew Knepley wrote: >> >> >>> On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay wrote: >>> >>> >>> >>>> Hi Matthew, >>>> >>>> I think you've misunderstood what I meant. What I'm trying to say is >>>> initially I've got a serial code. I tried to convert to a parallel one. >>>> >> Then >> >>>> I tested it and it was pretty slow. Due to some work requirement, I need >>>> >> to >> >>>> go back to make some changes to my code. Since the parallel is not >>>> >> working >> >>>> well, I updated and changed the serial one. 
>>>> >>>> Well, that was a while ago and now, due to the updates and changes, the >>>> serial code is different from the old converted parallel code. Some >>>> >> files >> >>>> were also deleted and I can't seem to get it working now. So I thought I >>>> might as well convert the new serial code to parallel. But I'm not very >>>> >> sure >> >>>> what I should do 1st. >>>> >>>> Maybe I should rephrase my question in that if I just convert my >>>> >> poisson >> >>>> equation subroutine from a serial PETSc to a parallel PETSc version, >>>> >> will it >> >>>> work? Should I expect a speedup? The rest of my code is still serial. >>>> >>>> >>>> >>> You should, of course, only expect speedup in the parallel parts >>> >>> Matt >>> >>> >>> >>> >>>> Thank you very much. >>>> >>>> >>>> >>>> Matthew Knepley wrote: >>>> >>>> >>>> >>>> >>>>> I am not sure why you would ever have two codes. I never do this. >>>>> >> PETSc >> >>>>> is designed to write one code to run in serial and parallel. The PETSc >>>>> >>>>> >>>>> >>>> part >>>> >>>> >>>> >>>>> should look identical. To test, run the code yo uhave verified in >>>>> >> serial >> >>>>> >>>> and >>>> >>>> >>>> >>>>> output PETSc data structures (like Mat and Vec) using a binary viewer. >>>>> Then run in parallel with the same code, which will output the same >>>>> structures. Take the two files and write a small verification code >>>>> >> that >> >>>>> loads both versions and calls MatEqual and VecEqual. >>>>> >>>>> Matt >>>>> >>>>> On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote: >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>> Thank you Matthew. Sorry to trouble you again. >>>>>> >>>>>> I tried to run it with -log_summary output and I found that there's >>>>>> >>>>>> >>>>>> >>>> some >>>> >>>> >>>> >>>>>> errors in the execution. Well, I was busy with other things and I >>>>>> >> just >> >>>>>> >>>> came >>>> >>>> >>>> >>>>>> back to this problem. Some of my files on the server has also been >>>>>> >>>>>> >>>>>> >>>> deleted. >>>> >>>> >>>> >>>>>> It has been a while and I remember that it worked before, only >>>>>> >> much >> >>>>>> slower. >>>>>> >>>>>> Anyway, most of the serial code has been updated and maybe it's >>>>>> >> easier >> >>>>>> >>>> to >>>> >>>> >>>> >>>>>> convert the new serial code instead of debugging on the old parallel >>>>>> >>>>>> >>>>>> >>>> code >>>> >>>> >>>> >>>>>> now. I believe I can still reuse part of the old parallel code. >>>>>> >> However, >> >>>>>> >>>> I >>>> >>>> >>>> >>>>>> hope I can approach it better this time. >>>>>> >>>>>> So supposed I need to start converting my new serial code to >>>>>> >> parallel. >> >>>>>> There's 2 eqns to be solved using PETSc, the momentum and poisson. I >>>>>> >>>>>> >>>>>> >>>> also >>>> >>>> >>>> >>>>>> need to parallelize other parts of my code. I wonder which route is >>>>>> >> the >> >>>>>> best: >>>>>> >>>>>> 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, >>>>>> >>>>>> >>>>>> >>>> modify >>>> >>>> >>>> >>>>>> other parts of my code to parallel e.g. looping, updating of values >>>>>> >> etc. >> >>>>>> Once the execution is fine and speedup is reasonable, then modify >>>>>> >> the >> >>>>>> >>>> PETSc >>>> >>>> >>>> >>>>>> part - poisson eqn 1st followed by the momentum eqn. >>>>>> >>>>>> 2. Reverse the above order ie modify the PETSc part - poisson eqn >>>>>> >> 1st >> >>>>>> followed by the momentum eqn. Then do other parts of my code. >>>>>> >>>>>> I'm not sure if the above 2 mtds can work or if there will be >>>>>> >>>>>> >>>>>> >>>> conflicts. 
Of >>>> >>>> >>>> >>>>>> course, an alternative will be: >>>>>> >>>>>> 3. Do the poisson, momentum eqns and other parts of the code >>>>>> >>>>>> >>>>>> >>>> separately. >>>> >>>> >>>> >>>>>> That is, code a standalone parallel poisson eqn and use samples >>>>>> >> values >> >>>>>> >>>> to >>>> >>>> >>>> >>>>>> test it. Same for the momentum and other parts of the code. When >>>>>> >> each of >> >>>>>> them is working, combine them to form the full parallel code. >>>>>> >> However, >> >>>>>> >>>> this >>>> >>>> >>>> >>>>>> will be much more troublesome. >>>>>> >>>>>> I hope someone can give me some recommendations. >>>>>> >>>>>> Thank you once again. >>>>>> >>>>>> >>>>>> >>>>>> Matthew Knepley wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> 1) There is no way to have any idea what is going on in your code >>>>>>> without -log_summary output >>>>>>> >>>>>>> 2) Looking at that output, look at the percentage taken by the >>>>>>> >> solver >> >>>>>>> KSPSolve event. I suspect it is not the biggest component, >>>>>>> >> because >> >>>>>>> it is very scalable. >>>>>>> >>>>>>> Matt >>>>>>> >>>>>>> On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> I've a serial 2D CFD code. As my grid size requirement >>>>>>>> >> increases, >> >>>>>>>> >>>> the >>>> >>>> >>>> >>>>>>>> simulation takes longer. Also, memory requirement becomes a >>>>>>>> >> problem. >> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> Grid >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> size 've reached 1200x1200. Going higher is not possible due to >>>>>>>> >>>>>>>> >>>>>>>> >>>> memory >>>> >>>> >>>> >>>>>>>> problem. >>>>>>>> >>>>>>>> I tried to convert my code to a parallel one, following the >>>>>>>> >> examples >> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> given. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> I also need to restructure parts of my code to enable parallel >>>>>>>> >>>>>>>> >>>>>>>> >>>> looping. >>>> >>>> >>>> >>>>>>>> >>>>>> I >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> 1st changed the PETSc solver to be parallel enabled and then I >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> restructured >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> parts of my code. I proceed on as longer as the answer for a >>>>>>>> >> simple >> >>>>>>>> >>>> test >>>> >>>> >>>> >>>>>>>> case is correct. I thought it's not really possible to do any >>>>>>>> >> speed >> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> testing >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> since the code is not fully parallelized yet. When I finished >>>>>>>> >> during >> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> most of >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> the conversion, I found that in the actual run that it is much >>>>>>>> >>>>>>>> >>>>>>>> >>>> slower, >>>> >>>> >>>> >>>>>>>> although the answer is correct. >>>>>>>> >>>>>>>> So what is the remedy now? I wonder what I should do to check >>>>>>>> >> what's >> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> wrong. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> Must I restart everything again? Btw, my grid size is 1200x1200. >>>>>>>> >> I >> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> believed >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> it should be suitable for parallel run of 4 processors? Is that >>>>>>>> >> so? >> >>>>>>>> Thank you. 
>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >>> >>> >> > > > > From bsmith at mcs.anl.gov Tue Apr 15 11:09:10 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 15 Apr 2008 11:09:10 -0500 Subject: Slow speed after changing from serial to parallel In-Reply-To: <4804D044.2060502@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> Message-ID: It is taking 8776 iterations of GMRES! How many does it take on one process? This is a huge amount. MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e +03 0.0e+00 10 11100100 0 10 11100100 0 217 MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e +00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 One process is spending 2.9 times as long in the embarresingly parallel MatSolve then the other process; this indicates a huge imbalance in the number of nonzeros on each process. As Matt noticed, the partitioning between the two processes is terrible. Barry On Apr 15, 2008, at 10:56 AM, Ben Tay wrote: > Oh sorry here's the whole information. I'm using 2 processors > currently: > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript - > r -fCourier9' to print this document *** > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance > Summary: ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by > g0306332 Tue Apr 15 23:03:09 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST > 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.114e+03 1.00054 1.114e+03 > Objects: 5.400e+01 1.00000 5.400e+01 > Flops: 1.574e+11 1.00000 1.574e+11 3.147e+11 > Flops/sec: 1.414e+08 1.00054 1.413e+08 2.826e+08 > MPI Messages: 8.777e+03 1.00000 8.777e+03 1.755e+04 > MPI Message Lengths: 4.213e+07 1.00000 4.800e+03 8.425e+07 > MPI Reductions: 8.644e+03 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length > N --> 2N flops > and VecAXPY() for complex vectors of > length N --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- > Messages --- -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts > %Total Avg %Total counts %Total > 0: Main Stage: 1.1136e+03 100.0% 3.1475e+11 100.0% 1.755e+04 > 100.0% 4.800e+03 100.0% 1.729e+04 100.0% > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() > and PetscLogStagePop(). 
> %T - percent time in this phase %F - percent flops in > this phase > %M - percent messages in this phase %L - percent message > lengths in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time > over all processors) > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. # > ########################################################## > > > Event Count Time (sec) Flops/ > sec --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg > len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e > +03 0.0e+00 10 11100100 0 10 11100100 0 217 > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e > +00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 > MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e > +00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 > 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0 > MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e > +03 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e > +00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e > +00 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 > KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e > +03 1.7e+04 89100100100100 89100100100100 317 > PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e > +00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e > +00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e > +00 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 > VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e > +00 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 > VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e > +00 8.8e+03 9 2 0 0 51 9 2 0 0 51 42 > VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e > +00 0.0e+00 0 1 0 0 0 0 1 0 0 0 636 > VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 > VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e > +00 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 > VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e > +00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e > 
+03 0.0e+00 0 0100100 0 0 0100100 0 0 > VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e > +00 8.8e+03 9 4 0 0 51 9 4 0 0 51 62 > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' > Mem. > > --- Event Stage 0: Main Stage > > Matrix 4 4 49227380 0 > Krylov Solver 2 2 17216 0 > Preconditioner 2 2 256 0 > Index Set 5 5 2596120 0 > Vec 40 40 62243224 0 > Vec Scatter 1 1 0 0 > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > ====================================================================== > Average time to get PetscTime(): 4.05312e-07 > Average time for MPI_Barrier(): 7.62939e-07 > Average time for zero size MPI_Send(): 2.02656e-06 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > Configure options: --with-memcmp-ok --sizeof_char=1 -- > sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 -- > sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 -- > bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with- > vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/ > g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi- > shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include -- > with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with- > mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun --with-blas-lapack- > dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 > ----------------------------------------- > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed > Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 > Using PETSc arch: atlas3-mpi > ----------------------------------------- > Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. > -fPIC -O ----------------------------------------- > Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 -I/ > nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi -I/nfs/ > home/enduser/g0306332/petsc-2.3.3-p8/include - > I/home/enduser/g0306332/lib/hypre/include -I/usr/local/topspin/mpi/ > mpich/include ------------------------------------------ > Using C linker: mpicc -fPIC -O > Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: -Wl,- > rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -L/ > nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts - > lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib -L/home/enduser/ > g0306332/lib/hypre/lib -lHYPRE -Wl,-rpath,/opt/mvapich/0.9.9/gen2/ > lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/ > 9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,- > rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/ > gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/ > 9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,- > rpath,/usr/lib64 -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib -L/usr/ > local/topspin/mpi/mpich/lib -lmpich -Wl,-rpath,/opt/intel/cmkl/8.1.1/ > lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t > -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/ > lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/ > gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt -Wl,-rpath,/ > opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/ > usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat- > linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo - > lirc -lgcc_s -lirc_s -lmpichf90nc -Wl,-rpath,/opt/mvapich/0.9.9/gen2/ > lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/ > 9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,- > rpath,/usr/lib64 -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/ > fce/9.1.045/lib -lifport -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/ > gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/ > 9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,- > rpath,/usr/lib64 -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,- > rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib - > Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/ > lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib - > Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/ > lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/ > usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/ > local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/ > usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc+ > + -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/ > local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/ > usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,- > rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl > -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 - > libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/ > lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64- > redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,- > rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s - > lirc_s -ldl -lc > ------------------------------------------ > 1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (28major+153248minor)pagefaults 0swaps > 387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (18major+158175minor)pagefaults 0swaps > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > TID HOST_NAME COMMAND_LINE > STATUS 
TERMINATION_TIME > ===== ========== ================ ======================= > =================== > 00000 atlas3-c05 time ./a.out -lo Done > 04/15/2008 23:03:10 > 00001 atlas3-c05 time ./a.out -lo Done > 04/15/2008 23:03:10 > > > I have a cartesian grid 600x720. Since there's 2 processors, it is > partitioned to 600x360. I just use: > > call > MatCreateMPIAIJ > (MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k, > 5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr) > > call MatSetFromOptions(A_mat,ierr) > > call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr) > > call KSPCreate(MPI_COMM_WORLD,ksp,ierr) > > call > VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr) > > total_k is actually size_x*size_y. Since it's 2d, the maximum values > per row is 5. When you says setting off-process values, do you mean > I insert values from 1 processor into another? I thought I insert > the values into the correct processor... > > Thank you very much! > > > > Matthew Knepley wrote: >> 1) Please never cut out parts of the summary. All the information >> is valuable, >> and most times, necessary >> >> 2) You seem to have huge load imbalance (look at VecNorm). Do you >> partition >> the system yourself. How many processes is this? >> >> 3) You seem to be setting a huge number of off-process values in >> the matrix >> (see MatAssemblyBegin). Is this true? I would reorganize this >> part. >> >> Matt >> >> On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay wrote: >> >>> Hi, >>> >>> I have converted the poisson eqn part of the CFD code to parallel. >>> The grid >>> size tested is 600x720. For the momentum eqn, I used another >>> serial linear >>> solver (nspcg) to prevent mixing of results. Here's the output >>> summary: >>> >>> --- Event Stage 0: Main Stage >>> >>> MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 >>> 4.8e+03 >>> 0.0e+00 10 11100100 0 10 11100100 0 217 >>> MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 >>> MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 >>> MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e >>> +00 >>> 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0* >>> MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 >>> 2.4e+03 >>> 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 >>> 0.0e+00 >>> 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 >>> KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 >>> 4.8e+03 >>> 1.7e+04 89100100100100 89100100100100 317 >>> PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 >>> 0.0e+00 >>> 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >>> PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 >>> 0.0e+00 >>> 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >>> PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 >>> VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 >>> 0.0e+00 >>> 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 >>> *VecNorm 8777 1.0 
1.8237e+0210.2 2.13e+0810.2 0.0e+00 >>> 0.0e+00 >>> 8.8e+03 9 2 0 0 51 9 2 0 0 51 42* >>> *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 1 0 0 0 0 1 0 0 0 636* >>> VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 >>> VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 >>> VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 >>> 4.8e+03 >>> 0.0e+00 0 0100100 0 0 0100100 0 0* >>> *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0* >>> *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 >>> 0.0e+00 >>> 8.8e+03 9 4 0 0 51 9 4 0 0 51 62* >>> >>> ------------------------------------------------------------------------------------------------------------------------ >>> Memory usage is given in bytes: >>> Object Type Creations Destructions Memory >>> Descendants' Mem. >>> --- Event Stage 0: Main Stage >>> Matrix 4 4 49227380 0 >>> Krylov Solver 2 2 17216 0 >>> Preconditioner 2 2 256 0 >>> Index Set 5 5 2596120 0 >>> Vec 40 40 62243224 0 >>> Vec Scatter 1 1 0 0 >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> ==================================================================== >>> Average time to get PetscTime(): 4.05312e-07 >>> Average time >>> for MPI_Barrier(): 7.62939e-07 >>> Average time for zero size MPI_Send(): 2.02656e-06 >>> OptionTable: -log_summary >>> >>> >>> The PETSc manual states that ratio should be close to 1. There's >>> quite a >>> few *(in bold)* which are >1 and MatAssemblyBegin seems to be very >>> big. So >>> what could be the cause? >>> >>> I wonder if it has to do the way I insert the matrix. My steps are: >>> (cartesian grids, i loop faster than j, fortran) >>> >>> For matrix A and rhs >>> >>> Insert left extreme cells values belonging to myid >>> >>> if (myid==0) then >>> >>> insert corner cells values >>> >>> insert south cells values >>> >>> insert internal cells values >>> >>> else if (myid==num_procs-1) then >>> >>> insert corner cells values >>> >>> insert north cells values >>> >>> insert internal cells values >>> >>> else >>> >>> insert internal cells values >>> >>> end if >>> >>> Insert right extreme cells values belonging to myid >>> >>> All these values are entered into a big_A(size_x*size_y,5) matrix. >>> int_A >>> stores the position of the values. 
I then do >>> >>> call MatZeroEntries(A_mat,ierr) >>> >>> do k=ksta_p+1,kend_p !for cells belonging to myid >>> >>> do kk=1,5 >>> >>> II=k-1 >>> >>> JJ=int_A(k,kk)-1 >>> >>> call MatSetValues(A_mat,1,II, >>> 1,JJ,big_A(k,kk),ADD_VALUES,ierr) >>> end do >>> >>> end do >>> >>> call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr) >>> >>> call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr) >>> >>> >>> I wonder if the problem lies here.I used the big_A matrix because >>> I was >>> migrating from an old linear solver. Lastly, I was told to widen >>> my window >>> to 120 characters. May I know how do I do it? >>> >>> >>> >>> Thank you very much. >>> >>> Matthew Knepley wrote: >>> >>> >>>> On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay wrote: >>>> >>>> >>>> >>>>> Hi Matthew, >>>>> >>>>> I think you've misunderstood what I meant. What I'm trying to >>>>> say is >>>>> initially I've got a serial code. I tried to convert to a >>>>> parallel one. >>>>> >>> Then >>> >>>>> I tested it and it was pretty slow. Due to some work >>>>> requirement, I need >>>>> >>> to >>> >>>>> go back to make some changes to my code. Since the parallel is not >>>>> >>> working >>> >>>>> well, I updated and changed the serial one. >>>>> >>>>> Well, that was a while ago and now, due to the updates and >>>>> changes, the >>>>> serial code is different from the old converted parallel code. >>>>> Some >>>>> >>> files >>> >>>>> were also deleted and I can't seem to get it working now. So I >>>>> thought I >>>>> might as well convert the new serial code to parallel. But I'm >>>>> not very >>>>> >>> sure >>> >>>>> what I should do 1st. >>>>> >>>>> Maybe I should rephrase my question in that if I just convert my >>>>> >>> poisson >>> >>>>> equation subroutine from a serial PETSc to a parallel PETSc >>>>> version, >>>>> >>> will it >>> >>>>> work? Should I expect a speedup? The rest of my code is still >>>>> serial. >>>>> >>>>> >>>>> >>>> You should, of course, only expect speedup in the parallel parts >>>> >>>> Matt >>>> >>>> >>>> >>>> >>>>> Thank you very much. >>>>> >>>>> >>>>> >>>>> Matthew Knepley wrote: >>>>> >>>>> >>>>> >>>>> >>>>>> I am not sure why you would ever have two codes. I never do this. >>>>>> >>> PETSc >>> >>>>>> is designed to write one code to run in serial and parallel. >>>>>> The PETSc >>>>>> >>>>>> >>>>>> >>>>> part >>>>> >>>>> >>>>> >>>>>> should look identical. To test, run the code yo uhave verified in >>>>>> >>> serial >>> >>>>>> >>>>> and >>>>> >>>>> >>>>> >>>>>> output PETSc data structures (like Mat and Vec) using a binary >>>>>> viewer. >>>>>> Then run in parallel with the same code, which will output the >>>>>> same >>>>>> structures. Take the two files and write a small verification >>>>>> code >>>>>> >>> that >>> >>>>>> loads both versions and calls MatEqual and VecEqual. >>>>>> >>>>>> Matt >>>>>> >>>>>> On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay >>>>>> wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> Thank you Matthew. Sorry to trouble you again. >>>>>>> >>>>>>> I tried to run it with -log_summary output and I found that >>>>>>> there's >>>>>>> >>>>>>> >>>>>>> >>>>> some >>>>> >>>>> >>>>> >>>>>>> errors in the execution. Well, I was busy with other things >>>>>>> and I >>>>>>> >>> just >>> >>>>>>> >>>>> came >>>>> >>>>> >>>>> >>>>>>> back to this problem. Some of my files on the server has also >>>>>>> been >>>>>>> >>>>>>> >>>>>>> >>>>> deleted. >>>>> >>>>> >>>>> >>>>>>> It has been a while and I remember that it worked before, only >>>>>>> >>> much >>> >>>>>>> slower. 
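(Returning to the insertion loop quoted above: a small diagnostic sketch, not from the original code, that one could place just before the MatSetValues loop to count how many rows fall outside the local ownership range of A_mat. Rows outside [Istart, Iend) are stashed and shipped to the owning process during MatAssemblyBegin, which is the usual cause of a very large MatAssemblyBegin time. Istart, Iend and noff are placeholder integer/PetscInt names; everything else is as in the loop above.)

      call MatGetOwnershipRange(A_mat, Istart, Iend, ierr)
      noff = 0
      do k = ksta_p+1, kend_p
         II = k - 1
         if (II .lt. Istart .or. II .ge. Iend) then
            noff = noff + 1      ! this row is owned by another process
         end if
      end do
      write(*,*) 'rank ', myid, ': rows set off-process = ', noff

If noff comes out nonzero on any rank, the loop bounds ksta_p/kend_p no longer match the matrix layout, and the corresponding rows should be generated on the rank that owns them (or the local sizes passed to MatCreateMPIAIJ explicitly instead of PETSC_DECIDE).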
>>>>>>> >>>>>>> Anyway, most of the serial code has been updated and maybe it's >>>>>>> >>> easier >>> >>>>>>> >>>>> to >>>>> >>>>> >>>>> >>>>>>> convert the new serial code instead of debugging on the old >>>>>>> parallel >>>>>>> >>>>>>> >>>>>>> >>>>> code >>>>> >>>>> >>>>> >>>>>>> now. I believe I can still reuse part of the old parallel code. >>>>>>> >>> However, >>> >>>>>>> >>>>> I >>>>> >>>>> >>>>> >>>>>>> hope I can approach it better this time. >>>>>>> >>>>>>> So supposed I need to start converting my new serial code to >>>>>>> >>> parallel. >>> >>>>>>> There's 2 eqns to be solved using PETSc, the momentum and >>>>>>> poisson. I >>>>>>> >>>>>>> >>>>>>> >>>>> also >>>>> >>>>> >>>>> >>>>>>> need to parallelize other parts of my code. I wonder which >>>>>>> route is >>>>>>> >>> the >>> >>>>>>> best: >>>>>>> >>>>>>> 1. Don't change the PETSc part ie continue using >>>>>>> PETSC_COMM_SELF, >>>>>>> >>>>>>> >>>>>>> >>>>> modify >>>>> >>>>> >>>>> >>>>>>> other parts of my code to parallel e.g. looping, updating of >>>>>>> values >>>>>>> >>> etc. >>> >>>>>>> Once the execution is fine and speedup is reasonable, then >>>>>>> modify >>>>>>> >>> the >>> >>>>>>> >>>>> PETSc >>>>> >>>>> >>>>> >>>>>>> part - poisson eqn 1st followed by the momentum eqn. >>>>>>> >>>>>>> 2. Reverse the above order ie modify the PETSc part - poisson >>>>>>> eqn >>>>>>> >>> 1st >>> >>>>>>> followed by the momentum eqn. Then do other parts of my code. >>>>>>> >>>>>>> I'm not sure if the above 2 mtds can work or if there will be >>>>>>> >>>>>>> >>>>>>> >>>>> conflicts. Of >>>>> >>>>> >>>>> >>>>>>> course, an alternative will be: >>>>>>> >>>>>>> 3. Do the poisson, momentum eqns and other parts of the code >>>>>>> >>>>>>> >>>>>>> >>>>> separately. >>>>> >>>>> >>>>> >>>>>>> That is, code a standalone parallel poisson eqn and use samples >>>>>>> >>> values >>> >>>>>>> >>>>> to >>>>> >>>>> >>>>> >>>>>>> test it. Same for the momentum and other parts of the code. When >>>>>>> >>> each of >>> >>>>>>> them is working, combine them to form the full parallel code. >>>>>>> >>> However, >>> >>>>>>> >>>>> this >>>>> >>>>> >>>>> >>>>>>> will be much more troublesome. >>>>>>> >>>>>>> I hope someone can give me some recommendations. >>>>>>> >>>>>>> Thank you once again. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Matthew Knepley wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> 1) There is no way to have any idea what is going on in your >>>>>>>> code >>>>>>>> without -log_summary output >>>>>>>> >>>>>>>> 2) Looking at that output, look at the percentage taken by the >>>>>>>> >>> solver >>> >>>>>>>> KSPSolve event. I suspect it is not the biggest component, >>>>>>>> >>> because >>> >>>>>>>> it is very scalable. >>>>>>>> >>>>>>>> Matt >>>>>>>> >>>>>>>> On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay >>>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> I've a serial 2D CFD code. As my grid size requirement >>>>>>>>> >>> increases, >>> >>>>>>>>> >>>>> the >>>>> >>>>> >>>>> >>>>>>>>> simulation takes longer. Also, memory requirement becomes a >>>>>>>>> >>> problem. >>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> Grid >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> size 've reached 1200x1200. Going higher is not possible due >>>>>>>>> to >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>> memory >>>>> >>>>> >>>>> >>>>>>>>> problem. >>>>>>>>> >>>>>>>>> I tried to convert my code to a parallel one, following the >>>>>>>>> >>> examples >>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> given. 
>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> I also need to restructure parts of my code to enable parallel >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>> looping. >>>>> >>>>> >>>>> >>>>>>>>> >>>>>>> I >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> 1st changed the PETSc solver to be parallel enabled and then I >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> restructured >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> parts of my code. I proceed on as longer as the answer for a >>>>>>>>> >>> simple >>> >>>>>>>>> >>>>> test >>>>> >>>>> >>>>> >>>>>>>>> case is correct. I thought it's not really possible to do any >>>>>>>>> >>> speed >>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> testing >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> since the code is not fully parallelized yet. When I finished >>>>>>>>> >>> during >>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> most of >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> the conversion, I found that in the actual run that it is much >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>> slower, >>>>> >>>>> >>>>> >>>>>>>>> although the answer is correct. >>>>>>>>> >>>>>>>>> So what is the remedy now? I wonder what I should do to check >>>>>>>>> >>> what's >>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> wrong. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> Must I restart everything again? Btw, my grid size is >>>>>>>>> 1200x1200. >>>>>>>>> >>> I >>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> believed >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> it should be suitable for parallel run of 4 processors? Is >>>>>>>>> that >>>>>>>>> >>> so? >>> >>>>>>>>> Thank you. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> >>>> >>>> >>> >> >> >> >> > From zonexo at gmail.com Tue Apr 15 11:44:17 2008 From: zonexo at gmail.com (Ben Tay) Date: Wed, 16 Apr 2008 00:44:17 +0800 Subject: Slow speed after changing from serial to parallel In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> Message-ID: <4804DB61.3080906@gmail.com> Hi, Here's the summary for 1 processor. Seems like it's also using a long time... Can someone tell me when my mistakes possibly lie? Thank you very much! ************************************************************************************************************************ *** WIDEN YOUR WINDOW TO 120 CHARACTERS. 
Use 'enscript -r -fCourier9' to print this document *** ************************************************************************************************************************ ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 Wed Apr 16 00:39:22 2008 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b Max Max/Min Avg Total Time (sec): 1.088e+03 1.00000 1.088e+03 Objects: 4.300e+01 1.00000 4.300e+01 Flops: 2.658e+11 1.00000 2.658e+11 2.658e+11 Flops/sec: 2.444e+08 1.00000 2.444e+08 2.444e+08 MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Reductions: 1.460e+04 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 1.0877e+03 100.0% 2.6584e+11 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 1.460e+04 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops/sec: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). %T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ ########################################################## # # # WARNING!!! # # # # This code was run without the PreLoadBegin() # # macros. To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. # # preloading. otherwise timing numbers may be # # meaningless. 
# ########################################################## Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 7412 1.0 1.3344e+02 1.0 2.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 11 0 0 0 12 11 0 0 0 216 MatSolve 7413 1.0 2.6851e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 0.0e+00 25 11 0 0 0 25 11 0 0 0 107 MatLUFactorNum 1 1.0 4.3947e-02 1.0 8.83e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 88 MatILUFactorSym 1 1.0 3.7798e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyBegin 1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyEnd 1 1.0 2.5835e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 1 1.0 6.0391e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatZeroEntries 1 1.0 1.7377e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPGMRESOrthog 7173 1.0 5.6323e+02 1.0 3.41e+08 1.0 0.0e+00 0.0e+00 7.2e+03 52 72 0 0 49 52 72 0 0 49 341 KSPSetup 1 1.0 1.2676e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 1 1.0 1.0144e+03 1.0 2.62e+08 1.0 0.0e+00 0.0e+00 1.5e+04 93100 0 0100 93100 0 0100 262 PCSetUp 1 1.0 8.7809e-02 1.0 4.42e+07 1.0 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 44 PCApply 7413 1.0 2.6853e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 0.0e+00 25 11 0 0 0 25 11 0 0 0 107 VecMDot 7173 1.0 2.6720e+02 1.0 3.59e+08 1.0 0.0e+00 0.0e+00 7.2e+03 25 36 0 0 49 25 36 0 0 49 359 VecNorm 7413 1.0 1.7125e+01 1.0 3.74e+08 1.0 0.0e+00 0.0e+00 7.4e+03 2 2 0 0 51 2 2 0 0 51 374 VecScale 7413 1.0 9.2787e+00 1.0 3.45e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1 1 0 0 0 345 VecCopy 240 1.0 5.1628e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 241 1.0 6.4428e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAXPY 479 1.0 2.0082e+00 1.0 2.06e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 206 VecMAXPY 7413 1.0 3.1536e+02 1.0 3.24e+08 1.0 0.0e+00 0.0e+00 0.0e+00 29 38 0 0 0 29 38 0 0 0 324 VecAssemblyBegin 2 1.0 2.3127e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAssemblyEnd 2 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecNormalize 7413 1.0 2.6424e+01 1.0 3.64e+08 1.0 0.0e+00 0.0e+00 7.4e+03 2 4 0 0 51 2 4 0 0 51 364 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. 
--- Event Stage 0: Main Stage Matrix 2 2 65632332 0 Krylov Solver 1 1 17216 0 Preconditioner 1 1 168 0 Index Set 3 3 5185032 0 Vec 36 36 120987640 0 ======================================================================================================================== Average time to get PetscTime(): 3.09944e-07 OptionTable: -log_summary Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 Configure run at: Tue Jan 8 22:22:08 2008 sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 Configure run at: Tue Jan 8 22:22:08 2008 Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi-shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with-mpirun=/usr/local/topspin/mpi/mpich/bi n/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 ----------------------------------------- Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 Using PETSc arch: atlas3-mpi ----------------------------------------- Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. -fPIC -O ----------------------------------------- Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include - I/home/enduser/g0306332/lib/hypre/include -I/usr/local/topspin/mpi/mpich/include ------------------------------------------ Using C linker: mpicc -fPIC -O Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib -L/usr/local/topspin/mpi/mpich/lib -lmpich -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc ------------------------------------------ 639.52user 4.80system 18:08.23elapsed 59%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (20major+172979minor)pagefaults 0swaps Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME ===== ========== ================ ======================= =================== 00000 atlas3-c45 time ./a.out -lo Done 04/16/2008 00:39:23 Barry Smith wrote: > > It is taking 8776 iterations of GMRES! How many does it take on one > process? This is a huge > amount. 
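(For the iteration-count comparison asked about here, a minimal Fortran sketch, not taken from the code in this thread: KSPGetIterationNumber reports how many iterations the last KSPSolve took, so the same print statement can be used in the 1-process and 2-process runs. The variable its is a placeholder integer/PetscInt; running with -ksp_monitor gives the residual history as well.)

      PetscInt its

      call KSPGetIterationNumber(ksp, its, ierr)
      write(*,*) 'rank ', myid, ': KSP iterations = ', its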
> > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 > 4.8e+03 0.0e+00 10 11100100 0 10 11100100 0 217 > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 > 0.0e+00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > > One process is spending 2.9 times as long in the embarresingly > parallel MatSolve then the other process; > this indicates a huge imbalance in the number of nonzeros on each > process. As Matt noticed, the partitioning > between the two processes is terrible. > > Barry > > On Apr 15, 2008, at 10:56 AM, Ben Tay wrote: >> Oh sorry here's the whole information. I'm using 2 processors currently: >> >> ************************************************************************************************************************ >> >> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript >> -r -fCourier9' to print this document *** >> ************************************************************************************************************************ >> >> >> ---------------------------------------------- PETSc Performance >> Summary: ---------------------------------------------- >> >> ./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by >> g0306332 Tue Apr 15 23:03:09 2008 >> Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST >> 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b >> >> Max Max/Min Avg Total >> Time (sec): 1.114e+03 1.00054 1.114e+03 >> Objects: 5.400e+01 1.00000 5.400e+01 >> Flops: 1.574e+11 1.00000 1.574e+11 3.147e+11 >> Flops/sec: 1.414e+08 1.00054 1.413e+08 2.826e+08 >> MPI Messages: 8.777e+03 1.00000 8.777e+03 1.755e+04 >> MPI Message Lengths: 4.213e+07 1.00000 4.800e+03 8.425e+07 >> MPI Reductions: 8.644e+03 1.00000 >> >> Flop counting convention: 1 flop = 1 real number operation of type >> (multiply/divide/add/subtract) >> e.g., VecAXPY() for real vectors of length >> N --> 2N flops >> and VecAXPY() for complex vectors of length >> N --> 8N flops >> >> Summary of Stages: ----- Time ------ ----- Flops ----- --- >> Messages --- -- Message Lengths -- -- Reductions -- >> Avg %Total Avg %Total counts >> %Total Avg %Total counts %Total >> 0: Main Stage: 1.1136e+03 100.0% 3.1475e+11 100.0% 1.755e+04 >> 100.0% 4.800e+03 100.0% 1.729e+04 100.0% >> >> ------------------------------------------------------------------------------------------------------------------------ >> >> See the 'Profiling' chapter of the users' manual for details on >> interpreting output. >> Phase summary info: >> Count: number of times phase was executed >> Time and Flops/sec: Max - maximum over all processors >> Ratio - ratio of maximum to minimum over all >> processors >> Mess: number of messages sent >> Avg. len: average message length >> Reduct: number of global reductions >> Global: entire computation >> Stage: stages of a computation. Set stages with PetscLogStagePush() >> and PetscLogStagePop(). >> %T - percent time in this phase %F - percent flops in >> this phase >> %M - percent messages in this phase %L - percent message >> lengths in this phase >> %R - percent reductions in this phase >> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time >> over all processors) >> ------------------------------------------------------------------------------------------------------------------------ >> >> >> >> ########################################################## >> # # >> # WARNING!!! # >> # # >> # This code was run without the PreLoadBegin() # >> # macros. To get timing results we always recommend # >> # preloading. 
otherwise timing numbers may be # >> # meaningless. # >> ########################################################## >> >> >> Event Count Time (sec) >> Flops/sec --- Global --- --- Stage --- Total >> Max Ratio Max Ratio Max Ratio Mess Avg >> len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s >> ------------------------------------------------------------------------------------------------------------------------ >> >> >> --- Event Stage 0: Main Stage >> >> MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 >> 4.8e+03 0.0e+00 10 11100100 0 10 11100100 0 217 >> MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 >> 0.0e+00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 >> MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 >> MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 >> 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 >> 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0 >> MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 >> 2.4e+03 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 >> 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 >> 0.0e+00 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 >> KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 >> 4.8e+03 1.7e+04 89100100100100 89100100100100 317 >> PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 >> 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >> PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 >> 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >> PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 >> 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 >> VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 >> 0.0e+00 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 >> VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 >> 0.0e+00 8.8e+03 9 2 0 0 51 9 2 0 0 51 42 >> VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 >> 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 636 >> VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 >> VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 >> 0.0e+00 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 >> VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 >> 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 >> 4.8e+03 0.0e+00 0 0100100 0 0 0100100 0 0 >> VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 >> 0.0e+00 8.8e+03 9 4 0 0 51 9 4 0 0 51 62 >> ------------------------------------------------------------------------------------------------------------------------ >> >> >> Memory usage is given in bytes: >> >> Object Type Creations Destructions Memory Descendants' >> Mem. 
>> >> --- Event Stage 0: Main Stage >> >> Matrix 4 4 49227380 0 >> Krylov Solver 2 2 17216 0 >> Preconditioner 2 2 256 0 >> Index Set 5 5 2596120 0 >> Vec 40 40 62243224 0 >> Vec Scatter 1 1 0 0 >> ======================================================================================================================== >> >> Average time to get PetscTime(): 4.05312e-07 >> Average time for MPI_Barrier(): 7.62939e-07 >> Average time for zero size MPI_Send(): 2.02656e-06 >> OptionTable: -log_summary >> Compiled without FORTRAN kernels >> Compiled with full precision matrices (default) >> Compiled without FORTRAN kernels >> Compiled with full precision matrices (default) >> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 >> sizeof(PetscScalar) 8 >> Configure run at: Tue Jan 8 22:22:08 2008 >> Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 >> --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 >> --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 >> --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-vendor-compilers=intel >> --with-x=0 --with-hypre-dir=/home/enduser/g0306332/lib/hypre >> --with-debugging=0 --with-batch=1 --with-mpi-shared=0 >> --with-mpi-include=/usr/local/topspin/mpi/mpich/include >> --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a >> --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun >> --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 >> ----------------------------------------- >> Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 >> Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed >> Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux >> Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 >> Using PETSc arch: atlas3-mpi >> ----------------------------------------- >> Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. >> -fPIC -O ----------------------------------------- >> Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 >> -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi >> -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include - >> I/home/enduser/g0306332/lib/hypre/include >> -I/usr/local/topspin/mpi/mpich/include >> ------------------------------------------ >> Using C linker: mpicc -fPIC -O >> Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: >> -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi >> -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts >> -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc >> -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib >> -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE >> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard >> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib >> -L/usr/local/topspin/mpi/mpich/lib -lmpich >> -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t >> -L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide >> -lpthread -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib >> -ldl -lmpich -libverbs -libumad -lpthread -lrt >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 >> -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc >> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/intel/fce/9.1.045/lib >> -L/opt/intel/fce/9.1.045/lib -lifport -lifcore -lm >> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard >> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard >> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich >> -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs >> -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -L/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 >> -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc >> ------------------------------------------ >> 1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata >> 0maxresident)k >> 0inputs+0outputs (28major+153248minor)pagefaults 0swaps >> 387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata >> 0maxresident)k >> 0inputs+0outputs (18major+158175minor)pagefaults 0swaps >> Job /usr/lsf62/bin/mvapich_wrapper time 
./a.out -log_summary >> TID HOST_NAME COMMAND_LINE >> STATUS TERMINATION_TIME >> ===== ========== ================ ======================= >> =================== >> 00000 atlas3-c05 time ./a.out -lo Done >> 04/15/2008 23:03:10 >> 00001 atlas3-c05 time ./a.out -lo Done >> 04/15/2008 23:03:10 >> >> >> I have a cartesian grid 600x720. Since there's 2 processors, it is >> partitioned to 600x360. I just use: >> >> call >> MatCreateMPIAIJ(MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k,5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr) >> >> >> call MatSetFromOptions(A_mat,ierr) >> >> call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr) >> >> call KSPCreate(MPI_COMM_WORLD,ksp,ierr) >> >> call >> VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr) >> >> total_k is actually size_x*size_y. Since it's 2d, the maximum values >> per row is 5. When you says setting off-process values, do you mean I >> insert values from 1 processor into another? I thought I insert the >> values into the correct processor... >> >> Thank you very much! >> >> >> >> Matthew Knepley wrote: >>> 1) Please never cut out parts of the summary. All the information is >>> valuable, >>> and most times, necessary >>> >>> 2) You seem to have huge load imbalance (look at VecNorm). Do you >>> partition >>> the system yourself. How many processes is this? >>> >>> 3) You seem to be setting a huge number of off-process values in the >>> matrix >>> (see MatAssemblyBegin). Is this true? I would reorganize this part. >>> >>> Matt >>> >>> On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay wrote: >>> >>>> Hi, >>>> >>>> I have converted the poisson eqn part of the CFD code to parallel. >>>> The grid >>>> size tested is 600x720. For the momentum eqn, I used another serial >>>> linear >>>> solver (nspcg) to prevent mixing of results. 
Here's the output >>>> summary: >>>> >>>> --- Event Stage 0: Main Stage >>>> >>>> MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 >>>> 4.8e+03 >>>> 0.0e+00 10 11100100 0 10 11100100 0 217 >>>> MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 >>>> MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 >>>> MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0* >>>> MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 >>>> 2.4e+03 >>>> 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 >>>> 0.0e+00 >>>> 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 >>>> KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 >>>> 4.8e+03 >>>> 1.7e+04 89100100100100 89100100100100 317 >>>> PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 >>>> 0.0e+00 >>>> 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >>>> PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 >>>> 0.0e+00 >>>> 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >>>> PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 >>>> VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 >>>> 0.0e+00 >>>> 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 >>>> *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 >>>> 0.0e+00 >>>> 8.8e+03 9 2 0 0 51 9 2 0 0 51 42* >>>> *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 1 0 0 0 0 1 0 0 0 636* >>>> VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>>> VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 >>>> VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 >>>> VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 >>>> 4.8e+03 >>>> 0.0e+00 0 0100100 0 0 0100100 0 0* >>>> *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0* >>>> *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 >>>> 0.0e+00 >>>> 8.8e+03 9 4 0 0 51 9 4 0 0 51 62* >>>> >>>> ------------------------------------------------------------------------------------------------------------------------ >>>> >>>> Memory usage is given in bytes: >>>> Object Type Creations Destructions Memory >>>> Descendants' Mem. 
>>>> --- Event Stage 0: Main Stage >>>> Matrix 4 4 49227380 0 >>>> Krylov Solver 2 2 17216 0 >>>> Preconditioner 2 2 256 0 >>>> Index Set 5 5 2596120 0 >>>> Vec 40 40 62243224 0 >>>> Vec Scatter 1 1 0 0 >>>> ======================================================================================================================== >>>> >>>> Average time to get PetscTime(): 4.05312e-07 >>>> Average time >>>> for MPI_Barrier(): 7.62939e-07 >>>> Average time for zero size MPI_Send(): 2.02656e-06 >>>> OptionTable: -log_summary >>>> >>>> >>>> The PETSc manual states that ratio should be close to 1. There's >>>> quite a >>>> few *(in bold)* which are >1 and MatAssemblyBegin seems to be very >>>> big. So >>>> what could be the cause? >>>> >>>> I wonder if it has to do the way I insert the matrix. My steps are: >>>> (cartesian grids, i loop faster than j, fortran) >>>> >>>> For matrix A and rhs >>>> >>>> Insert left extreme cells values belonging to myid >>>> >>>> if (myid==0) then >>>> >>>> insert corner cells values >>>> >>>> insert south cells values >>>> >>>> insert internal cells values >>>> >>>> else if (myid==num_procs-1) then >>>> >>>> insert corner cells values >>>> >>>> insert north cells values >>>> >>>> insert internal cells values >>>> >>>> else >>>> >>>> insert internal cells values >>>> >>>> end if >>>> >>>> Insert right extreme cells values belonging to myid >>>> >>>> All these values are entered into a big_A(size_x*size_y,5) matrix. >>>> int_A >>>> stores the position of the values. I then do >>>> >>>> call MatZeroEntries(A_mat,ierr) >>>> >>>> do k=ksta_p+1,kend_p !for cells belonging to myid >>>> >>>> do kk=1,5 >>>> >>>> II=k-1 >>>> >>>> JJ=int_A(k,kk)-1 >>>> >>>> call >>>> MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr) >>>> end do >>>> >>>> end do >>>> >>>> call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr) >>>> >>>> call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr) >>>> >>>> >>>> I wonder if the problem lies here.I used the big_A matrix because I >>>> was >>>> migrating from an old linear solver. Lastly, I was told to widen my >>>> window >>>> to 120 characters. May I know how do I do it? >>>> >>>> >>>> >>>> Thank you very much. >>>> >>>> Matthew Knepley wrote: >>>> >>>> >>>>> On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay wrote: >>>>> >>>>> >>>>> >>>>>> Hi Matthew, >>>>>> >>>>>> I think you've misunderstood what I meant. What I'm trying to say is >>>>>> initially I've got a serial code. I tried to convert to a >>>>>> parallel one. >>>>>> >>>> Then >>>> >>>>>> I tested it and it was pretty slow. Due to some work requirement, >>>>>> I need >>>>>> >>>> to >>>> >>>>>> go back to make some changes to my code. Since the parallel is not >>>>>> >>>> working >>>> >>>>>> well, I updated and changed the serial one. >>>>>> >>>>>> Well, that was a while ago and now, due to the updates and >>>>>> changes, the >>>>>> serial code is different from the old converted parallel code. Some >>>>>> >>>> files >>>> >>>>>> were also deleted and I can't seem to get it working now. So I >>>>>> thought I >>>>>> might as well convert the new serial code to parallel. But I'm >>>>>> not very >>>>>> >>>> sure >>>> >>>>>> what I should do 1st. >>>>>> >>>>>> Maybe I should rephrase my question in that if I just convert my >>>>>> >>>> poisson >>>> >>>>>> equation subroutine from a serial PETSc to a parallel PETSc version, >>>>>> >>>> will it >>>> >>>>>> work? Should I expect a speedup? The rest of my code is still >>>>>> serial. 
>>>>>> >>>>>> >>>>>> >>>>> You should, of course, only expect speedup in the parallel parts >>>>> >>>>> Matt >>>>> >>>>> >>>>> >>>>> >>>>>> Thank you very much. >>>>>> >>>>>> >>>>>> >>>>>> Matthew Knepley wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> I am not sure why you would ever have two codes. I never do this. >>>>>>> >>>> PETSc >>>> >>>>>>> is designed to write one code to run in serial and parallel. The >>>>>>> PETSc >>>>>>> >>>>>>> >>>>>>> >>>>>> part >>>>>> >>>>>> >>>>>> >>>>>>> should look identical. To test, run the code yo uhave verified in >>>>>>> >>>> serial >>>> >>>>>>> >>>>>> and >>>>>> >>>>>> >>>>>> >>>>>>> output PETSc data structures (like Mat and Vec) using a binary >>>>>>> viewer. >>>>>>> Then run in parallel with the same code, which will output the same >>>>>>> structures. Take the two files and write a small verification code >>>>>>> >>>> that >>>> >>>>>>> loads both versions and calls MatEqual and VecEqual. >>>>>>> >>>>>>> Matt >>>>>>> >>>>>>> On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Thank you Matthew. Sorry to trouble you again. >>>>>>>> >>>>>>>> I tried to run it with -log_summary output and I found that >>>>>>>> there's >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> some >>>>>> >>>>>> >>>>>> >>>>>>>> errors in the execution. Well, I was busy with other things and I >>>>>>>> >>>> just >>>> >>>>>>>> >>>>>> came >>>>>> >>>>>> >>>>>> >>>>>>>> back to this problem. Some of my files on the server has also been >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> deleted. >>>>>> >>>>>> >>>>>> >>>>>>>> It has been a while and I remember that it worked before, only >>>>>>>> >>>> much >>>> >>>>>>>> slower. >>>>>>>> >>>>>>>> Anyway, most of the serial code has been updated and maybe it's >>>>>>>> >>>> easier >>>> >>>>>>>> >>>>>> to >>>>>> >>>>>> >>>>>> >>>>>>>> convert the new serial code instead of debugging on the old >>>>>>>> parallel >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> code >>>>>> >>>>>> >>>>>> >>>>>>>> now. I believe I can still reuse part of the old parallel code. >>>>>>>> >>>> However, >>>> >>>>>>>> >>>>>> I >>>>>> >>>>>> >>>>>> >>>>>>>> hope I can approach it better this time. >>>>>>>> >>>>>>>> So supposed I need to start converting my new serial code to >>>>>>>> >>>> parallel. >>>> >>>>>>>> There's 2 eqns to be solved using PETSc, the momentum and >>>>>>>> poisson. I >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> also >>>>>> >>>>>> >>>>>> >>>>>>>> need to parallelize other parts of my code. I wonder which >>>>>>>> route is >>>>>>>> >>>> the >>>> >>>>>>>> best: >>>>>>>> >>>>>>>> 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> modify >>>>>> >>>>>> >>>>>> >>>>>>>> other parts of my code to parallel e.g. looping, updating of >>>>>>>> values >>>>>>>> >>>> etc. >>>> >>>>>>>> Once the execution is fine and speedup is reasonable, then modify >>>>>>>> >>>> the >>>> >>>>>>>> >>>>>> PETSc >>>>>> >>>>>> >>>>>> >>>>>>>> part - poisson eqn 1st followed by the momentum eqn. >>>>>>>> >>>>>>>> 2. Reverse the above order ie modify the PETSc part - poisson eqn >>>>>>>> >>>> 1st >>>> >>>>>>>> followed by the momentum eqn. Then do other parts of my code. >>>>>>>> >>>>>>>> I'm not sure if the above 2 mtds can work or if there will be >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> conflicts. Of >>>>>> >>>>>> >>>>>> >>>>>>>> course, an alternative will be: >>>>>>>> >>>>>>>> 3. Do the poisson, momentum eqns and other parts of the code >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> separately. 
>>>>>> >>>>>> >>>>>> >>>>>>>> That is, code a standalone parallel poisson eqn and use samples >>>>>>>> >>>> values >>>> >>>>>>>> >>>>>> to >>>>>> >>>>>> >>>>>> >>>>>>>> test it. Same for the momentum and other parts of the code. When >>>>>>>> >>>> each of >>>> >>>>>>>> them is working, combine them to form the full parallel code. >>>>>>>> >>>> However, >>>> >>>>>>>> >>>>>> this >>>>>> >>>>>> >>>>>> >>>>>>>> will be much more troublesome. >>>>>>>> >>>>>>>> I hope someone can give me some recommendations. >>>>>>>> >>>>>>>> Thank you once again. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Matthew Knepley wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> 1) There is no way to have any idea what is going on in your code >>>>>>>>> without -log_summary output >>>>>>>>> >>>>>>>>> 2) Looking at that output, look at the percentage taken by the >>>>>>>>> >>>> solver >>>> >>>>>>>>> KSPSolve event. I suspect it is not the biggest component, >>>>>>>>> >>>> because >>>> >>>>>>>>> it is very scalable. >>>>>>>>> >>>>>>>>> Matt >>>>>>>>> >>>>>>>>> On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I've a serial 2D CFD code. As my grid size requirement >>>>>>>>>> >>>> increases, >>>> >>>>>>>>>> >>>>>> the >>>>>> >>>>>> >>>>>> >>>>>>>>>> simulation takes longer. Also, memory requirement becomes a >>>>>>>>>> >>>> problem. >>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> Grid >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> size 've reached 1200x1200. Going higher is not possible due to >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>> memory >>>>>> >>>>>> >>>>>> >>>>>>>>>> problem. >>>>>>>>>> >>>>>>>>>> I tried to convert my code to a parallel one, following the >>>>>>>>>> >>>> examples >>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> given. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> I also need to restructure parts of my code to enable parallel >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>> looping. >>>>>> >>>>>> >>>>>> >>>>>>>>>> >>>>>>>> I >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> 1st changed the PETSc solver to be parallel enabled and then I >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> restructured >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> parts of my code. I proceed on as longer as the answer for a >>>>>>>>>> >>>> simple >>>> >>>>>>>>>> >>>>>> test >>>>>> >>>>>> >>>>>> >>>>>>>>>> case is correct. I thought it's not really possible to do any >>>>>>>>>> >>>> speed >>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> testing >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> since the code is not fully parallelized yet. When I finished >>>>>>>>>> >>>> during >>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> most of >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> the conversion, I found that in the actual run that it is much >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>> slower, >>>>>> >>>>>> >>>>>> >>>>>>>>>> although the answer is correct. >>>>>>>>>> >>>>>>>>>> So what is the remedy now? I wonder what I should do to check >>>>>>>>>> >>>> what's >>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> wrong. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> Must I restart everything again? Btw, my grid size is 1200x1200. >>>>>>>>>> >>>> I >>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> believed >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> it should be suitable for parallel run of 4 processors? Is that >>>>>>>>>> >>>> so? 
>>>> >>>>>>>>>> Thank you. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >>> >>> >> > > From knepley at gmail.com Tue Apr 15 12:33:46 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 12:33:46 -0500 Subject: Slow speed after changing from serial to parallel In-Reply-To: <4804DB61.3080906@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> Message-ID: The convergence here is jsut horrendous. Have you tried using LU to check your implementation? All the time is in the solve right now. I would first try a direct method (at least on a small problem) and then try to understand the convergence behavior. MUMPS can actually scale very well for big problems. Matt On Tue, Apr 15, 2008 at 11:44 AM, Ben Tay wrote: > Hi, > > Here's the summary for 1 processor. Seems like it's also using a long > time... Can someone tell me when my mistakes possibly lie? Thank you very > much! > > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 Wed > Apr 16 00:39:22 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.088e+03 1.00000 1.088e+03 > Objects: 4.300e+01 1.00000 4.300e+01 > Flops: 2.658e+11 1.00000 2.658e+11 2.658e+11 > Flops/sec: 2.444e+08 1.00000 2.444e+08 2.444e+08 > MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Reductions: 1.460e+04 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N --> > 2N flops > and VecAXPY() for complex vectors of length N --> > 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 1.0877e+03 100.0% 2.6584e+11 100.0% 0.000e+00 0.0% > 0.000e+00 0.0% 1.460e+04 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). 
> %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths in > this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. # > # preloading. otherwise timing numbers may be # > # meaningless. # > ########################################################## > > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 7412 1.0 1.3344e+02 1.0 2.16e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 12 11 0 0 0 12 11 0 0 0 216 > MatSolve 7413 1.0 2.6851e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 107 > MatLUFactorNum 1 1.0 4.3947e-02 1.0 8.83e+07 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 88 > MatILUFactorSym 1 1.0 3.7798e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 2.5835e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 6.0391e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatZeroEntries 1 1.0 1.7377e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPGMRESOrthog 7173 1.0 5.6323e+02 1.0 3.41e+08 1.0 0.0e+00 0.0e+00 > 7.2e+03 52 72 0 0 49 52 72 0 0 49 341 > KSPSetup 1 1.0 1.2676e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 1.0144e+03 1.0 2.62e+08 1.0 0.0e+00 0.0e+00 > 1.5e+04 93100 0 0100 93100 0 0100 262 > PCSetUp 1 1.0 8.7809e-02 1.0 4.42e+07 1.0 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 44 > PCApply 7413 1.0 2.6853e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 107 > VecMDot 7173 1.0 2.6720e+02 1.0 3.59e+08 1.0 0.0e+00 0.0e+00 > 7.2e+03 25 36 0 0 49 25 36 0 0 49 359 > VecNorm 7413 1.0 1.7125e+01 1.0 3.74e+08 1.0 0.0e+00 0.0e+00 > 7.4e+03 2 2 0 0 51 2 2 0 0 51 374 > VecScale 7413 1.0 9.2787e+00 1.0 3.45e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 1 0 0 0 1 1 0 0 0 345 > VecCopy 240 1.0 5.1628e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 241 1.0 6.4428e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 479 1.0 2.0082e+00 1.0 2.06e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 206 > VecMAXPY 7413 1.0 3.1536e+02 1.0 3.24e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 29 38 0 0 0 29 38 0 0 0 324 > VecAssemblyBegin 2 1.0 2.3127e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAssemblyEnd 2 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecNormalize 7413 1.0 2.6424e+01 1.0 3.64e+08 1.0 0.0e+00 0.0e+00 > 7.4e+03 2 4 0 0 51 2 4 0 0 51 364 > > 
------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. > > --- Event Stage 0: Main Stage > > Matrix 2 2 65632332 0 > Krylov Solver 1 1 17216 0 > Preconditioner 1 1 168 0 > Index Set 3 3 5185032 0 > Vec 36 36 120987640 0 > > ======================================================================================================================== > Average time to get PetscTime(): 3.09944e-07 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 Configure > options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 > --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 > --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 > --with-vendor-compilers=intel --with-x=0 > --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 > --with-batch=1 --with-mpi-shared=0 > --with-mpi-include=/usr/local/topspin/mpi/mpich/include > --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a > --with-mpirun=/usr/local/topspin/mpi/mpich/bi > n/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t > --with-shared=0 ----------------------------------------- > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 > 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 > Using PETSc arch: atlas3-mpi > ----------------------------------------- > Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. -fPIC > -O ----------------------------------------- > Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include - > I/home/enduser/g0306332/lib/hypre/include > -I/usr/local/topspin/mpi/mpich/include > ------------------------------------------ > Using C linker: mpicc -fPIC -O > Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: > -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi > -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib > -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib > -L/usr/local/topspin/mpi/mpich/lib -lmpich > -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t > -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 > -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt > -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport > -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl > -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs > -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc > ------------------------------------------ > 639.52user 4.80system 18:08.23elapsed 59%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (20major+172979minor)pagefaults 0swaps > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > TID HOST_NAME COMMAND_LINE STATUS > TERMINATION_TIME > ===== ========== ================ ======================= > =================== > 00000 atlas3-c45 time ./a.out -lo Done 04/16/2008 > 00:39:23 > > > Barry Smith 
wrote: > > > > > It is taking 8776 iterations of GMRES! How many does it take on one > process? This is a huge > > amount. > > > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 > 0.0e+00 10 11100100 0 10 11100100 0 217 > > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > > > > One process is spending 2.9 times as long in the embarresingly parallel > MatSolve then the other process; > > this indicates a huge imbalance in the number of nonzeros on each process. > As Matt noticed, the partitioning > > between the two processes is terrible. > > > > Barry > > > > On Apr 15, 2008, at 10:56 AM, Ben Tay wrote: > > > > > Oh sorry here's the whole information. I'm using 2 processors currently: > > > > > > > ************************************************************************************************************************ > > > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r > -fCourier9' to print this document *** > > > > ************************************************************************************************************************ > > > > > > ---------------------------------------------- PETSc Performance > Summary: ---------------------------------------------- > > > > > > ./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by g0306332 > Tue Apr 15 23:03:09 2008 > > > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 > HG revision: 414581156e67e55c761739b0deb119f7590d0f4b > > > > > > Max Max/Min Avg Total > > > Time (sec): 1.114e+03 1.00054 1.114e+03 > > > Objects: 5.400e+01 1.00000 5.400e+01 > > > Flops: 1.574e+11 1.00000 1.574e+11 3.147e+11 > > > Flops/sec: 1.414e+08 1.00054 1.413e+08 2.826e+08 > > > MPI Messages: 8.777e+03 1.00000 8.777e+03 1.755e+04 > > > MPI Message Lengths: 4.213e+07 1.00000 4.800e+03 8.425e+07 > > > MPI Reductions: 8.644e+03 1.00000 > > > > > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > > > e.g., VecAXPY() for real vectors of length N > --> 2N flops > > > and VecAXPY() for complex vectors of length N > --> 8N flops > > > > > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages > --- -- Message Lengths -- -- Reductions -- > > > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > > > 0: Main Stage: 1.1136e+03 100.0% 3.1475e+11 100.0% 1.755e+04 > 100.0% 4.800e+03 100.0% 1.729e+04 100.0% > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > > > Phase summary info: > > > Count: number of times phase was executed > > > Time and Flops/sec: Max - maximum over all processors > > > Ratio - ratio of maximum to minimum over all > processors > > > Mess: number of messages sent > > > Avg. len: average message length > > > Reduct: number of global reductions > > > Global: entire computation > > > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). 
> > > %T - percent time in this phase %F - percent flops in this > phase > > > %M - percent messages in this phase %L - percent message lengths > in this phase > > > %R - percent reductions in this phase > > > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time > over all processors) > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > > > > ########################################################## > > > # # > > > # WARNING!!! # > > > # # > > > # This code was run without the PreLoadBegin() # > > > # macros. To get timing results we always recommend # > > > # preloading. otherwise timing numbers may be # > > > # meaningless. # > > > ########################################################## > > > > > > > > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > > > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > --- Event Stage 0: Main Stage > > > > > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 > 0.0e+00 10 11100100 0 10 11100100 0 217 > > > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > > > MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 > > > MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 > 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0 > > > MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 > 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 > > > KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 > 1.7e+04 89100100100100 89100100100100 317 > > > PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 > > > VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 > 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 > > > VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 > 8.8e+03 9 2 0 0 51 9 2 0 0 51 42 > > > VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 > 0.0e+00 0 1 0 0 0 0 1 0 0 0 636 > > > VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > > VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 > > > VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 > 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 > > > VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > 
> > VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0 > > > VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > > VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 > 8.8e+03 9 4 0 0 51 9 4 0 0 51 62 > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > Memory usage is given in bytes: > > > > > > Object Type Creations Destructions Memory Descendants' > Mem. > > > > > > --- Event Stage 0: Main Stage > > > > > > Matrix 4 4 49227380 0 > > > Krylov Solver 2 2 17216 0 > > > Preconditioner 2 2 256 0 > > > Index Set 5 5 2596120 0 > > > Vec 40 40 62243224 0 > > > Vec Scatter 1 1 0 0 > > > > ======================================================================================================================== > > > Average time to get PetscTime(): 4.05312e-07 > > > Average time for MPI_Barrier(): 7.62939e-07 > > > Average time for zero size MPI_Send(): 2.02656e-06 > > > OptionTable: -log_summary > > > Compiled without FORTRAN kernels > > > Compiled with full precision matrices (default) > > > Compiled without FORTRAN kernels Compiled > with full precision matrices (default) > > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > > > Configure run at: Tue Jan 8 22:22:08 2008 > > > Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 > --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 > --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 > --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 > --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 > --with-batch=1 --with-mpi-shared=0 > --with-mpi-include=/usr/local/topspin/mpi/mpich/include > --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a > --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun > --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 > > > ----------------------------------------- > > > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 > > > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul > 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > > > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 > > > Using PETSc arch: atlas3-mpi > > > ----------------------------------------- > > > Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. > -fPIC -O ----------------------------------------- > > > Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include - > > > I/home/enduser/g0306332/lib/hypre/include > -I/usr/local/topspin/mpi/mpich/include > ------------------------------------------ > > > Using C linker: mpicc -fPIC -O > > > Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: > -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi > -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib > -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib > -L/usr/local/topspin/mpi/mpich/lib -lmpich > -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t > -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 > -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt > -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport > -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl > -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs > -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc > > > ------------------------------------------ > > > 1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata > 0maxresident)k > > > 0inputs+0outputs (28major+153248minor)pagefaults 0swaps > > > 387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata > 0maxresident)k > > > 0inputs+0outputs (18major+158175minor)pagefaults 0swaps > > > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > > TID HOST_NAME COMMAND_LINE STATUS > TERMINATION_TIME 
> > > ===== ========== ================ ======================= > =================== > > > 00000 atlas3-c05 time ./a.out -lo Done 04/15/2008 > 23:03:10 > > > 00001 atlas3-c05 time ./a.out -lo Done 04/15/2008 > 23:03:10 > > > > > > > > > I have a cartesian grid 600x720. Since there's 2 processors, it is > partitioned to 600x360. I just use: > > > > > > call > MatCreateMPIAIJ(MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k,5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr) > > > > > > call MatSetFromOptions(A_mat,ierr) > > > > > > call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr) > > > > > > call KSPCreate(MPI_COMM_WORLD,ksp,ierr) > > > > > > call > VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr) > > > > > > total_k is actually size_x*size_y. Since it's 2d, the maximum values per > row is 5. When you says setting off-process values, do you mean I insert > values from 1 processor into another? I thought I insert the values into the > correct processor... > > > > > > Thank you very much! > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > 1) Please never cut out parts of the summary. All the information is > valuable, > > > > and most times, necessary > > > > > > > > 2) You seem to have huge load imbalance (look at VecNorm). Do you > partition > > > > the system yourself. How many processes is this? > > > > > > > > 3) You seem to be setting a huge number of off-process values in the > matrix > > > > (see MatAssemblyBegin). Is this true? I would reorganize this part. > > > > > > > > Matt > > > > > > > > On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay wrote: > > > > > > > > > > > > > Hi, > > > > > > > > > > I have converted the poisson eqn part of the CFD code to parallel. > The grid > > > > > size tested is 600x720. For the momentum eqn, I used another serial > linear > > > > > solver (nspcg) to prevent mixing of results. 
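A quick way to check Matt's third point for an assembly loop like the one quoted above is to test each global row index against the ownership range before inserting: any entry whose row is owned by another process is stashed locally and shipped to its owner inside MatAssemblyBegin, which is one common cause of a very large MatAssemblyBegin time (the other being one process simply reaching the assembly call much later than the other, the imbalance Barry points out). A minimal Fortran sketch, reusing the A_mat, ksta_p/kend_p, int_A and big_A names from the quoted code; Istart, Iend and n_off are new placeholder variables, and declarations and PETSc include files are omitted as in the quoted fragments:

      call MatGetOwnershipRange(A_mat, Istart, Iend, ierr)
      n_off = 0
      do k = ksta_p+1, kend_p
         II = k - 1                      ! global row index, 0-based
         if (II .lt. Istart .or. II .ge. Iend) then
            ! this row is owned by another process: MatSetValues will
            ! stash the entry and send it during MatAssemblyBegin
            n_off = n_off + 1
         end if
         do kk = 1, 5
            JJ = int_A(k,kk) - 1         ! global column index, 0-based
            call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr)
         end do
      end do
      print *, 'entries set in off-process rows on this rank: ', n_off
      call MatAssemblyBegin(A_mat, MAT_FINAL_ASSEMBLY, ierr)
      call MatAssemblyEnd(A_mat, MAT_FINAL_ASSEMBLY, ierr)

If n_off comes out zero on every rank, the stash is empty and the large MatAssemblyBegin entry in the log is waiting time rather than communication, i.e. the two processes are not generating their values in comparable time.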
Here's the output > summary: > > > > > > > > > > --- Event Stage 0: Main Stage > > > > > > > > > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 > 4.8e+03 > > > > > 0.0e+00 10 11100100 0 10 11100100 0 217 > > > > > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > > > > > MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 > > > > > MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 > 0.0e+00 > > > > > 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0* > > > > > MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 > 2.4e+03 > > > > > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 > 0.0e+00 > > > > > 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 > > > > > KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 > 4.8e+03 > > > > > 1.7e+04 89100100100100 89100100100100 317 > > > > > PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 > 0.0e+00 > > > > > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > > > PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 > 0.0e+00 > > > > > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > > > PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 > > > > > VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 > 0.0e+00 > > > > > 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 > > > > > *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 > 0.0e+00 > > > > > 8.8e+03 9 2 0 0 51 9 2 0 0 51 42* > > > > > *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 1 0 0 0 0 1 0 0 0 636* > > > > > VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > > > > VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 > > > > > VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 > > > > > VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 > 4.8e+03 > > > > > 0.0e+00 0 0100100 0 0 0100100 0 0* > > > > > *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0* > > > > > *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 > 0.0e+00 > > > > > 8.8e+03 9 4 0 0 51 9 4 0 0 51 62* > > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > Memory usage is given in bytes: > > > > > Object Type Creations Destructions Memory > Descendants' Mem. 
> > > > > --- Event Stage 0: Main Stage > > > > > Matrix 4 4 49227380 0 > > > > > Krylov Solver 2 2 17216 0 > > > > > Preconditioner 2 2 256 0 > > > > > Index Set 5 5 2596120 0 > > > > > Vec 40 40 62243224 0 > > > > > Vec Scatter 1 1 0 0 > > > > > > ======================================================================================================================== > > > > > Average time to get PetscTime(): 4.05312e-07 > Average time > > > > > for MPI_Barrier(): 7.62939e-07 > > > > > Average time for zero size MPI_Send(): 2.02656e-06 > > > > > OptionTable: -log_summary > > > > > > > > > > > > > > > The PETSc manual states that ratio should be close to 1. There's > quite a > > > > > few *(in bold)* which are >1 and MatAssemblyBegin seems to be very > big. So > > > > > what could be the cause? > > > > > > > > > > I wonder if it has to do the way I insert the matrix. My steps are: > > > > > (cartesian grids, i loop faster than j, fortran) > > > > > > > > > > For matrix A and rhs > > > > > > > > > > Insert left extreme cells values belonging to myid > > > > > > > > > > if (myid==0) then > > > > > > > > > > insert corner cells values > > > > > > > > > > insert south cells values > > > > > > > > > > insert internal cells values > > > > > > > > > > else if (myid==num_procs-1) then > > > > > > > > > > insert corner cells values > > > > > > > > > > insert north cells values > > > > > > > > > > insert internal cells values > > > > > > > > > > else > > > > > > > > > > insert internal cells values > > > > > > > > > > end if > > > > > > > > > > Insert right extreme cells values belonging to myid > > > > > > > > > > All these values are entered into a big_A(size_x*size_y,5) matrix. > int_A > > > > > stores the position of the values. I then do > > > > > > > > > > call MatZeroEntries(A_mat,ierr) > > > > > > > > > > do k=ksta_p+1,kend_p !for cells belonging to myid > > > > > > > > > > do kk=1,5 > > > > > > > > > > II=k-1 > > > > > > > > > > JJ=int_A(k,kk)-1 > > > > > > > > > > call > MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr) > > > > > end do > > > > > > > > > > end do > > > > > > > > > > call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr) > > > > > > > > > > call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr) > > > > > > > > > > > > > > > I wonder if the problem lies here.I used the big_A matrix because I > was > > > > > migrating from an old linear solver. Lastly, I was told to widen my > window > > > > > to 120 characters. May I know how do I do it? > > > > > > > > > > > > > > > > > > > > Thank you very much. > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > > > > > > > > > > > > On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Matthew, > > > > > > > > > > > > > > I think you've misunderstood what I meant. What I'm trying to > say is > > > > > > > initially I've got a serial code. I tried to convert to a > parallel one. > > > > > > > > > > > > > > > > > > > > > > > > > Then > > > > > > > > > > > > > > > > > > > > > > > I tested it and it was pretty slow. Due to some work > requirement, I need > > > > > > > > > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > > > > > > > > go back to make some changes to my code. Since the parallel is > not > > > > > > > > > > > > > > > > > > > > > > > > > working > > > > > > > > > > > > > > > > > > > > > > > well, I updated and changed the serial one. 
> > > > > > > Well, that was a while ago and now, due to the updates and changes, the
> > > > > > > serial code is different from the old converted parallel code. Some files
> > > > > > > were also deleted and I can't seem to get it working now. So I thought I
> > > > > > > might as well convert the new serial code to parallel. But I'm not very
> > > > > > > sure what I should do 1st.
> > > > > > >
> > > > > > > Maybe I should rephrase my question in that if I just convert my poisson
> > > > > > > equation subroutine from a serial PETSc to a parallel PETSc version, will
> > > > > > > it work? Should I expect a speedup? The rest of my code is still serial.
> > > > > >
> > > > > > You should, of course, only expect speedup in the parallel parts
> > > > > >
> > > > > >    Matt
> > > > > >
> > > > > > > Thank you very much.
> > > > > > >
> > > > > > > Matthew Knepley wrote:
> > > > > > > > I am not sure why you would ever have two codes. I never do this. PETSc
> > > > > > > > is designed to write one code to run in serial and parallel. The PETSc
> > > > > > > > part should look identical. To test, run the code you have verified in
> > > > > > > > serial and output PETSc data structures (like Mat and Vec) using a
> > > > > > > > binary viewer. Then run in parallel with the same code, which will
> > > > > > > > output the same structures. Take the two files and write a small
> > > > > > > > verification code that loads both versions and calls MatEqual and
> > > > > > > > VecEqual.
> > > > > > > >
> > > > > > > >    Matt
> > > > > > > >
> > > > > > > > On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote:
> > > > > > > > > Thank you Matthew. Sorry to trouble you again.
> > > > > > > > >
> > > > > > > > > I tried to run it with -log_summary output and I found that there's
> > > > > > > > > some errors in the execution. Well, I was busy with other things and
> > > > > > > > > I just came back to this problem. Some of my files on the server have
> > > > > > > > > also been deleted. It has been a while and I remember that it worked
> > > > > > > > > before, only much slower.
> > > > > > > > >
> > > > > > > > > Anyway, most of the serial code has been updated and maybe it's
> > > > > > > > > easier to convert the new serial code instead of debugging on the old
> > > > > > > > > parallel code now. I believe I can still reuse part of the old
> > > > > > > > > parallel code. However, I hope I can approach it better this time.
> > > > > > > > >
> > > > > > > > > So suppose I need to start converting my new serial code to parallel.
> > > > > > > > > There's 2 eqns to be solved using PETSc, the momentum and poisson. I
> > > > > > > > > also need to parallelize other parts of my code. I wonder which route
> > > > > > > > > is the best:
> > > > > > > > >
> > > > > > > > > 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF,
> > > > > > > > > modify other parts of my code to parallel e.g. looping, updating of
> > > > > > > > > values etc. Once the execution is fine and speedup is reasonable,
> > > > > > > > > then modify the PETSc part - poisson eqn 1st followed by the momentum
> > > > > > > > > eqn.
> > > > > > > > >
> > > > > > > > > 2. Reverse the above order ie modify the PETSc part - poisson eqn 1st
> > > > > > > > > followed by the momentum eqn. Then do other parts of my code.
> > > > > > > > >
> > > > > > > > > I'm not sure if the above 2 methods can work or if there will be
> > > > > > > > > conflicts. Of course, an alternative will be:
> > > > > > > > >
> > > > > > > > > 3. Do the poisson, momentum eqns and other parts of the code
> > > > > > > > > separately. That is, code a standalone parallel poisson eqn and use
> > > > > > > > > sample values to test it. Same for the momentum and other parts of
> > > > > > > > > the code. When each of them is working, combine them to form the full
> > > > > > > > > parallel code. However, this will be much more troublesome.
> > > > > > > > >
> > > > > > > > > I hope someone can give me some recommendations.
> > > > > > > > >
> > > > > > > > > Thank you once again.
> > > > > > > > >
> > > > > > > > > Matthew Knepley wrote:
> > > > > > > > > > 1) There is no way to have any idea what is going on in your code
> > > > > > > > > > without -log_summary output
> > > > > > > > > >
> > > > > > > > > > 2) Looking at that output, look at the percentage taken by the
> > > > > > > > > > solver KSPSolve event. I suspect it is not the biggest component,
> > > > > > > > > > because it is very scalable.
> > > > > > > > > >
> > > > > > > > > >    Matt
> > > > > > > > > >
> > > > > > > > > > On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote:
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > I've a serial 2D CFD code. As my grid size requirement increases,
> > > > > > > > > > > the simulation takes longer. Also, memory requirement becomes a
> > > > > > > > > > > problem. Grid size has reached 1200x1200. Going higher is not
> > > > > > > > > > > possible due to memory problem.
> > > > > > > > > > >
> > > > > > > > > > > I tried to convert my code to a parallel one, following the
> > > > > > > > > > > examples given. I also need to restructure parts of my code to
> > > > > > > > > > > enable parallel looping. I 1st changed the PETSc solver to be
> > > > > > > > > > > parallel enabled and then I restructured parts of my code. I
> > > > > > > > > > > proceed on as long as the answer for a simple test case is
> > > > > > > > > > > correct. I thought it's not really possible to do any speed
> > > > > > > > > > > testing since the code is not fully parallelized yet. When I
> > > > > > > > > > > finished most of the conversion, I found that in the actual run
> > > > > > > > > > > it is much slower, although the answer is correct.
> > > > > > > > > > >
> > > > > > > > > > > So what is the remedy now? I wonder what I should do to check
> > > > > > > > > > > what's wrong. Must I restart everything again?
Btw, my grid size is > 1200x1200. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > believed > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > it should be suitable for parallel run of 4 processors? > Is that > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > so? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From pivello at gmail.com Tue Apr 15 13:46:49 2008 From: pivello at gmail.com (=?ISO-8859-1?Q?M=E1rcio_Ricardo_Pivello?=) Date: Tue, 15 Apr 2008 15:46:49 -0300 Subject: PETSc + HYPRE In-Reply-To: References: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> Message-ID: <7d6158b80804151146w67635185v75088d41724a6fb3@mail.gmail.com> Hy, Matthew, thanks for your help. Firstly, I'm solving a 3D Incompressible Navier Stokes solver based on FEM, with fluid-structure interaction. In this case, I'm simulating the blood flow inside an aneurysm in an abdominal aorta artery. By not working I mean the error does not decrease with time. Our team is just starting using HYPRE, in fact this is the very first case we run with it. Again, thanks for your help. M?rcio Ricardo. -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Tue Apr 15 13:51:26 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 13:51:26 -0500 Subject: PETSc + HYPRE In-Reply-To: <7d6158b80804151146w67635185v75088d41724a6fb3@mail.gmail.com> References: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> <7d6158b80804151146w67635185v75088d41724a6fb3@mail.gmail.com> Message-ID: On Tue, Apr 15, 2008 at 1:46 PM, M?rcio Ricardo Pivello wrote: > Hy, Matthew, thanks for your help. > > Firstly, I'm solving a 3D Incompressible Navier Stokes solver based on FEM, > with fluid-structure interaction. In this case, I'm simulating the blood > flow inside an aneurysm in an abdominal aorta artery. > By not working I mean the error does not decrease with time. Our team is In this case, in addition to my last mail, you want to look at -ksp_monitor -ksp_converged_reason to see what happened in the solver. Matt > just starting using HYPRE, in fact this is the very first case we run with > it. > > > Again, thanks for your help. > > > M?rcio Ricardo. 
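For reference, a minimal sketch of the KSP/PC setup and runtime monitoring
being discussed in this thread, written against the PETSc 2.3.x-era C API.
The routine name and the assumption that the matrix A and vectors b, x are
already assembled are illustrative, not taken from Márcio's code.

/* Sketch only: configure a KSP to use hypre BoomerAMG through its PC
   and leave room for -ksp_view / -ksp_monitor / -ksp_converged_reason. */
#include "petscksp.h"

PetscErrorCode solve_with_boomeramg(Mat A, Vec b, Vec x)
{
  KSP            ksp;
  PC             pc;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);CHKERRQ(ierr);

  /* Create the KSP first, then extract and configure its PC. */
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCHYPRE);CHKERRQ(ierr);
  ierr = PCHYPRESetType(pc, "boomeramg");CHKERRQ(ierr);

  /* Pick up -ksp_view, -ksp_monitor, -ksp_converged_reason, etc. */
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = KSPDestroy(ksp);CHKERRQ(ierr); /* 2.3.x signature takes the KSP itself */
  PetscFunctionReturn(0);
}

The same configuration can also be selected entirely from the command line
with -pc_type hypre -pc_hypre_type boomeramg -ksp_view, which confirms that
hypre is actually the preconditioner being used.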
> > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From dalcinl at gmail.com Tue Apr 15 18:43:22 2008 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Tue, 15 Apr 2008 20:43:22 -0300 Subject: PETSc + HYPRE In-Reply-To: <7d6158b80804151146w67635185v75088d41724a6fb3@mail.gmail.com> References: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> <7d6158b80804151146w67635185v75088d41724a6fb3@mail.gmail.com> Message-ID: Sorry for my insistence, but... Did you see my previous mail? The code you wrote is not OK. You have to first create the KSP, next extract the PC with KSPGetPC, and then configure the PC to use HYPRE+BoomerAMG To be sure you are actually being using hypre, add -ksp_view to command line. On 4/15/08, M?rcio Ricardo Pivello wrote: > Hy, Matthew, thanks for your help. > > Firstly, I'm solving a 3D Incompressible Navier Stokes solver based on FEM, > with fluid-structure interaction. In this case, I'm simulating the blood > flow inside an aneurysm in an abdominal aorta artery. > By not working I mean the error does not decrease with time. Our team is > just starting using HYPRE, in fact this is the very first case we run with > it. > > > Again, thanks for your help. > > > M?rcio Ricardo. > > > -- Lisandro Dalc?n --------------- Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) PTLC - G?emes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594 From rlmackie862 at gmail.com Tue Apr 15 19:19:14 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Tue, 15 Apr 2008 17:19:14 -0700 Subject: general question on speed using quad core Xeons Message-ID: <48054602.9040200@gmail.com> I'm running my PETSc code on a cluster of quad core Xeon's connected by Infiniband. I hadn't much worried about the performance, because everything seemed to be working quite well, but today I was actually comparing performance (wall clock time) for the same problem, but on different combinations of CPUS. I find that my PETSc code is quite scalable until I start to use multiple cores/cpu. For example, the run time doesn't improve by going from 1 core/cpu to 4 cores/cpu, and I find this to be very strange, especially since looking at top or Ganglia, all 4 cpus on each node are running at 100% almost all of the time. I would have thought if the cpus were going all out, that I would still be getting much more scalable results. We are using mvapich-0.9.9 with infiniband. So, I don't know if this is a cluster/Xeon issue, or something else. Anybody with experience on this? Thanks, Randy M. From knepley at gmail.com Tue Apr 15 19:34:08 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 19:34:08 -0500 Subject: general question on speed using quad core Xeons In-Reply-To: <48054602.9040200@gmail.com> References: <48054602.9040200@gmail.com> Message-ID: On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie wrote: > I'm running my PETSc code on a cluster of quad core Xeon's connected > by Infiniband. I hadn't much worried about the performance, because > everything seemed to be working quite well, but today I was actually > comparing performance (wall clock time) for the same problem, but on > different combinations of CPUS. 
> > I find that my PETSc code is quite scalable until I start to use > multiple cores/cpu. > > For example, the run time doesn't improve by going from 1 core/cpu > to 4 cores/cpu, and I find this to be very strange, especially since > looking at top or Ganglia, all 4 cpus on each node are running at 100% > almost > all of the time. I would have thought if the cpus were going all out, > that I would still be getting much more scalable results. Those a really coarse measures. There is absolutely no way that all cores are going 100%. Its easy to show by hand. Take the peak flop rate and this gives you the bandwidth needed to sustain that computation (if everything is perfect, like axpy). You will find that the chip bandwidth is far below this. A nice analysis is in http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf > We are using mvapich-0.9.9 with infiniband. So, I don't know if > this is a cluster/Xeon issue, or something else. This is actually mathematics! How satisfying. The only way to improve this is to change the data structure (e.g. use blocks) or change the algorithm (e.g. use spectral elements and unassembled structures) Matt > Anybody with experience on this? > > Thanks, Randy M. > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From rlmackie862 at gmail.com Tue Apr 15 19:41:09 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Tue, 15 Apr 2008 17:41:09 -0700 Subject: general question on speed using quad core Xeons In-Reply-To: References: <48054602.9040200@gmail.com> Message-ID: <48054B25.5030702@gmail.com> Then what's the point of having 4 and 8 cores per cpu for parallel computations then? I mean, I think I've done all I can to make my code as efficient as possible. I'm not quite sure I understand your comment about using blocks or unassembled structures. Randy Matthew Knepley wrote: > On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie wrote: >> I'm running my PETSc code on a cluster of quad core Xeon's connected >> by Infiniband. I hadn't much worried about the performance, because >> everything seemed to be working quite well, but today I was actually >> comparing performance (wall clock time) for the same problem, but on >> different combinations of CPUS. >> >> I find that my PETSc code is quite scalable until I start to use >> multiple cores/cpu. >> >> For example, the run time doesn't improve by going from 1 core/cpu >> to 4 cores/cpu, and I find this to be very strange, especially since >> looking at top or Ganglia, all 4 cpus on each node are running at 100% >> almost >> all of the time. I would have thought if the cpus were going all out, >> that I would still be getting much more scalable results. > > Those a really coarse measures. There is absolutely no way that all cores > are going 100%. Its easy to show by hand. Take the peak flop rate and > this gives you the bandwidth needed to sustain that computation (if > everything is perfect, like axpy). You will find that the chip bandwidth > is far below this. A nice analysis is in > > http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf > >> We are using mvapich-0.9.9 with infiniband. So, I don't know if >> this is a cluster/Xeon issue, or something else. > > This is actually mathematics! How satisfying. The only way to improve > this is to change the data structure (e.g. use blocks) or change the > algorithm (e.g. 
use spectral elements and unassembled structures) > > Matt > >> Anybody with experience on this? >> >> Thanks, Randy M. >> >> > > > From knepley at gmail.com Tue Apr 15 19:46:17 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 19:46:17 -0500 Subject: general question on speed using quad core Xeons In-Reply-To: <48054B25.5030702@gmail.com> References: <48054602.9040200@gmail.com> <48054B25.5030702@gmail.com> Message-ID: On Tue, Apr 15, 2008 at 7:41 PM, Randall Mackie wrote: > Then what's the point of having 4 and 8 cores per cpu for parallel > computations then? I mean, I think I've done all I can to make > my code as efficient as possible. I really advise reading the paper. It explicitly treats the case of blocking, and uses a simple model to demonstrate all the points I made. With a single, scalar sparse matrix, there is definitely no point at all of having multiple cores. However, this will speed up things like finite element integration. So, for instance, making this integration dominate your cost (like spectral element codes do) will show nice speedup. Ulrich Ruede has a great talk about this on his website. Matt > I'm not quite sure I understand your comment about using blocks > or unassembled structures. > > > Randy > > > > > Matthew Knepley wrote: > > > On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie > wrote: > > > > > I'm running my PETSc code on a cluster of quad core Xeon's connected > > > by Infiniband. I hadn't much worried about the performance, because > > > everything seemed to be working quite well, but today I was actually > > > comparing performance (wall clock time) for the same problem, but on > > > different combinations of CPUS. > > > > > > I find that my PETSc code is quite scalable until I start to use > > > multiple cores/cpu. > > > > > > For example, the run time doesn't improve by going from 1 core/cpu > > > to 4 cores/cpu, and I find this to be very strange, especially since > > > looking at top or Ganglia, all 4 cpus on each node are running at 100% > > > almost > > > all of the time. I would have thought if the cpus were going all out, > > > that I would still be getting much more scalable results. > > > > > > > Those a really coarse measures. There is absolutely no way that all cores > > are going 100%. Its easy to show by hand. Take the peak flop rate and > > this gives you the bandwidth needed to sustain that computation (if > > everything is perfect, like axpy). You will find that the chip bandwidth > > is far below this. A nice analysis is in > > > > http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf > > > > > > > We are using mvapich-0.9.9 with infiniband. So, I don't know if > > > this is a cluster/Xeon issue, or something else. > > > > > > > This is actually mathematics! How satisfying. The only way to improve > > this is to change the data structure (e.g. use blocks) or change the > > algorithm (e.g. use spectral elements and unassembled structures) > > > > Matt > > > > > > > Anybody with experience on this? > > > > > > Thanks, Randy M. > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
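To make the bandwidth argument above concrete, here is a rough
back-of-the-envelope estimate in C. The peak flop rate and memory bandwidth
below are illustrative 2008-era numbers, not measurements of this cluster.

/* Rough estimate of the memory bandwidth a perfect AXPY (y = a*x + y)
   would need to keep one core busy, versus what a socket can deliver. */
#include <stdio.h>

int main(void)
{
  double peak_gflops    = 10.0; /* assumed peak of one 2008-era Xeon core      */
  double bytes_per_flop = 12.0; /* AXPY: 2 flops move x[i], y[i] in, y[i] out,
                                   i.e. 24 bytes of traffic per 2 flops        */
  double socket_bw_gbs  = 10.0; /* assumed front-side-bus bandwidth, shared by
                                   all four cores on the socket                */

  double needed_gbs = peak_gflops * bytes_per_flop; /* per core */

  printf("bandwidth needed per core : %6.1f GB/s\n", needed_gbs);
  printf("bandwidth available/socket: %6.1f GB/s\n", socket_bw_gbs);
  printf("so memory-bound kernels run at a few percent of peak, and adding\n"
         "cores on the same socket adds no bandwidth.\n");
  return 0;
}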
-- Norbert Wiener From zonexo at gmail.com Tue Apr 15 19:52:19 2008 From: zonexo at gmail.com (Ben Tay) Date: Wed, 16 Apr 2008 08:52:19 +0800 Subject: Slow speed after changing from serial to parallel In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> Message-ID: <48054DC3.8080005@gmail.com> Hi, I was initially using LU and Hypre to solve my serial code. I switched to the default GMRES when I converted the parallel code. I've now redo the test using KSPBCGS and also Hypre BommerAMG. Seems like MatAssemblyBegin, VecAYPX, VecScatterEnd (in bold) are the problems. What should I be checking? Here's the results for 1 and 2 processor for each solver. Thank you so much! *1 processor KSPBCGS * ************************************************************************************************************************ *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document *** ************************************************************************************************************************ ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 Wed Apr 16 08:32:21 2008 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b Max Max/Min Avg Total Time (sec): 8.176e+01 1.00000 8.176e+01 Objects: 2.700e+01 1.00000 2.700e+01 Flops: 1.893e+10 1.00000 1.893e+10 1.893e+10 Flops/sec: 2.315e+08 1.00000 2.315e+08 2.315e+08 MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Reductions: 3.743e+03 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 8.1756e+01 100.0% 1.8925e+10 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 3.743e+03 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops/sec: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). %T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ ########################################################## # # # WARNING!!! # # # # This code was run without the PreLoadBegin() # # macros. 
To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. # ########################################################## Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 1498 1.0 1.6548e+01 1.0 3.55e+08 1.0 0.0e+00 0.0e+00 0.0e+00 20 31 0 0 0 20 31 0 0 0 355 MatSolve 1500 1.0 3.2228e+01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 0.0e+00 39 31 0 0 0 39 31 0 0 0 183 MatLUFactorNum 2 1.0 2.0642e-01 1.0 1.02e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 102 MatILUFactorSym 2 1.0 2.0250e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyBegin 2 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyEnd 2 1.0 1.7963e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 2 1.0 3.8147e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 2 1.0 2.6301e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatZeroEntries 2 1.0 1.0190e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSetup 2 1.0 2.8230e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 2 1.0 6.7238e+01 1.0 2.81e+08 1.0 0.0e+00 0.0e+00 3.7e+03 82100 0 0100 82100 0 0100 281 PCSetUp 2 1.0 4.3527e-01 1.0 4.85e+07 1.0 0.0e+00 0.0e+00 6.0e+00 1 0 0 0 0 1 0 0 0 0 48 PCApply 1500 1.0 3.2232e+01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 0.0e+00 39 31 0 0 0 39 31 0 0 0 183 VecDot 2984 1.0 5.3279e+00 1.0 4.84e+08 1.0 0.0e+00 0.0e+00 3.0e+03 7 14 0 0 80 7 14 0 0 80 484 VecNorm 754 1.0 1.1453e+00 1.0 5.74e+08 1.0 0.0e+00 0.0e+00 7.5e+02 1 3 0 0 20 1 3 0 0 20 574 VecCopy 2 1.0 3.2830e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 3 1.0 3.9389e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAXPY 2244 1.0 4.8304e+00 1.0 4.02e+08 1.0 0.0e+00 0.0e+00 0.0e+00 6 10 0 0 0 6 10 0 0 0 402 VecAYPX 752 1.0 1.5623e+00 1.0 4.19e+08 1.0 0.0e+00 0.0e+00 0.0e+00 2 3 0 0 0 2 3 0 0 0 419 VecWAXPY 1492 1.0 5.0827e+00 1.0 2.54e+08 1.0 0.0e+00 0.0e+00 0.0e+00 6 7 0 0 0 6 7 0 0 0 254 VecAssemblyBegin 2 1.0 2.6703e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAssemblyEnd 2 1.0 5.2452e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. 
--- Event Stage 0: Main Stage Matrix 4 4 300369852 0 Krylov Solver 2 2 8 0 Preconditioner 2 2 336 0 Index Set 6 6 15554064 0 Vec 13 13 44937496 0 ======================================================================================================================== Average time to get PetscTime(): 3.09944e-07 OptionTable: -log_summary Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 Configure run at: Tue Jan 8 22:22:08 2008 Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi-shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 ----------------------------------------- Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 *2 processors KSPBCGS * ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./a.out on a atlas3-mp named atlas3-c25 with 2 processors, by g0306332 Wed Apr 16 08:37:25 2008 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b Max Max/Min Avg Total Time (sec): 3.795e+02 1.00000 3.795e+02 Objects: 3.800e+01 1.00000 3.800e+01 Flops: 8.592e+09 1.00000 8.592e+09 1.718e+10 Flops/sec: 2.264e+07 1.00000 2.264e+07 4.528e+07 MPI Messages: 1.335e+03 1.00000 1.335e+03 2.670e+03 MPI Message Lengths: 6.406e+06 1.00000 4.798e+03 1.281e+07 MPI Reductions: 1.678e+03 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 3.7950e+02 100.0% 1.7185e+10 100.0% 2.670e+03 100.0% 4.798e+03 100.0% 3.357e+03 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops/sec: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). %T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ ########################################################## # # # WARNING!!! 
# # # # This code was run without the PreLoadBegin() # # macros. To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. # ########################################################## Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 1340 1.0 7.4356e+01 1.6 5.87e+07 1.6 2.7e+03 4.8e+03 0.0e+00 16 31100100 0 16 31100100 0 72 MatSolve 1342 1.0 4.3794e+01 1.2 7.08e+07 1.2 0.0e+00 0.0e+00 0.0e+00 11 31 0 0 0 11 31 0 0 0 123 MatLUFactorNum 2 1.0 2.5116e-01 1.0 7.68e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 153 MatILUFactorSym 2 1.0 2.3831e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 *MatAssemblyBegin 2 1.0 7.9380e-0116482.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0* MatAssemblyEnd 2 1.0 2.4782e-01 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 2 1.0 5.0068e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 2 1.0 1.8508e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatZeroEntries 2 1.0 8.6530e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSetup 3 1.0 1.9901e-01 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 2 1.0 3.3575e+02 1.0 2.56e+07 1.0 2.7e+03 4.8e+03 3.3e+03 88100100100100 88100100100100 51 PCSetUp 3 1.0 5.0751e-01 1.0 3.79e+07 1.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 76 PCSetUpOnBlocks 1 1.0 4.4248e-02 1.0 4.39e+07 1.0 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 88 PCApply 1342 1.0 4.9832e+01 1.2 6.56e+07 1.2 0.0e+00 0.0e+00 0.0e+00 12 31 0 0 0 12 31 0 0 0 108 VecDot 2668 1.0 2.0710e+02 1.2 6.70e+06 1.2 0.0e+00 0.0e+00 2.7e+03 50 13 0 0 79 50 13 0 0 79 11 VecNorm 675 1.0 2.9565e+01 3.3 3.33e+07 3.3 0.0e+00 0.0e+00 6.7e+02 5 3 0 0 20 5 3 0 0 20 20 VecCopy 2 1.0 2.4400e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 1338 1.0 5.9052e+00 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 VecAXPY 2007 1.0 2.2173e+01 2.6 1.03e+08 2.6 0.0e+00 0.0e+00 0.0e+00 4 10 0 0 0 4 10 0 0 0 79 *VecAYPX 673 1.0 2.8062e+00 4.0 4.29e+08 4.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 213* VecWAXPY 1334 1.0 4.8052e+00 2.4 2.84e+08 2.4 0.0e+00 0.0e+00 0.0e+00 1 7 0 0 0 1 7 0 0 0 240 VecAssemblyBegin 2 1.0 1.4091e-04 3.1 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAssemblyEnd 2 1.0 5.0068e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 *VecScatterBegin 1334 1.0 1.1666e-01 5.9 0.00e+00 0.0 2.7e+03 4.8e+03 0.0e+00 0 0100100 0 0 0100100 0 0* VecScatterEnd 1334 1.0 5.2569e+01 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 0 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. 
--- Event Stage 0: Main Stage Matrix 6 6 283964900 0 Krylov Solver 3 3 8 0 Preconditioner 3 3 424 0 Index Set 8 8 12965152 0 Vec 17 17 34577080 0 Vec Scatter 1 1 0 0 ======================================================================================================================== Average time to get PetscTime(): 8.10623e-07 Average time for MPI_Barrier(): 5.72205e-07 Average time for zero size MPI_Send(): 1.90735e-06 OptionTable: -log_summary Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 Configure run at: Tue Jan 8 22:22:08 2008 @ @ *1 processor Hypre * ************************************************************************************************************************ *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document *** ************************************************************************************************************************ ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 Wed Apr 16 08:45:38 2008 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b Max Max/Min Avg Total Time (sec): 2.059e+01 1.00000 2.059e+01 Objects: 3.400e+01 1.00000 3.400e+01 Flops: 3.151e+08 1.00000 3.151e+08 3.151e+08 Flops/sec: 1.530e+07 1.00000 1.530e+07 1.530e+07 MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Reductions: 2.400e+01 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 2.0590e+01 100.0% 3.1512e+08 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 2.400e+01 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops/sec: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). %T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ ########################################################## # # # WARNING!!! # # # # This code was run without the PreLoadBegin() # # # # This code was run without the PreLoadBegin() # # macros. To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. 
# ########################################################## Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 12 1.0 2.6237e-01 1.0 4.24e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 35 0 0 0 1 35 0 0 0 424 MatSolve 7 1.0 4.5932e-01 1.0 2.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00 2 33 0 0 0 2 33 0 0 0 223 MatLUFactorNum 1 1.0 1.2635e-01 1.0 1.36e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 136 MatILUFactorSym 1 1.0 1.3007e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 1 0 0 0 4 1 0 0 0 4 0 MatConvert 1 1.0 4.1277e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0 MatAssemblyBegin 2 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyEnd 2 1.0 1.3946e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatGetRow 432000 1.0 8.4685e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 2 1.0 3.0994e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 1 1.0 1.6376e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 8 0 0 0 0 8 0 MatZeroEntries 2 1.0 8.2422e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPGMRESOrthog 6 1.0 1.0955e-01 1.0 3.31e+08 1.0 0.0e+00 0.0e+00 6.0e+00 1 12 0 0 25 1 12 0 0 25 331 KSPSetup 2 1.0 2.5418e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 2 1.0 5.9363e+00 1.0 5.31e+07 1.0 0.0e+00 0.0e+00 1.8e+01 29100 0 0 75 29100 0 0 75 53 PCSetUp 2 1.0 1.5691e+00 1.0 1.10e+07 1.0 0.0e+00 0.0e+00 5.0e+00 8 5 0 0 21 8 5 0 0 21 11 PCApply 14 1.0 3.7548e+00 1.0 2.73e+07 1.0 0.0e+00 0.0e+00 0.0e+00 18 33 0 0 0 18 33 0 0 0 27 VecMDot 6 1.0 7.7139e-02 1.0 2.35e+08 1.0 0.0e+00 0.0e+00 6.0e+00 0 6 0 0 25 0 6 0 0 25 235 VecNorm 14 1.0 9.9192e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 7.0e+00 0 6 0 0 29 0 6 0 0 29 183 VecScale 7 1.0 5.4052e-03 1.0 5.59e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 559 VecCopy 1 1.0 2.0301e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 9 1.0 1.1883e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAXPY 7 1.0 2.8702e-02 1.0 3.91e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 391 VecAYPX 6 1.0 2.8528e-02 1.0 3.63e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 363 VecMAXPY 7 1.0 4.1699e-02 1.0 5.59e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 7 0 0 0 0 7 0 0 0 559 VecAssemblyBegin 2 1.0 2.3842e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 25 0 0 0 0 25 0 VecAssemblyEnd 2 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecNormalize 7 1.0 1.3958e-02 1.0 6.50e+08 1.0 0.0e+00 0.0e+00 7.0e+00 0 3 0 0 29 0 3 0 0 29 650 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. 
--- Event Stage 0: Main Stage Matrix 3 3 267569524 0 Krylov Solver 2 2 17224 0 Preconditioner 2 2 440 0 Index Set 3 3 10369032 0 Vec 24 24 82961752 0 ======================================================================================================================== Average time to get PetscTime(): 1.90735e-07 OptionTable: -log_summary Compiled without FORTRAN kernels *2 processors Hypre* ************************************************************************************************************************ *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document *** ************************************************************************************************************************ ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./a.out on a atlas3-mp named atlas3-c48 with 2 processors, by g0306332 Wed Apr 16 08:46:56 2008 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b Max Max/Min Avg Total Time (sec): 9.614e+01 1.02903 9.478e+01 Objects: 4.100e+01 1.00000 4.100e+01 Flops: 2.778e+08 1.00000 2.778e+08 5.555e+08 Flops/sec: 2.973e+06 1.02903 2.931e+06 5.862e+06 MPI Messages: 7.000e+00 1.00000 7.000e+00 1.400e+01 MPI Message Lengths: 3.120e+04 1.00000 4.457e+03 6.240e+04 MPI Reductions: 1.650e+01 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 9.4784e+01 100.0% 5.5553e+08 100.0% 1.400e+01 100.0% 4.457e+03 100.0% 3.300e+01 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops/sec: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). %T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ ########################################################## # # # WARNING!!! # # # # This code was run without the PreLoadBegin() # # macros. To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. 
# ########################################################## Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s --- Event Stage 0: Main Stage MatMult 12 1.0 4.5412e-01 2.0 4.34e+08 2.0 1.2e+01 4.8e+03 0.0e+00 0 36 86 92 0 0 36 86 92 0 438 MatSolve 7 1.0 5.0386e-01 1.1 2.28e+08 1.1 0.0e+00 0.0e+00 0.0e+00 1 37 0 0 0 1 37 0 0 0 407 MatLUFactorNum 1 1.0 9.5120e-01 1.6 2.98e+07 1.6 0.0e+00 0.0e+00 0.0e+00 1 6 0 0 0 1 6 0 0 0 36 MatILUFactorSym 1 1.0 1.1285e+01 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 9 0 0 0 3 9 0 0 0 3 0 MatConvert 1 1.0 6.2023e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 *MatAssemblyBegin 2 1.0 3.1003e+01246.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 16 0 0 0 6 16 0 0 0 6 0* MatAssemblyEnd 2 1.0 2.2413e+00 1.9 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00 2 0 14 8 21 2 0 14 8 21 0 MatGetRow 216000 1.0 9.2643e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 3 1.0 5.9605e-06 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 1 1.0 2.4464e-01 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 6 0 0 0 0 6 0 MatZeroEntries 2 1.0 6.1072e+00 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 5 0 0 0 0 5 0 0 0 0 0 KSPGMRESOrthog 6 1.0 4.4529e-02 1.3 5.26e+08 1.3 0.0e+00 0.0e+00 6.0e+00 0 7 0 0 18 0 7 0 0 18 815 KSPSetup 2 1.0 1.8315e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0 KSPSolve 2 1.0 3.0572e+01 1.1 9.64e+06 1.1 1.2e+01 4.8e+03 1.8e+01 31100 86 92 55 31100 86 92 55 18 PCSetUp 2 1.0 2.0424e+01 1.3 1.07e+06 1.3 0.0e+00 0.0e+00 5.0e+00 19 6 0 0 15 19 6 0 0 15 2 PCApply 14 1.0 2.9443e+00 1.0 3.56e+07 1.0 0.0e+00 0.0e+00 0.0e+00 3 37 0 0 0 3 37 0 0 0 70 VecMDot 6 1.0 2.7561e-02 1.6 5.15e+08 1.6 0.0e+00 0.0e+00 6.0e+00 0 3 0 0 18 0 3 0 0 18 658 *VecNorm 14 1.0 1.4223e+00 5.1 5.45e+07 5.1 0.0e+00 0.0e+00 7.0e+00 1 5 0 0 21 1 5 0 0 21 21* VecScale 7 1.0 1.8604e-02 1.0 8.25e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 163 VecCopy 1 1.0 3.0069e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 9 1.0 3.2693e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAXPY 7 1.0 3.0581e-02 1.1 3.98e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 706 *VecAYPX 6 1.0 4.4344e+00147.6 3.45e+08147.6 0.0e+00 0.0e+00 0.0e+00 2 4 0 0 0 2 4 0 0 0 5* VecMAXPY 7 1.0 2.1892e-02 1.0 5.34e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 1066 VecAssemblyBegin 2 1.0 9.2602e-0412.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 18 0 0 0 0 18 0 VecAssemblyEnd 2 1.0 7.8678e-06 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecScatterBegin 6 1.0 9.3222e-05 1.1 0.00e+00 0.0 1.2e+01 4.8e+03 0.0e+00 0 0 86 92 0 0 0 86 92 0 0 *VecScatterEnd 6 1.0 1.9959e-011404.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0* VecNormalize 7 1.0 2.3088e-02 1.0 1.98e+08 1.0 0.0e+00 0.0e+00 7.0e+00 0 2 0 0 21 0 2 0 0 21 393 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. 
--- Event Stage 0: Main Stage Matrix 5 5 267571932 0 Krylov Solver 2 2 17224 0 Preconditioner 2 2 440 0 Index Set 5 5 10372120 0 Vec 26 26 53592184 0 Vec Scatter 1 1 0 0 ======================================================================================================================== Average time to get PetscTime(): 2.14577e-07 Average time for MPI_Barrier(): 8.10623e-07 Average time for zero size MPI_Send(): 1.43051e-06 OptionTable: -log_summary Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 Configure run at: Tue Jan 8 22:22:08 2008 Matthew Knepley wrote: > The convergence here is jsut horrendous. Have you tried using LU to check > your implementation? All the time is in the solve right now. I would first > try a direct method (at least on a small problem) and then try to understand > the convergence behavior. MUMPS can actually scale very well for big problems. > > Matt > > >>>>> >>>> >>> >>> >> > > > > From rlmackie862 at gmail.com Tue Apr 15 21:03:15 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Tue, 15 Apr 2008 19:03:15 -0700 Subject: general question on speed using quad core Xeons In-Reply-To: References: <48054602.9040200@gmail.com> <48054B25.5030702@gmail.com> Message-ID: <48055E63.5070606@gmail.com> Okay, but if I'm stuck with a big 3D finite difference code, written in PETSc using Distributed Arrays, with 3 dof per node, then you're saying there is really nothing I can do, except using blocking, to improve things on quad core cpus? They talk about blocking using BAIJ format, and so is this the same thing as creating MPIBAIJ matrices in PETSc? And is creating MPIBAIJ matrices in PETSc going to make a substantial difference in the speed? I'm sorry if I'm being dense, I'm just trying to understand if there is some simple way I can utilize those extra cores on each cpu easily, and since I'm not a computer scientist, some of these concepts are difficult. Thanks, Randy Matthew Knepley wrote: > On Tue, Apr 15, 2008 at 7:41 PM, Randall Mackie wrote: >> Then what's the point of having 4 and 8 cores per cpu for parallel >> computations then? I mean, I think I've done all I can to make >> my code as efficient as possible. > > I really advise reading the paper. It explicitly treats the case of > blocking, and uses > a simple model to demonstrate all the points I made. > > With a single, scalar sparse matrix, there is definitely no point at > all of having > multiple cores. However, this will speed up things like finite element > integration. > So, for instance, making this integration dominate your cost (like > spectral element > codes do) will show nice speedup. Ulrich Ruede has a great talk about this on > his website. > > Matt > >> I'm not quite sure I understand your comment about using blocks >> or unassembled structures. >> >> >> Randy >> >> >> >> >> Matthew Knepley wrote: >> >>> On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie >> wrote: >>>> I'm running my PETSc code on a cluster of quad core Xeon's connected >>>> by Infiniband. I hadn't much worried about the performance, because >>>> everything seemed to be working quite well, but today I was actually >>>> comparing performance (wall clock time) for the same problem, but on >>>> different combinations of CPUS. >>>> >>>> I find that my PETSc code is quite scalable until I start to use >>>> multiple cores/cpu. 
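A minimal sketch of the MPIBAIJ question raised here, assuming a 3D
distributed array with 3 dof per node as Randall describes, and using
2.3.x-era calls. The grid sizes, stencil choice, and routine name are
made up for illustration; they are not taken from his code.

/* Sketch only: ask PETSc for a blocked (BAIJ) matrix from a DA with
   3 dof per node instead of a scalar AIJ matrix. */
#include "petscmat.h"
#include "petscda.h"

PetscErrorCode create_da_baij(MPI_Comm comm, Mat *J)
{
  DA             da;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = DACreate3d(comm, DA_NONPERIODIC, DA_STENCIL_STAR,
                    100, 100, 100,                           /* global grid  */
                    PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                    3,                                       /* dof per node */
                    1,                                       /* stencil width*/
                    PETSC_NULL, PETSC_NULL, PETSC_NULL, &da);CHKERRQ(ierr);

  /* The block size 3 comes from the DA's dof; entries can then be set with
     MatSetValuesBlocked() so each node's 3x3 coupling is stored as one block. */
  ierr = DAGetMatrix(da, MATMPIBAIJ, J);CHKERRQ(ierr);

  ierr = DADestroy(da);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

How much the blocked format helps still depends on the kernels: it mainly
improves reuse in MatMult and MatSolve, which is the point Matt makes above.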
>>>> >>>> For example, the run time doesn't improve by going from 1 core/cpu >>>> to 4 cores/cpu, and I find this to be very strange, especially since >>>> looking at top or Ganglia, all 4 cpus on each node are running at 100% >>>> almost >>>> all of the time. I would have thought if the cpus were going all out, >>>> that I would still be getting much more scalable results. >>>> >>> Those a really coarse measures. There is absolutely no way that all cores >>> are going 100%. Its easy to show by hand. Take the peak flop rate and >>> this gives you the bandwidth needed to sustain that computation (if >>> everything is perfect, like axpy). You will find that the chip bandwidth >>> is far below this. A nice analysis is in >>> >>> http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf >>> >>> >>>> We are using mvapich-0.9.9 with infiniband. So, I don't know if >>>> this is a cluster/Xeon issue, or something else. >>>> >>> This is actually mathematics! How satisfying. The only way to improve >>> this is to change the data structure (e.g. use blocks) or change the >>> algorithm (e.g. use spectral elements and unassembled structures) >>> >>> Matt >>> >>> >>>> Anybody with experience on this? >>>> >>>> Thanks, Randy M. >>>> >>>> >>>> >>> >>> >>> >> > > > From zonexo at gmail.com Tue Apr 15 21:08:45 2008 From: zonexo at gmail.com (Ben Tay) Date: Wed, 16 Apr 2008 10:08:45 +0800 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> Message-ID: <48055FAD.3000105@gmail.com> An HTML attachment was scrubbed... URL: From knepley at gmail.com Tue Apr 15 21:20:02 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 21:20:02 -0500 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <48055FAD.3000105@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> Message-ID: On Tue, Apr 15, 2008 at 9:08 PM, Ben Tay wrote: > > Hi, > > I just tested the ex2f.F example, changing m and n to 600. Here's the > result for 1, 2 and 4 processors. Interestingly, MatAssemblyBegin, > MatGetOrdering and KSPSetup have ratios >>1. The time taken seems to be > faster as the processor increases, although speedup is not 1:1. I thought > that this example should scale well, shouldn't it? Is there something wrong > with my installation then? 1) Notice that the events that are unbalanced take 0.01% of the time. Not important. 2) The speedup really stinks. Even though this is a small problem. Are you sure that you are actually running on two processors with separate memory pipes and not on 1 dual core? Matt > Thank you. > > 1 processor: > > Norm of error 0.3371E+01 iterations 1153 > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. 
Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c58 with 1 processor, by g0306332 Wed > Apr 16 10:03:12 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.222e+02 1.00000 1.222e+02 > Objects: 4.400e+01 1.00000 4.400e+01 > Flops: 3.547e+10 1.00000 3.547e+10 3.547e+10 > Flops/sec: 2.903e+08 1.00000 2.903e+08 2.903e+08 > MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Reductions: 2.349e+03 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N > --> 2N flops > and VecAXPY() for complex vectors of length N > --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 1.2216e+02 100.0% 3.5466e+10 100.0% 0.000e+00 0.0% > 0.000e+00 0.0% 2.349e+03 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths > in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. 
# > ########################################################## > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 13 11 0 0 0 13 11 0 0 0 239 > MatSolve 1192 1.0 3.1017e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 124 > MatLUFactorNum 1 1.0 3.6166e-02 1.0 8.94e+07 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 89 > MatILUFactorSym 1 1.0 1.9690e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 2.6258e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 5.4259e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 1153 1.0 3.2664e+01 1.0 3.92e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 27 36 0 0 49 27 36 0 0 49 392 > VecNorm 1193 1.0 2.0344e+00 1.0 4.22e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 2 2 0 0 51 2 2 0 0 51 422 > VecScale 1192 1.0 6.9107e-01 1.0 6.21e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 1 0 0 0 1 1 0 0 0 621 > VecCopy 39 1.0 3.4571e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 41 1.0 1.1397e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 78 1.0 6.9354e-01 1.0 8.10e+07 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 81 > VecMAXPY 1192 1.0 3.7492e+01 1.0 3.63e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 31 38 0 0 0 31 38 0 0 0 363 > VecNormalize 1192 1.0 2.7284e+00 1.0 4.72e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 2 4 0 0 51 2 4 0 0 51 472 > KSPGMRESOrthog 1153 1.0 6.7939e+01 1.0 3.76e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 56 72 0 0 49 56 72 0 0 49 376 > KSPSetup 1 1.0 1.1651e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00 > 2.3e+03100100 0 0100 100100 0 0100 292 > PCSetUp 1 1.0 2.3852e-01 1.0 1.36e+07 1.0 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 14 > PCApply 1192 1.0 3.1021e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 124 > > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. 
> > --- Event Stage 0: Main Stage > > Matrix 2 2 54691212 0 > Index Set 3 3 4321032 0 > Vec 37 37 103708408 0 > Krylov Solver 1 1 17216 0 > Preconditioner 1 1 168 0 > > ======================================================================================================================== > Average time to get PetscTime(): 1.90735e-07 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 > --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 > --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 > --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 > --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 > --with-batch=1 --with-mpi-shared=0 > --with-mpi-include=/usr/local/topspin/mpi/mpich/include > --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a > --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun > --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 > ----------------------------------------- > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 > 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 > Using PETSc arch: atlas3-mpi > ----------------------------------------- > 85.53user 1.22system 2:02.65elapsed 70%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (16major+46429minor)pagefaults 0swaps > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > > 2 processors: > > Norm of error 0.3231E+01 iterations 1177 > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. 
Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c58 with 2 processors, by g0306332 Wed > Apr 16 09:48:37 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.034e+02 1.00000 1.034e+02 > Objects: 5.500e+01 1.00000 5.500e+01 > Flops: 1.812e+10 1.00000 1.812e+10 3.625e+10 > Flops/sec: 1.752e+08 1.00000 1.752e+08 3.504e+08 > MPI Messages: 1.218e+03 1.00000 1.218e+03 2.436e+03 > MPI Message Lengths: 5.844e+06 1.00000 4.798e+03 1.169e+07 > MPI Reductions: 1.204e+03 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N > --> 2N flops > and VecAXPY() for complex vectors of length N > --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 1.0344e+02 100.0% 3.6250e+10 100.0% 2.436e+03 100.0% > 4.798e+03 100.0% 2.407e+03 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths > in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. 
# > ########################################################## > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03 > 0.0e+00 11 11100100 0 11 11100100 0 315 > MatSolve 1217 1.0 2.1088e+01 1.2 1.10e+08 1.2 0.0e+00 0.0e+00 > 0.0e+00 19 11 0 0 0 19 11 0 0 0 187 > MatLUFactorNum 1 1.0 8.2862e-02 2.9 5.58e+07 2.9 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 39 > MatILUFactorSym 1 1.0 3.3310e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 1.5567e-011854.8 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 1.0352e-01 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 3.0994e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 5.0953e-0210.7 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 1177 1.0 4.0427e+01 1.1 1.85e+08 1.1 0.0e+00 0.0e+00 > 1.2e+03 37 36 0 0 49 37 36 0 0 49 323 > VecNorm 1218 1.0 1.5475e+01 1.9 5.25e+07 1.9 0.0e+00 0.0e+00 > 1.2e+03 12 2 0 0 51 12 2 0 0 51 57 > VecScale 1217 1.0 5.7866e-01 1.0 3.97e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 1 0 0 0 1 1 0 0 0 757 > VecCopy 40 1.0 6.6697e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 1259 1.0 1.5276e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecAXPY 80 1.0 2.1163e-01 2.4 3.21e+08 2.4 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 272 > VecMAXPY 1217 1.0 2.2980e+01 1.4 4.28e+08 1.4 0.0e+00 0.0e+00 > 0.0e+00 19 38 0 0 0 19 38 0 0 0 606 > VecScatterBegin 1217 1.0 3.6620e-02 1.4 0.00e+00 0.0 2.4e+03 4.8e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0 > VecScatterEnd 1217 1.0 8.1980e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecNormalize 1217 1.0 1.6030e+01 1.8 7.36e+07 1.8 0.0e+00 0.0e+00 > 1.2e+03 12 4 0 0 51 12 4 0 0 51 82 > KSPGMRESOrthog 1177 1.0 5.7248e+01 1.0 2.35e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 55 72 0 0 49 55 72 0 0 49 457 > KSPSetup 2 1.0 1.0363e-0110.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03 > 2.4e+03 99100100100100 99100100100100 352 > PCSetUp 2 1.0 1.5685e-01 2.3 2.40e+07 2.3 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 21 > PCSetUpOnBlocks 1 1.0 1.5668e-01 2.3 2.41e+07 2.3 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 21 > PCApply 1217 1.0 2.2625e+01 1.2 1.02e+08 1.2 0.0e+00 0.0e+00 > 0.0e+00 20 11 0 0 0 20 11 0 0 0 174 > > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. 
> > --- Event Stage 0: Main Stage > > Matrix 4 4 34540820 0 > Index Set 5 5 2164120 0 > Vec 41 41 53315992 0 > Vec Scatter 1 1 0 0 > Krylov Solver 2 2 17216 0 > Preconditioner 2 2 256 0 > > ======================================================================================================================== > Average time to get PetscTime(): 1.90735e-07 > Average time for MPI_Barrier(): 8.10623e-07 > Average time for zero size MPI_Send(): 2.98023e-06 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > > 42.64user 0.28system 1:08.08elapsed 63%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (18major+28609minor)pagefaults 0swaps > 1:08.08elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k > 0inputs+0outputs (18major+23666minor)pagefaults 0swaps > > > 4 processors: > > Norm of error 0.3090E+01 iterations 937 > 63.17user 0.05system 1:04.31elapsed 98%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (16major+13520minor)pagefaults 0swaps > 53.13user 0.06system 1:04.31elapsed 82%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (15major+13414minor)pagefaults 0swaps > 58.55user 0.23system 1:04.31elapsed 91%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (17major+18383minor)pagefaults 0swaps > 20.36user 0.67system 1:04.33elapsed 32%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (14major+18392minor)pagefaults 0swaps > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > > > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c45 with 4 processors, by g0306332 Wed > Apr 16 09:55:16 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 6.374e+01 1.00001 6.374e+01 > Objects: 5.500e+01 1.00000 5.500e+01 > Flops: 7.209e+09 1.00016 7.208e+09 2.883e+10 > Flops/sec: 1.131e+08 1.00017 1.131e+08 4.524e+08 > MPI Messages: 1.940e+03 2.00000 1.455e+03 5.820e+03 > MPI Message Lengths: 9.307e+06 2.00000 4.798e+03 2.792e+07 > MPI Reductions: 4.798e+02 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N > --> 2N flops > and VecAXPY() for complex vectors of length N > --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 6.3737e+01 100.0% 2.8832e+10 100.0% 5.820e+03 100.0% > 4.798e+03 100.0% 1.919e+03 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. 
> Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths > in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. # > ########################################################## > > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03 > 0.0e+00 8 11100100 0 8 11100100 0 321 > MatSolve 969 1.0 1.4244e+01 3.3 1.79e+08 3.3 0.0e+00 0.0e+00 > 0.0e+00 11 11 0 0 0 11 11 0 0 0 220 > MatLUFactorNum 1 1.0 5.2070e-02 6.2 9.63e+07 6.2 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 62 > MatILUFactorSym 1 1.0 1.7911e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 2.1741e-01164.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 3.5663e-02 1.0 0.00e+00 0.0 6.0e+00 2.4e+03 > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 2.1458e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 1.2779e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 937 1.0 3.5634e+01 2.1 1.52e+08 2.1 0.0e+00 0.0e+00 > 9.4e+02 48 36 0 0 49 48 36 0 0 49 292 > VecNorm 970 1.0 1.4387e+01 2.9 3.55e+07 2.9 0.0e+00 0.0e+00 > 9.7e+02 18 2 0 0 51 18 2 0 0 51 49 > VecScale 969 1.0 1.5714e-01 2.1 1.14e+09 2.1 0.0e+00 0.0e+00 > 0.0e+00 0 1 0 0 0 0 1 0 0 0 2220 > VecCopy 32 1.0 1.8988e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 1003 1.0 1.1690e+00 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecAXPY 64 1.0 2.1091e-02 1.1 6.07e+08 1.1 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 2185 > VecMAXPY 969 1.0 1.4823e+01 3.4 6.26e+08 3.4 0.0e+00 0.0e+00 > 0.0e+00 11 38 0 0 0 11 38 0 0 0 747 > VecScatterBegin 969 1.0 2.3238e-02 2.1 0.00e+00 0.0 5.8e+03 4.8e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0 > VecScatterEnd 969 1.0 1.4613e+0083.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecNormalize 969 1.0 1.4468e+01 2.8 5.15e+07 2.8 0.0e+00 0.0e+00 > 9.7e+02 18 4 0 0 50 18 4 0 0 50 72 > KSPGMRESOrthog 937 1.0 3.9924e+01 1.3 1.68e+08 1.3 0.0e+00 0.0e+00 > 9.4e+02 59 72 0 0 49 59 72 0 0 49 521 > KSPSetup 2 1.0 2.6190e-02 8.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03 > 1.9e+03 
98100100100 99 98100100100 99 461 > PCSetUp 2 1.0 7.1320e-02 4.1 4.59e+07 4.1 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 45 > PCSetUpOnBlocks 1 1.0 7.1230e-02 4.1 4.62e+07 4.1 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 45 > PCApply 969 1.0 1.5379e+01 3.3 1.66e+08 3.3 0.0e+00 0.0e+00 > 0.0e+00 12 11 0 0 0 12 11 0 0 0 203 > > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. > > --- Event Stage 0: Main Stage > > Matrix 4 4 17264420 0 > Index Set 5 5 1084120 0 > Vec 41 41 26675992 0 > Vec Scatter 1 1 0 0 > Krylov Solver 2 2 17216 0 > Preconditioner 2 2 256 0 > > ======================================================================================================================== > Average time to get PetscTime(): 1.90735e-07 > Average time for MPI_Barrier(): 6.00815e-06 > Average time for zero size MPI_Send(): 5.42402e-05 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > > > > Matthew Knepley wrote: > The convergence here is jsut horrendous. Have you tried using LU to check > your implementation? All the time is in the solve right now. I would first > try a direct method (at least on a small problem) and then try to understand > the convergence behavior. MUMPS can actually scale very well for big > problems. > > Matt > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From knepley at gmail.com Tue Apr 15 21:34:33 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 21:34:33 -0500 Subject: general question on speed using quad core Xeons In-Reply-To: <48055E63.5070606@gmail.com> References: <48054602.9040200@gmail.com> <48054B25.5030702@gmail.com> <48055E63.5070606@gmail.com> Message-ID: On Tue, Apr 15, 2008 at 9:03 PM, Randall Mackie wrote: > Okay, but if I'm stuck with a big 3D finite difference code, written in > PETSc > using Distributed Arrays, with 3 dof per node, then you're saying there is > really nothing I can do, except using blocking, to improve things on quad > core cpus? They talk about blocking using BAIJ format, and so is this the Yes, just about. > same thing as creating MPIBAIJ matrices in PETSc? And is creating MPIBAIJ Yes. > matrices in PETSc going to make a substantial difference in the speed? That is the hope. You can just give MPIBAIJ as the argument to DAGetMatrix(). > I'm sorry if I'm being dense, I'm just trying to understand if there is some > simple way I can utilize those extra cores on each cpu easily, and since > I'm not a computer scientist, some of these concepts are difficult. I really believe extra cores are currently a con for scientific computing. There are real mathematical barriers to their effective use. Matt > Thanks, Randy > Matthew Knepley wrote: > > > On Tue, Apr 15, 2008 at 7:41 PM, Randall Mackie > wrote: > > > > > Then what's the point of having 4 and 8 cores per cpu for parallel > > > computations then? I mean, I think I've done all I can to make > > > my code as efficient as possible. > > > > > > > I really advise reading the paper. 
It explicitly treats the case of > > blocking, and uses > > a simple model to demonstrate all the points I made. > > > > With a single, scalar sparse matrix, there is definitely no point at > > all of having > > multiple cores. However, this will speed up things like finite element > > integration. > > So, for instance, making this integration dominate your cost (like > > spectral element > > codes do) will show nice speedup. Ulrich Ruede has a great talk about this > on > > his website. > > > > Matt > > > > > > > I'm not quite sure I understand your comment about using blocks > > > or unassembled structures. > > > > > > > > > Randy > > > > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie > > > > > > > > wrote: > > > > > > > > > > > > I'm running my PETSc code on a cluster of quad core Xeon's connected > > > > > by Infiniband. I hadn't much worried about the performance, because > > > > > everything seemed to be working quite well, but today I was > actually > > > > > comparing performance (wall clock time) for the same problem, but > on > > > > > different combinations of CPUS. > > > > > > > > > > I find that my PETSc code is quite scalable until I start to use > > > > > multiple cores/cpu. > > > > > > > > > > For example, the run time doesn't improve by going from 1 core/cpu > > > > > to 4 cores/cpu, and I find this to be very strange, especially > since > > > > > looking at top or Ganglia, all 4 cpus on each node are running at > 100% > > > > > almost > > > > > all of the time. I would have thought if the cpus were going all > out, > > > > > that I would still be getting much more scalable results. > > > > > > > > > > > > > > Those a really coarse measures. There is absolutely no way that all > cores > > > > are going 100%. Its easy to show by hand. Take the peak flop rate and > > > > this gives you the bandwidth needed to sustain that computation (if > > > > everything is perfect, like axpy). You will find that the chip > bandwidth > > > > is far below this. A nice analysis is in > > > > > > > > http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf > > > > > > > > > > > > > > > > > We are using mvapich-0.9.9 with infiniband. So, I don't know if > > > > > this is a cluster/Xeon issue, or something else. > > > > > > > > > > > > > > This is actually mathematics! How satisfying. The only way to improve > > > > this is to change the data structure (e.g. use blocks) or change the > > > > algorithm (e.g. use spectral elements and unassembled structures) > > > > > > > > Matt > > > > > > > > > > > > > > > > > Anybody with experience on this? > > > > > > > > > > Thanks, Randy M. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From zonexo at gmail.com Tue Apr 15 22:01:28 2008 From: zonexo at gmail.com (Ben Tay) Date: Wed, 16 Apr 2008 11:01:28 +0800 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> Message-ID: <48056C08.6030903@gmail.com> An HTML attachment was scrubbed... 
URL: From knepley at gmail.com Tue Apr 15 22:08:02 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 22:08:02 -0500 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <48056C08.6030903@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> Message-ID: On Tue, Apr 15, 2008 at 10:01 PM, Ben Tay wrote: > > Hi Matthew, > > You mention that the unbalanced events take 0.01% of the time and speedup > is terrible. Where did you get this information? Are you referring to Global 1) Look at the time of the events you point out (1.0e-2s) and the total time or time for KSPSolve(1.0e2) 2) Look at the time for KSPSolve on 1 and 2 procs > %T? As for the speedup, do you look at the time reported by the "time" > command ie 63.17user 0.05system 1:04.31elapsed 98%CPU (0avgtext+0avgdata > 0maxresident)? > > I think you may be right. My school uses : > > The Supercomputing & Visualisation Unit, Computer Centre is pleased to > announce the addition of a new cluster of Linux-based compute servers, > consisting of a total of 64 servers (60 dual-core and 4 quad-core systems). > Each of the compute nodes in the cluster is equipped with the following > configurations: > > No of Nodes Processors Qty per node Total cores per node Memory per node > 4 Quad-Core Intel Xeon X5355 2 8 16 GB > 60 Dual-Core Intel Xeon 5160 2 4 8 GB > When I run on 2 processors, it states I'm running on 2*atlas3-c45. So does > it mean I running on shared memory bandwidth? So does it mean if I run on 4 > processors, is it equivalent to using 2 memory pipes? > > I also got a reply from my school's engineer: > > For queue mcore_parallel, LSF will assign the compute nodes automatically. > To most of applications, running with 2*atlas3-c45 and 2*atlas3-c50 may be > faster. However, it is not sure if 2*atlas3-c45 means to run the job within > one CPU on dual core, or with two CPUs on two separate cores. This is not > controllable. > > So what can I do on my side to ensure speedup? I hope I do not have to > switch from PETSc to other solvers. Switching solvers will do you no good at all. The easiest thing to do is get these guys to improve the scheduler. Every half decent scheduler can assure that you get separate processors. There is no excuse for forcing you into dual cores. Matt > Thanks lot! > > > > Matthew Knepley wrote: > On Tue, Apr 15, 2008 at 9:08 PM, Ben Tay wrote: > > > Hi, > > I just tested the ex2f.F example, changing m and n to 600. Here's the > result for 1, 2 and 4 processors. Interestingly, MatAssemblyBegin, > MatGetOrdering and KSPSetup have ratios >>1. The time taken seems to be > faster as the processor increases, although speedup is not 1:1. I thought > that this example should scale well, shouldn't it? Is there something wrong > with my installation then? > > 1) Notice that the events that are unbalanced take 0.01% of the time. > Not important. > > 2) The speedup really stinks. Even though this is a small problem. Are > you sure that > you are actually running on two processors with separate memory > pipes and not > on 1 dual core? > > Matt > > > > Thank you. > > 1 processor: > > Norm of error 0.3371E+01 iterations 1153 > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. 
Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c58 with 1 processor, by g0306332 Wed > Apr 16 10:03:12 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.222e+02 1.00000 1.222e+02 > Objects: 4.400e+01 1.00000 4.400e+01 > Flops: 3.547e+10 1.00000 3.547e+10 3.547e+10 > Flops/sec: 2.903e+08 1.00000 2.903e+08 2.903e+08 > MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Reductions: 2.349e+03 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N > --> 2N flops > and VecAXPY() for complex vectors of length N > --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 1.2216e+02 100.0% 3.5466e+10 100.0% 0.000e+00 0.0% > 0.000e+00 0.0% 2.349e+03 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths > in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. 
# > ########################################################## > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 13 11 0 0 0 13 11 0 0 0 239 > MatSolve 1192 1.0 3.1017e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 124 > MatLUFactorNum 1 1.0 3.6166e-02 1.0 8.94e+07 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 89 > MatILUFactorSym 1 1.0 1.9690e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 2.6258e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 5.4259e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 1153 1.0 3.2664e+01 1.0 3.92e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 27 36 0 0 49 27 36 0 0 49 392 > VecNorm 1193 1.0 2.0344e+00 1.0 4.22e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 2 2 0 0 51 2 2 0 0 51 422 > VecScale 1192 1.0 6.9107e-01 1.0 6.21e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 1 0 0 0 1 1 0 0 0 621 > VecCopy 39 1.0 3.4571e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 41 1.0 1.1397e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 78 1.0 6.9354e-01 1.0 8.10e+07 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 81 > VecMAXPY 1192 1.0 3.7492e+01 1.0 3.63e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 31 38 0 0 0 31 38 0 0 0 363 > VecNormalize 1192 1.0 2.7284e+00 1.0 4.72e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 2 4 0 0 51 2 4 0 0 51 472 > KSPGMRESOrthog 1153 1.0 6.7939e+01 1.0 3.76e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 56 72 0 0 49 56 72 0 0 49 376 > KSPSetup 1 1.0 1.1651e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00 > 2.3e+03100100 0 0100 100100 0 0100 292 > PCSetUp 1 1.0 2.3852e-01 1.0 1.36e+07 1.0 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 14 > PCApply 1192 1.0 3.1021e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 124 > > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. 
> > --- Event Stage 0: Main Stage > > Matrix 2 2 54691212 0 > Index Set 3 3 4321032 0 > Vec 37 37 103708408 0 > Krylov Solver 1 1 17216 0 > Preconditioner 1 1 168 0 > > ======================================================================================================================== > Average time to get PetscTime(): 1.90735e-07 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 > --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 > --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 > --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 > --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 > --with-batch=1 --with-mpi-shared=0 > --with-mpi-include=/usr/local/topspin/mpi/mpich/include > --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a > --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun > --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 > ----------------------------------------- > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 > 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 > Using PETSc arch: atlas3-mpi > ----------------------------------------- > 85.53user 1.22system 2:02.65elapsed 70%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (16major+46429minor)pagefaults 0swaps > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > > 2 processors: > > Norm of error 0.3231E+01 iterations 1177 > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. 
Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c58 with 2 processors, by g0306332 Wed > Apr 16 09:48:37 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.034e+02 1.00000 1.034e+02 > Objects: 5.500e+01 1.00000 5.500e+01 > Flops: 1.812e+10 1.00000 1.812e+10 3.625e+10 > Flops/sec: 1.752e+08 1.00000 1.752e+08 3.504e+08 > MPI Messages: 1.218e+03 1.00000 1.218e+03 2.436e+03 > MPI Message Lengths: 5.844e+06 1.00000 4.798e+03 1.169e+07 > MPI Reductions: 1.204e+03 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N > --> 2N flops > and VecAXPY() for complex vectors of length N > --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 1.0344e+02 100.0% 3.6250e+10 100.0% 2.436e+03 100.0% > 4.798e+03 100.0% 2.407e+03 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths > in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. 
# > ########################################################## > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03 > 0.0e+00 11 11100100 0 11 11100100 0 315 > MatSolve 1217 1.0 2.1088e+01 1.2 1.10e+08 1.2 0.0e+00 0.0e+00 > 0.0e+00 19 11 0 0 0 19 11 0 0 0 187 > MatLUFactorNum 1 1.0 8.2862e-02 2.9 5.58e+07 2.9 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 39 > MatILUFactorSym 1 1.0 3.3310e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 1.5567e-011854.8 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 1.0352e-01 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 3.0994e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 5.0953e-0210.7 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 1177 1.0 4.0427e+01 1.1 1.85e+08 1.1 0.0e+00 0.0e+00 > 1.2e+03 37 36 0 0 49 37 36 0 0 49 323 > VecNorm 1218 1.0 1.5475e+01 1.9 5.25e+07 1.9 0.0e+00 0.0e+00 > 1.2e+03 12 2 0 0 51 12 2 0 0 51 57 > VecScale 1217 1.0 5.7866e-01 1.0 3.97e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 1 0 0 0 1 1 0 0 0 757 > VecCopy 40 1.0 6.6697e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 1259 1.0 1.5276e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecAXPY 80 1.0 2.1163e-01 2.4 3.21e+08 2.4 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 272 > VecMAXPY 1217 1.0 2.2980e+01 1.4 4.28e+08 1.4 0.0e+00 0.0e+00 > 0.0e+00 19 38 0 0 0 19 38 0 0 0 606 > VecScatterBegin 1217 1.0 3.6620e-02 1.4 0.00e+00 0.0 2.4e+03 4.8e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0 > VecScatterEnd 1217 1.0 8.1980e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecNormalize 1217 1.0 1.6030e+01 1.8 7.36e+07 1.8 0.0e+00 0.0e+00 > 1.2e+03 12 4 0 0 51 12 4 0 0 51 82 > KSPGMRESOrthog 1177 1.0 5.7248e+01 1.0 2.35e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 55 72 0 0 49 55 72 0 0 49 457 > KSPSetup 2 1.0 1.0363e-0110.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03 > 2.4e+03 99100100100100 99100100100100 352 > PCSetUp 2 1.0 1.5685e-01 2.3 2.40e+07 2.3 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 21 > PCSetUpOnBlocks 1 1.0 1.5668e-01 2.3 2.41e+07 2.3 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 21 > PCApply 1217 1.0 2.2625e+01 1.2 1.02e+08 1.2 0.0e+00 0.0e+00 > 0.0e+00 20 11 0 0 0 20 11 0 0 0 174 > > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. 
> > --- Event Stage 0: Main Stage > > Matrix 4 4 34540820 0 > Index Set 5 5 2164120 0 > Vec 41 41 53315992 0 > Vec Scatter 1 1 0 0 > Krylov Solver 2 2 17216 0 > Preconditioner 2 2 256 0 > > ======================================================================================================================== > Average time to get PetscTime(): 1.90735e-07 > Average time for MPI_Barrier(): 8.10623e-07 > Average time for zero size MPI_Send(): 2.98023e-06 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > > 42.64user 0.28system 1:08.08elapsed 63%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (18major+28609minor)pagefaults 0swaps > 1:08.08elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k > 0inputs+0outputs (18major+23666minor)pagefaults 0swaps > > > 4 processors: > > Norm of error 0.3090E+01 iterations 937 > 63.17user 0.05system 1:04.31elapsed 98%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (16major+13520minor)pagefaults 0swaps > 53.13user 0.06system 1:04.31elapsed 82%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (15major+13414minor)pagefaults 0swaps > 58.55user 0.23system 1:04.31elapsed 91%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (17major+18383minor)pagefaults 0swaps > 20.36user 0.67system 1:04.33elapsed 32%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (14major+18392minor)pagefaults 0swaps > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > > > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c45 with 4 processors, by g0306332 Wed > Apr 16 09:55:16 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 6.374e+01 1.00001 6.374e+01 > Objects: 5.500e+01 1.00000 5.500e+01 > Flops: 7.209e+09 1.00016 7.208e+09 2.883e+10 > Flops/sec: 1.131e+08 1.00017 1.131e+08 4.524e+08 > MPI Messages: 1.940e+03 2.00000 1.455e+03 5.820e+03 > MPI Message Lengths: 9.307e+06 2.00000 4.798e+03 2.792e+07 > MPI Reductions: 4.798e+02 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N > --> 2N flops > and VecAXPY() for complex vectors of length N > --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 6.3737e+01 100.0% 2.8832e+10 100.0% 5.820e+03 100.0% > 4.798e+03 100.0% 1.919e+03 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. 
> Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths > in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. # > ########################################################## > > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03 > 0.0e+00 8 11100100 0 8 11100100 0 321 > MatSolve 969 1.0 1.4244e+01 3.3 1.79e+08 3.3 0.0e+00 0.0e+00 > 0.0e+00 11 11 0 0 0 11 11 0 0 0 220 > MatLUFactorNum 1 1.0 5.2070e-02 6.2 9.63e+07 6.2 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 62 > MatILUFactorSym 1 1.0 1.7911e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 2.1741e-01164.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 3.5663e-02 1.0 0.00e+00 0.0 6.0e+00 2.4e+03 > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 2.1458e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 1.2779e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 937 1.0 3.5634e+01 2.1 1.52e+08 2.1 0.0e+00 0.0e+00 > 9.4e+02 48 36 0 0 49 48 36 0 0 49 292 > VecNorm 970 1.0 1.4387e+01 2.9 3.55e+07 2.9 0.0e+00 0.0e+00 > 9.7e+02 18 2 0 0 51 18 2 0 0 51 49 > VecScale 969 1.0 1.5714e-01 2.1 1.14e+09 2.1 0.0e+00 0.0e+00 > 0.0e+00 0 1 0 0 0 0 1 0 0 0 2220 > VecCopy 32 1.0 1.8988e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 1003 1.0 1.1690e+00 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecAXPY 64 1.0 2.1091e-02 1.1 6.07e+08 1.1 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 2185 > VecMAXPY 969 1.0 1.4823e+01 3.4 6.26e+08 3.4 0.0e+00 0.0e+00 > 0.0e+00 11 38 0 0 0 11 38 0 0 0 747 > VecScatterBegin 969 1.0 2.3238e-02 2.1 0.00e+00 0.0 5.8e+03 4.8e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0 > VecScatterEnd 969 1.0 1.4613e+0083.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecNormalize 969 1.0 1.4468e+01 2.8 5.15e+07 2.8 0.0e+00 0.0e+00 > 9.7e+02 18 4 0 0 50 18 4 0 0 50 72 > KSPGMRESOrthog 937 1.0 3.9924e+01 1.3 1.68e+08 1.3 0.0e+00 0.0e+00 > 9.4e+02 59 72 0 0 49 59 72 0 0 49 521 > KSPSetup 2 1.0 2.6190e-02 8.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03 > 1.9e+03 
98100100100 99 98100100100 99 461 > PCSetUp 2 1.0 7.1320e-02 4.1 4.59e+07 4.1 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 45 > PCSetUpOnBlocks 1 1.0 7.1230e-02 4.1 4.62e+07 4.1 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 45 > PCApply 969 1.0 1.5379e+01 3.3 1.66e+08 3.3 0.0e+00 0.0e+00 > 0.0e+00 12 11 0 0 0 12 11 0 0 0 203 > > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. > > --- Event Stage 0: Main Stage > > Matrix 4 4 17264420 0 > Index Set 5 5 1084120 0 > Vec 41 41 26675992 0 > Vec Scatter 1 1 0 0 > Krylov Solver 2 2 17216 0 > Preconditioner 2 2 256 0 > > ======================================================================================================================== > Average time to get PetscTime(): 1.90735e-07 > Average time for MPI_Barrier(): 6.00815e-06 > Average time for zero size MPI_Send(): 5.42402e-05 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > > > > Matthew Knepley wrote: > The convergence here is jsut horrendous. Have you tried using LU to check > your implementation? All the time is in the solve right now. I would first > try a direct method (at least on a small problem) and then try to understand > the convergence behavior. MUMPS can actually scale very well for big > problems. > > Matt > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From balay at mcs.anl.gov Tue Apr 15 22:45:25 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Tue, 15 Apr 2008 22:45:25 -0500 (CDT) Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <48056C08.6030903@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> Message-ID: On Wed, 16 Apr 2008, Ben Tay wrote: > I think you may be right. My school uses : > ? No of Nodes Processors Qty per node Total cores per node Memory per node ? > ? 4 Quad-Core Intel Xeon X5355 2 8 16 GB ? > ? 
60 Dual-Core Intel Xeon 5160 2 4 8 GB I've attempted to run the same ex2f on a 2x quad-core Intel Xeon X5355 machine [with gcc/ latest mpich2 with --with-device=ch3:nemesis:newtcp] - and I get the following: << Logs for my run are attached >> asterix:/home/balay/download-pine>grep MatMult * ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 397 ex2f-600-2p.log:MatMult 1217 1.0 6.2256e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 632 ex2f-600-4p.log:MatMult 969 1.0 4.3311e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 15 11100100 0 15 11100100 0 724 ex2f-600-8p.log:MatMult 1318 1.0 5.6966e+00 1.0 5.33e+08 1.0 1.8e+04 4.8e+03 0.0e+00 16 11100100 0 16 11100100 0 749 asterix:/home/balay/download-pine>grep KSPSolve * ex2f-600-1p.log:KSPSolve 1 1.0 6.9165e+01 1.0 3.55e+10 1.0 0.0e+00 0.0e+00 2.3e+03100100 0 0100 100100 0 0100 513 ex2f-600-2p.log:KSPSolve 1 1.0 4.4005e+01 1.0 1.81e+10 1.0 2.4e+03 4.8e+03 2.4e+03100100100100100 100100100100100 824 ex2f-600-4p.log:KSPSolve 1 1.0 2.8139e+01 1.0 7.21e+09 1.0 5.8e+03 4.8e+03 1.9e+03100100100100 99 100100100100 99 1024 ex2f-600-8p.log:KSPSolve 1 1.0 3.6260e+01 1.0 4.90e+09 1.0 1.8e+04 4.8e+03 2.6e+03100100100100100 100100100100100 1081 asterix:/home/balay/download-pine> You get the following [with intel compilers?]: asterix:/home/balay/download-pine/x>grep MatMult * log.1:MatMult???????????? 1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 11? 0? 0? 0? 13 11? 0? 0? 0?? 239 log.2:MatMult???????????? 1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03 0.0e+00 11 11100100? 0? 11 11100100? 0?? 315 log.4:MatMult????????????? 969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03 0.0e+00? 8 11100100? 0?? 8 11100100? 0?? 321 asterix:/home/balay/download-pine/x>grep KSPSolve * log.1:KSPSolve?????????????? 1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00 2.3e+03100100? 0? 0100 100100? 0? 0100?? 292 log.2:KSPSolve?????????????? 1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03 2.4e+03 99100100100100? 99100100100100?? 352 log.4:KSPSolve?????????????? 1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03 1.9e+03 98100100100 99? 98100100100 99?? 461 asterix:/home/balay/download-pine/x> What exact CPU was this run on? A couple of comments: - my runs for MatMult have 1.0 ratio for 2,4,8 proc runs, while yours have 1.2, 3.6 for 2,4 proc runs [so higher load imbalance on your machine] - The peaks are also lower - not sure why. 397 for 1p-MatMult for me - vs 239 for you - Speedups I see for MatMult are: np me you 2 1.59 1.32 4 1.82 1.34 8 1.88 -------------------------- The primary issue is - expecting speedup of 4, from 4-cores and 8 from 8-cores. As Matt indicated perhaps in "Subject: general question on speed using quad core Xeons" thread, for sparse linear algebra - the performance is limited by memory bandwidth - not CPU So one have to look at the hardware memory architecture of the machine if you expect scalability. The 2x quad-core has a memory architecture that gives 11GB/s if one CPU-socket is used, but 22GB/s when both CPUs-sockets are used [irrespective of the number of cores in each CPU socket]. One inference is - max of 2 speedup can be obtained from such machine [due to 2 memory bank architecture]. So if you have 2 such machines [i.e 4 memory banks] - then you can expect a theoretical max speedup of 4. We are generally used to evaluating performance/cpu [or core]. Here the scalability numbers suck. 
However if you do performance/number-of-memory-banks - then things look better. Its just that we are used to always expecting scalability per node and assume it translates to scalability per core. [however the scalability per node - was more about scalability per memory bank - before multicore cpus took over] There is also another measure - performance/dollar spent. Generally the extra cores are practically free - so here this measure also holds up ok. Satish -------------- next part -------------- ************************************************************************************************************************ *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document *** ************************************************************************************************************************ ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./ex2f on a linux-tes named intel-loaner1 with 1 processor, by balay Tue Apr 15 22:02:38 2008 Using Petsc Development Version 2.3.3, Patch 12, unknown HG revision: unknown Max Max/Min Avg Total Time (sec): 6.936e+01 1.00000 6.936e+01 Objects: 4.400e+01 1.00000 4.400e+01 Flops: 3.547e+10 1.00000 3.547e+10 3.547e+10 Flops/sec: 5.113e+08 1.00000 5.113e+08 5.113e+08 MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Reductions: 2.349e+03 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 6.9359e+01 100.0% 3.5466e+10 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 2.349e+03 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). 
%T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ Event Count Time (sec) Flops --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 397 MatSolve 1192 1.0 1.8658e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 11 0 0 0 27 11 0 0 0 207 MatLUFactorNum 1 1.0 4.1455e-02 1.0 3.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 78 MatILUFactorSym 1 1.0 2.9251e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyBegin 1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyEnd 1 1.0 3.1618e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 1 1.0 5.1751e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecMDot 1153 1.0 1.6326e+01 1.0 1.28e+10 1.0 0.0e+00 0.0e+00 1.2e+03 24 36 0 0 49 24 36 0 0 49 783 VecNorm 1193 1.0 5.0365e+00 1.0 8.59e+08 1.0 0.0e+00 0.0e+00 1.2e+03 7 2 0 0 51 7 2 0 0 51 171 VecScale 1192 1.0 5.4950e-01 1.0 4.29e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1 1 0 0 0 781 VecCopy 39 1.0 6.6555e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 41 1.0 3.4185e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAXPY 78 1.0 1.2492e-01 1.0 5.62e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 450 VecMAXPY 1192 1.0 1.8493e+01 1.0 1.36e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27 38 0 0 0 27 38 0 0 0 736 VecNormalize 1192 1.0 5.5843e+00 1.0 1.29e+09 1.0 0.0e+00 0.0e+00 1.2e+03 8 4 0 0 51 8 4 0 0 51 231 KSPGMRESOrthog 1153 1.0 3.3669e+01 1.0 2.56e+10 1.0 0.0e+00 0.0e+00 1.2e+03 49 72 0 0 49 49 72 0 0 49 760 KSPSetup 1 1.0 1.1875e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 1 1.0 6.9165e+01 1.0 3.55e+10 1.0 0.0e+00 0.0e+00 2.3e+03100100 0 0100 100100 0 0100 513 PCSetUp 1 1.0 7.5919e-02 1.0 3.23e+06 1.0 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 43 PCApply 1192 1.0 1.8661e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 11 0 0 0 27 11 0 0 0 207 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. 
--- Event Stage 0: Main Stage Matrix 2 2 54695580 0 Vec 37 37 106606176 0 Krylov Solver 1 1 18016 0 Preconditioner 1 1 720 0 Index Set 3 3 4321464 0 ======================================================================================================================== Average time to get PetscTime(): 9.53674e-08 OptionTable: -log_summary ex2f-600-1p.log OptionTable: -m 600 OptionTable: -n 600 Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 Configure run at: Tue Apr 15 21:39:17 2008 Configure options: --with-mpi-dir=/home/balay/mpich2-svn --with-debugging=0 --download-f-blas-lapack=1 PETSC_ARCH=linux-test --with-shared=0 ----------------------------------------- Libraries compiled on Tue Apr 15 21:45:29 CDT 2008 on intel-loaner1 Machine characteristics: Linux intel-loaner1 2.6.20-16-generic #2 SMP Tue Feb 12 02:11:24 UTC 2008 x86_64 GNU/Linux Using PETSc directory: /home/balay/petsc-dev Using PETSc arch: linux-test ----------------------------------------- Using C compiler: /home/balay/mpich2-svn/bin/mpicc -fPIC -O Using Fortran compiler: /home/balay/mpich2-svn/bin/mpif90 -I. -fPIC -O ----------------------------------------- Using include paths: -I/home/balay/petsc-dev -I/home/balay/petsc-dev/linux-test/include -I/home/balay/petsc-dev/include -I/home/balay/mpich2-svn/include -I. -I/home/balay/mpich2-svn/src/include -I/home/balay/mpich2-svn/src/binding/f90 ------------------------------------------ Using C linker: /home/balay/mpich2-svn/bin/mpicc -fPIC -O Using Fortran linker: /home/balay/mpich2-svn/bin/mpif90 -I. -fPIC -O Using libraries: -Wl,-rpath,/home/balay/petsc-dev/linux-test/lib -L/home/balay/petsc-dev/linux-test/lib -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -Wl,-rpath,/home/balay/petsc-dev/linux-test/lib -L/home/balay/petsc-dev/linux-test/lib -lflapack -Wl,-rpath,/home/balay/petsc-dev/linux-test/lib -L/home/balay/petsc-dev/linux-test/lib -lfblas -lm -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -Wl,-rpath,/lib/../lib64 -L/lib/../lib64 -Wl,-rpath,/usr/lib/../lib64 -L/usr/lib/../lib64 -ldl -lgcc_s -lgfortranbegin -lgfortran -lm -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -Wl,-rpath,/lib/../lib64 -Wl,-rpath,/usr/lib/../lib64 -lm -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -Wl,-rpath,/lib/../lib64 -L/lib/../lib64 -Wl,-rpath,/usr/lib/../lib64 -L/usr/lib/../lib64 -ldl -lgcc_s -ldl ------------------------------------------ -------------- next part -------------- A non-text attachment was scrubbed... Name: ex2f-600-2p.log Type: application/octet-stream Size: 9562 bytes Desc: URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ex2f-600-4p.log Type: application/octet-stream Size: 9563 bytes Desc: URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ex2f-600-8p.log Type: application/octet-stream Size: 9562 bytes Desc: URL: From zonexo at gmail.com Tue Apr 15 23:35:24 2008 From: zonexo at gmail.com (Ben Tay) Date: Wed, 16 Apr 2008 12:35:24 +0800 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> Message-ID: <4805820C.8030803@gmail.com> Hi Satish, thank you very much for helping me run the ex2f.F code. I think I've a clearer picture now. I believe I'm running on Dual-Core Intel Xeon 5160. The quad core is only on atlas3-01 to 04 and there's only 4 of them. I guess that the lower peak is because I'm using Xeon 5160, while you are using Xeon X5355. You mention about the speedups for MatMult and compare between KSPSolve. Are these the only things we have to look at? Because I see that some other event such as VecMAXPY also takes up a sizable % of the time. To get an accurate speedup, do I just compare the time taken by KSPSolve between different no. of processors or do I have to look at other events such as MatMult as well? In summary, due to load imbalance, my speedup is quite bad. So maybe I'll just send your results to my school's engineer and see if they could do anything. For my part, I guess I'll just 've to wait? Thank alot! Satish Balay wrote: > On Wed, 16 Apr 2008, Ben Tay wrote: > > >> I think you may be right. My school uses : >> > > >> No of Nodes Processors Qty per node Total cores per node Memory per node >> 4 Quad-Core Intel Xeon X5355 2 8 16 GB >> 60 Dual-Core Intel Xeon 5160 2 4 8 GB >> > > > I've attempted to run the same ex2f on a 2x quad-core Intel Xeon X5355 > machine [with gcc/ latest mpich2 with --with-device=ch3:nemesis:newtcp] - and I get the following: > > << Logs for my run are attached >> > > asterix:/home/balay/download-pine>grep MatMult * > ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 397 > ex2f-600-2p.log:MatMult 1217 1.0 6.2256e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 632 > ex2f-600-4p.log:MatMult 969 1.0 4.3311e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 15 11100100 0 15 11100100 0 724 > ex2f-600-8p.log:MatMult 1318 1.0 5.6966e+00 1.0 5.33e+08 1.0 1.8e+04 4.8e+03 0.0e+00 16 11100100 0 16 11100100 0 749 > asterix:/home/balay/download-pine>grep KSPSolve * > ex2f-600-1p.log:KSPSolve 1 1.0 6.9165e+01 1.0 3.55e+10 1.0 0.0e+00 0.0e+00 2.3e+03100100 0 0100 100100 0 0100 513 > ex2f-600-2p.log:KSPSolve 1 1.0 4.4005e+01 1.0 1.81e+10 1.0 2.4e+03 4.8e+03 2.4e+03100100100100100 100100100100100 824 > ex2f-600-4p.log:KSPSolve 1 1.0 2.8139e+01 1.0 7.21e+09 1.0 5.8e+03 4.8e+03 1.9e+03100100100100 99 100100100100 99 1024 > ex2f-600-8p.log:KSPSolve 1 1.0 3.6260e+01 1.0 4.90e+09 1.0 1.8e+04 4.8e+03 2.6e+03100100100100100 100100100100100 1081 > asterix:/home/balay/download-pine> > > > You get the following [with intel compilers?]: > > asterix:/home/balay/download-pine/x>grep MatMult * > log.1:MatMult 1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 11 0 0 0 13 11 0 0 0 239 > log.2:MatMult 1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03 0.0e+00 11 11100100 0 11 11100100 0 315 > log.4:MatMult 969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03 0.0e+00 8 11100100 0 8 11100100 0 321 > asterix:/home/balay/download-pine/x>grep KSPSolve 
* > log.1:KSPSolve 1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00 2.3e+03100100 0 0100 100100 0 0100 292 > log.2:KSPSolve 1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03 2.4e+03 99100100100100 99100100100100 352 > log.4:KSPSolve 1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03 1.9e+03 98100100100 99 98100100100 99 461 > asterix:/home/balay/download-pine/x> > > What exact CPU was this run on? > > A couple of comments: > - my runs for MatMult have 1.0 ratio for 2,4,8 proc runs, while yours have 1.2, 3.6 for 2,4 proc runs [so higher > load imbalance on your machine] > - The peaks are also lower - not sure why. 397 for 1p-MatMult for me - vs 239 for you > - Speedups I see for MatMult are: > > np me you > > 2 1.59 1.32 > 4 1.82 1.34 > 8 1.88 > > -------------------------- > > The primary issue is - expecting speedup of 4, from 4-cores and 8 from 8-cores. > > As Matt indicated perhaps in "Subject: general question on speed using quad core Xeons" thread, > for sparse linear algebra - the performance is limited by memory bandwidth - not CPU > > So one have to look at the hardware memory architecture of the machine > if you expect scalability. > > The 2x quad-core has a memory architecture that gives 11GB/s if one > CPU-socket is used, but 22GB/s when both CPUs-sockets are used > [irrespective of the number of cores in each CPU socket]. One > inference is - max of 2 speedup can be obtained from such machine [due > to 2 memory bank architecture]. > > So if you have 2 such machines [i.e 4 memory banks] - then you can > expect a theoretical max speedup of 4. > > We are generally used to evaluating performance/cpu [or core]. Here > the scalability numbers suck. > > However if you do performance/number-of-memory-banks - then things look better. > > Its just that we are used to always expecting scalability per node and > assume it translates to scalability per core. [however the scalability > per node - was more about scalability per memory bank - before > multicore cpus took over] > > > There is also another measure - performance/dollar spent. Generally > the extra cores are practically free - so here this measure also holds > up ok. > > Satish From balay at mcs.anl.gov Wed Apr 16 00:25:45 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 16 Apr 2008 00:25:45 -0500 (CDT) Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <4805820C.8030803@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> Message-ID: On Wed, 16 Apr 2008, Ben Tay wrote: > Hi Satish, thank you very much for helping me run the ex2f.F code. > > I think I've a clearer picture now. I believe I'm running on Dual-Core Intel > Xeon 5160. The quad core is only on atlas3-01 to 04 and there's only 4 of > them. I guess that the lower peak is because I'm using Xeon 5160, while you > are using Xeon X5355. I'm still a bit puzzled. 
I just ran the same binary on a 2 dualcore xeon 5130 machine [which should be similar to your 5160 machine] and get the following: [balay at n001 ~]$ grep MatMult log* log.1:MatMult 1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 364 log.2:MatMult 1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 615 log.4:MatMult 969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 656 [balay at n001 ~]$ > You mention about the speedups for MatMult and compare between KSPSolve. Are > these the only things we have to look at? Because I see that some other event > such as VecMAXPY also takes up a sizable % of the time. To get an accurate > speedup, do I just compare the time taken by KSPSolve between different no. of > processors or do I have to look at other events such as MatMult as well? Sometimes we look at individual components like MatMult() VecMAXPY() to understand whats hapenning in each stage - and at KSPSolve() to look at the agregate performance for the whole solve [which includes MatMult VecMAXPY etc..]. Perhaps I should have also looked at VecMDot() aswell - at 48% of runtime - its the biggest contributor to KSPSolve() for your run. Its easy to get lost in the details of log_summary. Looking for anamolies is one thing. Plotting scalability charts for the solver is something else.. > In summary, due to load imbalance, my speedup is quite bad. So maybe I'll just > send your results to my school's engineer and see if they could do anything. > For my part, I guess I'll just 've to wait? Yes - load imbalance at MatMult level is bad. On 4 proc run you have ratio = 3.6 . This implies - there is one of the mpi-tasks is 3.6 times slower than the other task [so all speedup is lost here] You could try the latest mpich2 [1.0.7] - just for this SMP experiment, and see if it makes a difference. I've built mpich2 with [default gcc/gfortran and]: ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker There could be something else going on on this machine thats messing up load-balance for basic petsc example.. Satish From bsmith at mcs.anl.gov Wed Apr 16 07:14:37 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 16 Apr 2008 07:14:37 -0500 Subject: general question on speed using quad core Xeons In-Reply-To: <48054602.9040200@gmail.com> References: <48054602.9040200@gmail.com> Message-ID: <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> Randy, Please see http://www-unix.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers Essentially what has happened is that chip hardware designers (Intel, IBM, AMD) hit a wall on how high they can make their clock speed. They then needed some other way to try to increase the "performance" of their chips; since they could continue to make smaller circuits they came up on putting multiple cores on a single chip, then they can "double" or "quad" the claimed performance very easily. Unfortunately the whole multicore "solution" is really half-assed since it is difficult to effectively use all the cores, especially since the memory bandwidth did not improve as fast. Now when a company comes out with a half-assed product, do they say, "this is a half-assed product"? Did Microsoft say Vista was "half-assed". No, they emphasis the positive parts of their product and hide the limitations. This has been true since Grog made his first stone wheel in front of this cave. So Intel mislead everyone on how great multi-cores are. 
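(A quick way to see this bandwidth wall on a particular node is a minimal STREAM-triad-style loop. The sketch below is plain C, independent of PETSc and of anything in this thread; the array size and the gettimeofday timer are arbitrary choices. Compile it with optimization, run one copy per core at the same time, and add up the reported rates; the point where the total stops growing is the memory-bandwidth limit being described here.)

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  static double wtime(void)
  {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
  }

  int main(void)
  {
    const long n = 20000000;        /* 3 arrays of 160 MB each: big enough to defeat the caches */
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));
    double t;
    long   i;

    if (!a || !b || !c) return 1;
    for (i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    t = wtime();
    for (i = 0; i < n; i++) a[i] = b[i] + 3.0 * c[i];   /* STREAM "triad" */
    t = wtime() - t;

    /* about 3 doubles per iteration (read b, read c, write a) move through memory */
    printf("triad: %.2f GB/s (check %g)\n", 3.0 * n * sizeof(double) / t / 1.0e9, a[n / 2]);
    free(a); free(b); free(c);
    return 0;
  }

(One copy of this loop per core, launched together, is usually enough to see the aggregate rate flatten out well before all cores are busy on front-side-bus Xeons.)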
When you buy earlier dual or quad products you are NOT gettting a parallel system (even though it has 2 cores) because the memory is NOT parallel. Things are getting a bit better, Intel now has systems with higher memory bandwidth. The thing you have to look for is MEMORY BANDWDITH PER CORE, the higher that is the better performance you get. Note this doesn't have anything to do with PETSc, any sparse solver has the exact same issues. Barry On Apr 15, 2008, at 7:19 PM, Randall Mackie wrote: > I'm running my PETSc code on a cluster of quad core Xeon's connected > by Infiniband. I hadn't much worried about the performance, because > everything seemed to be working quite well, but today I was actually > comparing performance (wall clock time) for the same problem, but on > different combinations of CPUS. > > I find that my PETSc code is quite scalable until I start to use > multiple cores/cpu. > > For example, the run time doesn't improve by going from 1 core/cpu > to 4 cores/cpu, and I find this to be very strange, especially since > looking at top or Ganglia, all 4 cpus on each node are running at > 100% almost > all of the time. I would have thought if the cpus were going all out, > that I would still be getting much more scalable results. > > We are using mvapich-0.9.9 with infiniband. So, I don't know if > this is a cluster/Xeon issue, or something else. > > Anybody with experience on this? > > Thanks, Randy M. > From pivello at gmail.com Wed Apr 16 06:51:05 2008 From: pivello at gmail.com (=?ISO-8859-1?Q?M=E1rcio_Ricardo_Pivello?=) Date: Wed, 16 Apr 2008 08:51:05 -0300 Subject: PETSc + HYPRE In-Reply-To: References: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> <7d6158b80804151146w67635185v75088d41724a6fb3@mail.gmail.com> Message-ID: <7d6158b80804160451w2b964b32m6401e655e0ec6a4@mail.gmail.com> Hi, Lisandro. I must apologize for not answering. I read your email and changed my code, but then I went through a different path. I'm trying to call the preconditioner from the command line, without any mention to it in the source code. It should take a couple of hours to get some results and then, if it doesn't work, I'll change the code. Thank you very much. M?rcio Ricardo On Tue, Apr 15, 2008 at 8:43 PM, Lisandro Dalcin wrote: > Sorry for my insistence, but... Did you see my previous mail? The code > you wrote is not OK. You have to first create the KSP, next extract > the PC with KSPGetPC, and then configure the PC to use HYPRE+BoomerAMG > > To be sure you are actually being using hypre, add -ksp_view to command > line. > > > On 4/15/08, M?rcio Ricardo Pivello wrote: > > Hy, Matthew, thanks for your help. > > > > Firstly, I'm solving a 3D Incompressible Navier Stokes solver based on > FEM, > > with fluid-structure interaction. In this case, I'm simulating the blood > > flow inside an aneurysm in an abdominal aorta artery. > > By not working I mean the error does not decrease with time. Our team > is > > just starting using HYPRE, in fact this is the very first case we run > with > > it. > > > > > > Again, thanks for your help. > > > > > > M?rcio Ricardo. > > > > > > > > > -- > Lisandro Dalc?n > --------------- > Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) > Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > Tel/Fax: +54-(0)342-451.1594 > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From zonexo at gmail.com Wed Apr 16 08:44:15 2008 From: zonexo at gmail.com (Ben Tay) Date: Wed, 16 Apr 2008 21:44:15 +0800 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> Message-ID: <480602AF.5060802@gmail.com> Hi, Am I right to say that despite all the hype about multi-core processors, they can't speed up solving of linear eqns? It's not possible to get a 2x speedup when using 2 cores. And is this true for all types of linear equation solver besides PETSc? What about parallel direct solvers (e.g. MUMPS) or those which uses openmp instead of mpich? Well, I just can't help feeling disappointed if that's the case... Also, with a smart enough LSF scheduler, I will be assured of getting separate processors ie 1 core from each different processor instead of 2-4 cores from just 1 processor. In that case, if I use 1 core from processor A and 1 core from processor B, I should be able to get a decent speedup of more than 1, is that so? This option is also better than using 2 or even 4 cores from the same processor. Thank you very much. Satish Balay wrote: > On Wed, 16 Apr 2008, Ben Tay wrote: > > >> Hi Satish, thank you very much for helping me run the ex2f.F code. >> >> I think I've a clearer picture now. I believe I'm running on Dual-Core Intel >> Xeon 5160. The quad core is only on atlas3-01 to 04 and there's only 4 of >> them. I guess that the lower peak is because I'm using Xeon 5160, while you >> are using Xeon X5355. >> > > I'm still a bit puzzled. I just ran the same binary on a 2 dualcore > xeon 5130 machine [which should be similar to your 5160 machine] and > get the following: > > [balay at n001 ~]$ grep MatMult log* > log.1:MatMult 1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 364 > log.2:MatMult 1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 615 > log.4:MatMult 969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 656 > [balay at n001 ~]$ > > >> You mention about the speedups for MatMult and compare between KSPSolve. Are >> these the only things we have to look at? Because I see that some other event >> such as VecMAXPY also takes up a sizable % of the time. To get an accurate >> speedup, do I just compare the time taken by KSPSolve between different no. of >> processors or do I have to look at other events such as MatMult as well? >> > > Sometimes we look at individual components like MatMult() VecMAXPY() > to understand whats hapenning in each stage - and at KSPSolve() to > look at the agregate performance for the whole solve [which includes > MatMult VecMAXPY etc..]. Perhaps I should have also looked at > VecMDot() aswell - at 48% of runtime - its the biggest contributor to > KSPSolve() for your run. > > Its easy to get lost in the details of log_summary. Looking for > anamolies is one thing. Plotting scalability charts for the solver is > something else.. > > >> In summary, due to load imbalance, my speedup is quite bad. So maybe I'll just >> send your results to my school's engineer and see if they could do anything. >> For my part, I guess I'll just 've to wait? >> > > Yes - load imbalance at MatMult level is bad. 
On 4 proc run you have > ratio = 3.6 . This implies - there is one of the mpi-tasks is 3.6 > times slower than the other task [so all speedup is lost here] > > You could try the latest mpich2 [1.0.7] - just for this SMP > experiment, and see if it makes a difference. I've built mpich2 with > [default gcc/gfortran and]: > > ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker > > There could be something else going on on this machine thats messing > up load-balance for basic petsc example.. > > Satish > > > From knepley at gmail.com Wed Apr 16 08:48:37 2008 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 16 Apr 2008 08:48:37 -0500 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <480602AF.5060802@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> <480602AF.5060802@gmail.com> Message-ID: On Wed, Apr 16, 2008 at 8:44 AM, Ben Tay wrote: > Hi, > > Am I right to say that despite all the hype about multi-core processors, > they can't speed up solving of linear eqns? It's not possible to get a 2x > speedup when using 2 cores. And is this true for all types of linear > equation solver besides PETSc? What about parallel direct solvers (e.g. > MUMPS) or those which uses openmp instead of mpich? Well, I just can't help > feeling disappointed if that's the case... Notice that Satish got much much better scaling than you did on our box here. I think something is really wrong either with the installation of MPI on that box or something hardware-wise. Matt > Also, with a smart enough LSF scheduler, I will be assured of getting > separate processors ie 1 core from each different processor instead of 2-4 > cores from just 1 processor. In that case, if I use 1 core from processor A > and 1 core from processor B, I should be able to get a decent speedup of > more than 1, is that so? This option is also better than using 2 or even 4 > cores from the same processor. > > Thank you very much. > > Satish Balay wrote: > > > On Wed, 16 Apr 2008, Ben Tay wrote: > > > > > > > > > Hi Satish, thank you very much for helping me run the ex2f.F code. > > > > > > I think I've a clearer picture now. I believe I'm running on Dual-Core > Intel > > > Xeon 5160. The quad core is only on atlas3-01 to 04 and there's only 4 > of > > > them. I guess that the lower peak is because I'm using Xeon 5160, while > you > > > are using Xeon X5355. > > > > > > > > > > I'm still a bit puzzled. I just ran the same binary on a 2 dualcore > > xeon 5130 machine [which should be similar to your 5160 machine] and > > get the following: > > > > [balay at n001 ~]$ grep MatMult log* > > log.1:MatMult 1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e+00 > 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 364 > > log.2:MatMult 1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e+03 > 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 615 > > log.4:MatMult 969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e+03 > 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 656 > > [balay at n001 ~]$ > > > > > > > You mention about the speedups for MatMult and compare between KSPSolve. > Are > > > these the only things we have to look at? Because I see that some other > event > > > such as VecMAXPY also takes up a sizable % of the time. To get an > accurate > > > speedup, do I just compare the time taken by KSPSolve between different > no. 
of > > > processors or do I have to look at other events such as MatMult as well? > > > > > > > > > > Sometimes we look at individual components like MatMult() VecMAXPY() > > to understand whats hapenning in each stage - and at KSPSolve() to > > look at the agregate performance for the whole solve [which includes > > MatMult VecMAXPY etc..]. Perhaps I should have also looked at > > VecMDot() aswell - at 48% of runtime - its the biggest contributor to > > KSPSolve() for your run. > > > > Its easy to get lost in the details of log_summary. Looking for > > anamolies is one thing. Plotting scalability charts for the solver is > > something else.. > > > > > > > > > In summary, due to load imbalance, my speedup is quite bad. So maybe > I'll just > > > send your results to my school's engineer and see if they could do > anything. > > > For my part, I guess I'll just 've to wait? > > > > > > > > > > Yes - load imbalance at MatMult level is bad. On 4 proc run you have > > ratio = 3.6 . This implies - there is one of the mpi-tasks is 3.6 > > times slower than the other task [so all speedup is lost here] > > > > You could try the latest mpich2 [1.0.7] - just for this SMP > > experiment, and see if it makes a difference. I've built mpich2 with > > [default gcc/gfortran and]: > > > > ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker > > > > There could be something else going on on this machine thats messing > > up load-balance for basic petsc example.. > > > > Satish > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From rlmackie862 at gmail.com Wed Apr 16 09:13:26 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Wed, 16 Apr 2008 07:13:26 -0700 Subject: general question on speed using quad core Xeons In-Reply-To: <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> Message-ID: <48060986.8050102@gmail.com> Thanks Barry - very informative, and gave me a chuckle :-) Randy Barry Smith wrote: > > Randy, > > Please see > http://www-unix.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers > > Essentially what has happened is that chip hardware designers > (Intel, IBM, AMD) hit a wall > on how high they can make their clock speed. They then needed some other > way to try to > increase the "performance" of their chips; since they could continue to > make smaller circuits > they came up on putting multiple cores on a single chip, then they can > "double" or "quad" the > claimed performance very easily. Unfortunately the whole multicore > "solution" is really > half-assed since it is difficult to effectively use all the cores, > especially since the memory > bandwidth did not improve as fast. > > Now when a company comes out with a half-assed product, do they say, > "this is a half-assed product"? > Did Microsoft say Vista was "half-assed". No, they emphasis the positive > parts of their product and > hide the limitations. This has been true since Grog made his first > stone wheel in front of this cave. > So Intel mislead everyone on how great multi-cores are. > > When you buy earlier dual or quad products you are NOT gettting a > parallel system (even > though it has 2 cores) because the memory is NOT parallel. > > Things are getting a bit better, Intel now has systems with higher > memory bandwidth. 
> The thing you have to look for is MEMORY BANDWDITH PER CORE, the higher > that is the > better performance you get. > > Note this doesn't have anything to do with PETSc, any sparse solver has > the exact same > issues. > > Barry > > > > On Apr 15, 2008, at 7:19 PM, Randall Mackie wrote: >> I'm running my PETSc code on a cluster of quad core Xeon's connected >> by Infiniband. I hadn't much worried about the performance, because >> everything seemed to be working quite well, but today I was actually >> comparing performance (wall clock time) for the same problem, but on >> different combinations of CPUS. >> >> I find that my PETSc code is quite scalable until I start to use >> multiple cores/cpu. >> >> For example, the run time doesn't improve by going from 1 core/cpu >> to 4 cores/cpu, and I find this to be very strange, especially since >> looking at top or Ganglia, all 4 cpus on each node are running at 100% >> almost >> all of the time. I would have thought if the cpus were going all out, >> that I would still be getting much more scalable results. >> >> We are using mvapich-0.9.9 with infiniband. So, I don't know if >> this is a cluster/Xeon issue, or something else. >> >> Anybody with experience on this? >> >> Thanks, Randy M. >> > From bsmith at mcs.anl.gov Wed Apr 16 09:17:18 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 16 Apr 2008 09:17:18 -0500 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <480602AF.5060802@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> <480602AF.5060802@gmail.com> Message-ID: <2035258F-8269-4231-9110-DE9188AE5FC8@mcs.anl.gov> On Apr 16, 2008, at 8:44 AM, Ben Tay wrote: > Hi, > > Am I right to say that despite all the hype about multi-core > processors, they can't speed up solving of linear eqns? It's not > possible to get a 2x speedup when using 2 cores. And is this true > for all types of linear equation solver besides PETSc? It will basically be the same for any iterative solver package. > What about parallel direct solvers (e.g. MUMPS) direct solvers are a bit less memory bandwidth limited, so scaling will be a bit better. But the time spent for problems where iterative solvers work well will likely be much higher for direct solver. > or those which uses openmp instead of mpich? openmp will give no benefit, this is a hardware limitation, not software. > Well, I just can't help feeling disappointed if that's the case... If you are going to do parallel computing you need to get use to disappointment. At this point in time (especially first generation dual/quad core systems) memory bandwidth is the fundamental limitation (not number of flops your hardware can do) to speed. Barry > > > Also, with a smart enough LSF scheduler, I will be assured of > getting separate processors ie 1 core from each different processor > instead of 2-4 cores from just 1 processor. In that case, if I use 1 > core from processor A and 1 core from processor B, I should be able > to get a decent speedup of more than 1, is that so? So long as your iterative solver ALGORITHM scales well, then you should see very good speedup (and most people do). Algorithm scaling means if you increase the number of processes the number of iterations should not increase much. 
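(A simple way to check the algorithm-scaling side of this is to record the iteration count after each solve and compare it across process counts; if it stays roughly flat while the KSPSolve time drops, the remaining losses are the hardware issues discussed above. A minimal sketch, assuming a KSP ksp and Vecs b, x are already set up as in the ex2/ex2f runs in this thread:)

  PetscInt       its;
  PetscErrorCode ierr;

  ierr = KSPSolve(ksp, b, x);              CHKERRQ(ierr);
  ierr = KSPGetIterationNumber(ksp, &its); CHKERRQ(ierr);
  /* compare 'its' between the -np 1, 2, 4 runs; it should not grow much */
  ierr = PetscPrintf(PETSC_COMM_WORLD, "KSP iterations: %D\n", its); CHKERRQ(ierr);

(The same count can also be read off from the lines printed by -ksp_monitor.)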
> This option is also better than using 2 or even 4 cores from the > same processor. Two cores out of the four will likely not be so bad either; all four will be bad. Barry > > > Thank you very much. > > Satish Balay wrote: >> On Wed, 16 Apr 2008, Ben Tay wrote: >> >> >>> Hi Satish, thank you very much for helping me run the ex2f.F code. >>> >>> I think I've a clearer picture now. I believe I'm running on Dual- >>> Core Intel >>> Xeon 5160. The quad core is only on atlas3-01 to 04 and there's >>> only 4 of >>> them. I guess that the lower peak is because I'm using Xeon 5160, >>> while you >>> are using Xeon X5355. >>> >> >> I'm still a bit puzzled. I just ran the same binary on a 2 dualcore >> xeon 5130 machine [which should be similar to your 5160 machine] and >> get the following: >> >> [balay at n001 ~]$ grep MatMult log* >> log.1:MatMult 1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e >> +00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 364 >> log.2:MatMult 1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e >> +03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 615 >> log.4:MatMult 969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e >> +03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 656 >> [balay at n001 ~]$ >> >>> You mention about the speedups for MatMult and compare between >>> KSPSolve. Are >>> these the only things we have to look at? Because I see that some >>> other event >>> such as VecMAXPY also takes up a sizable % of the time. To get an >>> accurate >>> speedup, do I just compare the time taken by KSPSolve between >>> different no. of >>> processors or do I have to look at other events such as MatMult as >>> well? >>> >> >> Sometimes we look at individual components like MatMult() VecMAXPY() >> to understand whats hapenning in each stage - and at KSPSolve() to >> look at the agregate performance for the whole solve [which includes >> MatMult VecMAXPY etc..]. Perhaps I should have also looked at >> VecMDot() aswell - at 48% of runtime - its the biggest contributor to >> KSPSolve() for your run. >> >> Its easy to get lost in the details of log_summary. Looking for >> anamolies is one thing. Plotting scalability charts for the solver is >> something else.. >> >> >>> In summary, due to load imbalance, my speedup is quite bad. So >>> maybe I'll just >>> send your results to my school's engineer and see if they could do >>> anything. >>> For my part, I guess I'll just 've to wait? >>> >> >> Yes - load imbalance at MatMult level is bad. On 4 proc run you have >> ratio = 3.6 . This implies - there is one of the mpi-tasks is 3.6 >> times slower than the other task [so all speedup is lost here] >> >> You could try the latest mpich2 [1.0.7] - just for this SMP >> experiment, and see if it makes a difference. I've built mpich2 with >> [default gcc/gfortran and]: >> >> ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker >> >> There could be something else going on on this machine thats messing >> up load-balance for basic petsc example.. >> >> Satish >> >> >> > From dalcinl at gmail.com Wed Apr 16 09:24:03 2008 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Wed, 16 Apr 2008 11:24:03 -0300 Subject: PETSc + HYPRE In-Reply-To: <7d6158b80804160451w2b964b32m6401e655e0ec6a4@mail.gmail.com> References: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> <7d6158b80804151146w67635185v75088d41724a6fb3@mail.gmail.com> <7d6158b80804160451w2b964b32m6401e655e0ec6a4@mail.gmail.com> Message-ID: OK, You said you are trying to solve NS eqs. Are you using a pressure projection-like method? 
In that case, is the matrix of your pressure problem much different than the Laplacian one? How do you handle the pressure 'rigid-body' mode? On 4/16/08, M?rcio Ricardo Pivello wrote: > Hi, Lisandro. I must apologize for not answering. I read your email and > changed my code, but then I went through a different path. I'm trying to > call the preconditioner from the command line, without any mention to it in > the source code. It should take a couple of hours to get some results and > then, if it doesn't work, I'll change the code. > > > Thank you very much. > > > M?rcio Ricardo > > > > > > > > On Tue, Apr 15, 2008 at 8:43 PM, Lisandro Dalcin wrote: > > > Sorry for my insistence, but... Did you see my previous mail? The code > > you wrote is not OK. You have to first create the KSP, next extract > > the PC with KSPGetPC, and then configure the PC to use HYPRE+BoomerAMG > > > > To be sure you are actually being using hypre, add -ksp_view to command > line. > > > > > > > > > > > > On 4/15/08, M?rcio Ricardo Pivello wrote: > > > Hy, Matthew, thanks for your help. > > > > > > Firstly, I'm solving a 3D Incompressible Navier Stokes solver based on > FEM, > > > with fluid-structure interaction. In this case, I'm simulating the blood > > > flow inside an aneurysm in an abdominal aorta artery. > > > By not working I mean the error does not decrease with time. Our team > is > > > just starting using HYPRE, in fact this is the very first case we run > with > > > it. > > > > > > > > > Again, thanks for your help. > > > > > > > > > M?rcio Ricardo. > > > > > > > > > > > > > > > -- > > > > > > > > Lisandro Dalc?n > > --------------- > > Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) > > Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) > > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) > > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > > Tel/Fax: +54-(0)342-451.1594 > > > > > > -- Lisandro Dalc?n --------------- Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) PTLC - G?emes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594 From balay at mcs.anl.gov Wed Apr 16 09:27:41 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 16 Apr 2008 09:27:41 -0500 (CDT) Subject: general question on speed using quad core Xeons In-Reply-To: <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> Message-ID: Just a note: Intel does publish benchmarks for their chips. 
http://www.intel.com/performance/server/xeon/hpcapp.htm Satish From gsanjay at ethz.ch Wed Apr 16 09:27:33 2008 From: gsanjay at ethz.ch (Sanjay Govindjee) Date: Wed, 16 Apr 2008 16:27:33 +0200 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <2035258F-8269-4231-9110-DE9188AE5FC8@mcs.anl.gov> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> <480602AF.5060802@gmail.com> <2035258F-8269-4231-9110-DE9188AE5FC8@mcs.anl.gov> Message-ID: <48060CD5.1010308@ethz.ch> >> >> Also, with a smart enough LSF scheduler, I will be assured of getting >> separate processors ie 1 core from each different processor instead >> of 2-4 cores from just 1 processor. In that case, if I use 1 core >> from processor A and 1 core from processor B, I should be able to get >> a decent speedup of more than 1, is that so? > > You still need to be careful with the hardware you choose. If the processor's live on the same motherboard then you still need to make sure that they each have their own memory bus. Otherwise you will still face memory bottlenecks as each single core, from the different processors, fights for bandwidth on the bus. It all depends on the memory bus architecture of your system. In this regard, I recommend staying away from Intel style systems. -sg From bsmith at mcs.anl.gov Wed Apr 16 09:59:31 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 16 Apr 2008 09:59:31 -0500 Subject: general question on speed using quad core Xeons In-Reply-To: References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> Message-ID: <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> Cool. The pages to look at are http://www.intel.com/performance/server/xeon/hpc_ansys.htm http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm these are the two benchmarks that reflect the bottlenecks of memory bandwidth. When going from dual to quad they get 1.2 times the performance, when one would like 2 times the performance. Barry On Apr 16, 2008, at 9:27 AM, Satish Balay wrote: > Just a note: > > Intel does publish benchmarks for their chips. > > http://www.intel.com/performance/server/xeon/hpcapp.htm > > Satish > From berend at chalmers.se Wed Apr 16 10:10:32 2008 From: berend at chalmers.se (Berend van Wachem) Date: Wed, 16 Apr 2008 17:10:32 +0200 Subject: general question on speed using quad core Xeons In-Reply-To: <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> Message-ID: <480616E8.9020205@chalmers.se> Hi Barry, > http://www.intel.com/performance/server/xeon/hpc_ansys.htm > http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm Aren't both benchmarks run on Quads? The difference just being the cache per processor? Or am I mistaken? Berend. > these are the two benchmarks that reflect the bottlenecks of memory > bandwidth. > When going from dual to quad they get 1.2 times the performance, when > one would > like 2 times the performance. > > Barry > > > On Apr 16, 2008, at 9:27 AM, Satish Balay wrote: >> Just a note: >> >> Intel does publish benchmarks for their chips. 
>> >> http://www.intel.com/performance/server/xeon/hpcapp.htm >> >> Satish >> > From balay at mcs.anl.gov Wed Apr 16 10:38:18 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 16 Apr 2008 10:38:18 -0500 (CDT) Subject: general question on speed using quad core Xeons In-Reply-To: <480616E8.9020205@chalmers.se> References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> <480616E8.9020205@chalmers.se> Message-ID: On Wed, 16 Apr 2008, Berend van Wachem wrote: > Hi Barry, > > > > http://www.intel.com/performance/server/xeon/hpc_ansys.htm > > http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm > > > Aren't both benchmarks run on Quads? The difference just being the cache per > processor? Or am I mistaken? Yes - both are quads, and yes - the cache sizes are different. But I think the primary feature that contributes to the performance difference is memory bandwidth. The first one is 1333 FSB, the second one is 1600FSB - i.e 20% improvement in memory bandwidth => 20% improvement in performance for the above benchmarks. Satish From rlmackie862 at gmail.com Wed Apr 16 10:42:15 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Wed, 16 Apr 2008 08:42:15 -0700 Subject: general question on speed using quad core Xeons In-Reply-To: References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> <480616E8.9020205@chalmers.se> Message-ID: <48061E57.3090200@gmail.com> I just want to say that I really have appreciated this discussion - issues like this tend to get lost or not addressed when we're working on our codes, and it's been very enlightening for me. Randy Satish Balay wrote: > On Wed, 16 Apr 2008, Berend van Wachem wrote: > >> Hi Barry, >> >> >>> http://www.intel.com/performance/server/xeon/hpc_ansys.htm >>> http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm >> >> Aren't both benchmarks run on Quads? The difference just being the cache per >> processor? Or am I mistaken? > > Yes - both are quads, and yes - the cache sizes are different. > > But I think the primary feature that contributes to the performance > difference is memory bandwidth. The first one is 1333 FSB, the second > one is 1600FSB - i.e 20% improvement in memory bandwidth => 20% > improvement in performance for the above benchmarks. > > Satish > From tribur at vision.ee.ethz.ch Thu Apr 17 05:23:13 2008 From: tribur at vision.ee.ethz.ch (tribur at vision.ee.ethz.ch) Date: Thu, 17 Apr 2008 12:23:13 +0200 Subject: Hypre Message-ID: <20080417122313.ws2qnzcgg8co480w@email.ee.ethz.ch> Dear Petsc experts, Another, more basic problem when using Hypre: When I try ex2_hdf5 -f data/sphereInBlockFiner_a_cd3t.h5 -pc_type hypre -pc_hypre_type pilut -log_summary where sphereInBlock_a_cd3t.h5 contains a 102464 x 102464 matrix, the program seems to hang (it stops with the error "=>> PBS: job killed: walltime 384 exceeded limit 360"). Adding the option -ksp_max_it 1 to be sure that it is not iterating until 10000000000000 doesn't change anything. The same happens also if I use -pc_hypre_type boomeramg. It is neither the problem of my program nor of the matrix, because ex2_hdf5 -f data/sphereInBlockFiner_a_cd3t.h5 -pc_type jacobi -log_summary -ksp_rtol 0.0000000001 takes only 5s and gives me the correct solution. What do I do wrong? 
Looking forward to your answer, Kathrin From knepley at gmail.com Thu Apr 17 07:01:34 2008 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 17 Apr 2008 07:01:34 -0500 Subject: Hypre In-Reply-To: <20080417122313.ws2qnzcgg8co480w@email.ee.ethz.ch> References: <20080417122313.ws2qnzcgg8co480w@email.ee.ethz.ch> Message-ID: On Thu, Apr 17, 2008 at 5:23 AM, wrote: > Dear Petsc experts, > > Another, more basic problem when using Hypre: > > When I try > > ex2_hdf5 -f data/sphereInBlockFiner_a_cd3t.h5 -pc_type hypre -pc_hypre_type > pilut -log_summary > > where sphereInBlock_a_cd3t.h5 contains a 102464 x 102464 matrix, > the program seems to hang (it stops with the error "=>> PBS: job killed: > walltime 384 exceeded limit 360"). > Adding the option -ksp_max_it 1 to be sure that it is not iterating until > 10000000000000 doesn't change anything. The same happens also if I use > -pc_hypre_type boomeramg. If you really think it is hanging, I would attach gdb and get a stack trace. You can either run with -start_in_debugger, or attach gdb to the running process with gdb . It is conceivable to me that pilut just takes a really long time to factor the matrix. For boomeramg this is less likely, but still believeable. Matt > It is neither the problem of my program nor of the matrix, because > > ex2_hdf5 -f data/sphereInBlockFiner_a_cd3t.h5 -pc_type jacobi -log_summary > -ksp_rtol 0.0000000001 > > takes only 5s and gives me the correct solution. > > What do I do wrong? > > Looking forward to your answer, > Kathrin > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From zonexo at gmail.com Thu Apr 17 22:55:39 2008 From: zonexo at gmail.com (Ben Tay) Date: Fri, 18 Apr 2008 11:55:39 +0800 Subject: Slow speed after changing from serial to parallel In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> Message-ID: <48081BBB.5050004@gmail.com> An HTML attachment was scrubbed... URL: From balay at mcs.anl.gov Fri Apr 18 00:52:14 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Fri, 18 Apr 2008 00:52:14 -0500 (CDT) Subject: Slow speed after changing from serial to parallel In-Reply-To: <48081BBB.5050004@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48081BBB.5050004@gmail.com> Message-ID: On Fri, 18 Apr 2008, Ben Tay wrote: > Hi, > > I've email my school super computing staff and they told me that the queue which I'm using is one meant for testing, hence, it's > handling of work load is not good. I've sent my job to another queue and it's run on 4 processors. It's my own code because there seems > to be something wrong with the server displaying the summary when using -log_summary with ex2f.F. I'm trying it again. Thats wierd. We should first make sure ex2f [or ex2] are running properly before looking at your code. > > Anyway comparing just kspsolve between the two, the speedup is about 2.7. However, I noticed that for the 4 processors one, its > MatAssemblyBegin is? 1.5158e+02, which is more than KSPSolve's 4.7041e+00. So is MatAssemblyBegin's time included in KSPSolve? 
If not, > does it mean that there's something wrong about my MatAssemblyBegin? MatAssemblyBegin is not included in KSPSolve(). Something wierd is going here. There are 2 possibilities. - whatever code you have before matrix assembly is unbalanced, so MatAssemblyBegin() acts as a barrier . - MPI communication is not optimal within the node. Its best to first make sure ex2 or ex2f runs fine. As recommended earlier - you should try latest mpich2 with --with-device=ch3:nemesis:newtcp and compare ex2/ex2f performance with your current MPI. Satish From bsmith at mcs.anl.gov Fri Apr 18 07:08:46 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 18 Apr 2008 07:08:46 -0500 Subject: Slow speed after changing from serial to parallel In-Reply-To: <48081BBB.5050004@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48081BBB.5050004@gmail.com> Message-ID: On Apr 17, 2008, at 10:55 PM, Ben Tay wrote: > Hi, > > I've email my school super computing staff and they told me that the > queue which I'm using is one meant for testing, hence, it's handling > of work load is not good. I've sent my job to another queue and it's > run on 4 processors. It's my own code because there seems to be > something wrong with the server displaying the summary when using - > log_summary with ex2f.F. I'm trying it again. > > Anyway comparing just kspsolve between the two, the speedup is about > 2.7. However, I noticed that for the 4 processors one, its > MatAssemblyBegin is 1.5158e+02, which is more than KSPSolve's > 4.7041e+00. You have a huge load imbalance in setting the values in the matrix (the load imbalance is 2254.7). Are you sure each process is setting about the same amount of matrix entries? Also are you doing an accurate matrix preallocation (see the detailed manual pages for MatMPIAIJSetPreallocation() and MatCreateMPIAIJ()). You can run with - info and grep for malloc to see if the MatSetValues() is allocating additional memory. If you get the matrix preallocation correct you will see a HUGE speed improvement. Barry > So is MatAssemblyBegin's time included in KSPSolve? If not, does it > mean that there's something wrong about my MatAssemblyBegin? > > Thank you > > For 1 processor: > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. 
Use 'enscript - > r -fCourier9' to print this document *** > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance > Summary: ---------------------------------------------- > > ./a.out on a atlas3 named atlas3-c28 with 1 processor, by g0306332 > Fri Apr 18 08:46:11 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST > 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.322e+02 1.00000 1.322e+02 > Objects: 2.200e+01 1.00000 2.200e+01 > Flops: 2.242e+08 1.00000 2.242e+08 2.242e+08 > Flops/sec: 1.696e+06 1.00000 1.696e+06 1.696e+06 > MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Reductions: 2.100e+01 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of > length N --> 2N flops > and VecAXPY() for complex vectors of > length N --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- > Messages --- -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts > %Total Avg %Total counts %Total > 0: Main Stage: 1.3217e+02 100.0% 2.2415e+08 100.0% 0.000e > +00 0.0% 0.000e+00 0.0% 2.100e+01 100.0% > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with > PetscLogStagePush() and PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in > this phase > %M - percent messages in this phase %L - percent message > lengths in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max > time over all processors) > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. 
# > ########################################################## > > > > Event Count Time (sec) Flops/ > sec --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg > len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 6 1.0 1.8572e-01 1.0 3.77e+08 1.0 0.0e+00 0.0e > +00 0.0e+00 0 31 0 0 0 0 31 0 0 0 377 > MatConvert 1 1.0 1.1636e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatAssemblyBegin 1 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 8.8531e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRow 1296000 1.0 2.6576e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatZeroEntries 1 1.0 4.4700e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPGMRESOrthog 6 1.0 2.1104e-01 1.0 5.16e+08 1.0 0.0e+00 0.0e > +00 6.0e+00 0 49 0 0 29 0 49 0 0 29 516 > KSPSetup 1 1.0 6.5601e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 1.2883e+01 1.0 1.74e+07 1.0 0.0e+00 0.0e > +00 1.5e+01 10100 0 0 71 10100 0 0 71 17 > PCSetUp 1 1.0 4.4342e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 2.0e+00 3 0 0 0 10 3 0 0 0 10 0 > PCApply 7 1.0 7.7337e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 6 0 0 0 0 6 0 0 0 0 0 > VecMDot 6 1.0 9.8586e-02 1.0 5.52e+08 1.0 0.0e+00 0.0e > +00 6.0e+00 0 24 0 0 29 0 24 0 0 29 552 > VecNorm 7 1.0 6.9757e-02 1.0 2.60e+08 1.0 0.0e+00 0.0e > +00 7.0e+00 0 8 0 0 33 0 8 0 0 33 260 > VecScale 7 1.0 2.9803e-02 1.0 3.04e+08 1.0 0.0e+00 0.0e > +00 0.0e+00 0 4 0 0 0 0 4 0 0 0 304 > VecCopy 1 1.0 6.1009e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 9 1.0 3.1438e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 1 1.0 7.5161e-03 1.0 3.45e+08 1.0 0.0e+00 0.0e > +00 0.0e+00 0 1 0 0 0 0 1 0 0 0 345 > VecMAXPY 7 1.0 1.4444e-01 1.0 4.85e+08 1.0 0.0e+00 0.0e > +00 0.0e+00 0 31 0 0 0 0 31 0 0 0 485 > VecAssemblyBegin 2 1.0 4.2915e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 6.0e+00 0 0 0 0 29 0 0 0 0 29 0 > VecAssemblyEnd 2 1.0 6.9141e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecNormalize 7 1.0 9.9603e-02 1.0 2.73e+08 1.0 0.0e+00 0.0e > +00 7.0e+00 0 12 0 0 33 0 12 0 0 33 273 > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' > Mem. 
> > --- Event Stage 0: Main Stage > > Matrix 1 1 98496004 0 > Krylov Solver 1 1 17216 0 > Preconditioner 1 1 272 0 > Vec 19 19 186638392 0 > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > ====================================================================== > Average time to get PetscTime(): 9.53674e-08 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Wed Jan 9 14:33:02 2008 > Configure options: --with-cc=icc --with-fc=ifort --with-x=0 --with- > blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared --with- > mpi-dir=/lsftmp/g0306332/mpich2/ --with-debugging=0 --with-hypre- > dir=/home/enduser/g0306332/lib/hypre_shared > ----------------------------------------- > Libraries compiled on Wed Jan 9 14:33:36 SGT 2008 on atlas3-c01 > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed > Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > Using PETSc directory: /home/enduser/g0306332/petsc-2.3.3-p8 > Using PETSc arch: atlas3 > ----------------------------------------- > Using C compiler: icc -fPIC -O > > for 4 processors > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript - > r -fCourier9' to print this document *** > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance > Summary: ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c23 with 4 processors, by > g0306332 Fri Apr 18 08:22:11 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST > 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > 0.000000000000000E+000 58.1071298622710 > 0.000000000000000E+000 58.1071298622710 > 0.000000000000000E+000 58.1071298622710 > 0.000000000000000E+000 58.1071298622710 > Time (sec): 3.308e+02 1.00177 3.305e+02 > Objects: 2.900e+01 1.00000 2.900e+01 > Flops: 5.605e+07 1.00026 5.604e+07 2.242e+08 > Flops/sec: 1.697e+05 1.00201 1.695e+05 6.782e+05 > MPI Messages: 1.400e+01 2.00000 1.050e+01 4.200e+01 > MPI Message Lengths: 1.248e+05 2.00000 8.914e+03 3.744e+05 > MPI Reductions: 7.500e+00 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of > length N --> 2N flops > and VecAXPY() for complex vectors of > length N --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- > Messages --- -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts > %Total Avg %Total counts %Total > 0: Main Stage: 3.3051e+02 100.0% 2.2415e+08 100.0% 4.200e+01 > 100.0% 8.914e+03 100.0% 3.000e+01 100.0% > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. 
> Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with > PetscLogStagePush() and PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in > this phase > %M - percent messages in this phase %L - percent message > lengths in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max > time over all processors) > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. # > # preloading. otherwise timing numbers may be # > # meaningless. # > ########################################################## > > > Event Count Time (sec) Flops/ > sec --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg > len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 6 1.0 8.2640e-02 1.6 3.37e+08 1.6 3.6e+01 9.6e > +03 0.0e+00 0 31 86 92 0 0 31 86 92 0 846 > MatConvert 1 1.0 2.1472e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 1.5158e+022254.7 0.00e+00 0.0 0.0e+00 > 0.0e+00 2.0e+00 22 0 0 0 7 22 0 0 0 7 0 > MatAssemblyEnd 1 1.0 1.5766e-01 1.1 0.00e+00 0.0 6.0e+00 4.8e > +03 7.0e+00 0 0 14 8 23 0 0 14 8 23 0 > MatGetRow 324000 1.0 8.9608e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 2 1.0 5.9605e-06 2.8 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatZeroEntries 1 1.0 5.8902e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPGMRESOrthog 6 1.0 1.1247e-01 1.7 4.11e+08 1.7 0.0e+00 0.0e > +00 6.0e+00 0 49 0 0 20 0 49 0 0 20 968 > KSPSetup 1 1.0 1.5483e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 4.7041e+00 1.0 1.19e+07 1.0 3.6e+01 9.6e > +03 1.5e+01 1100 86 92 50 1100 86 92 50 48 > PCSetUp 1 1.0 1.5953e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 2.0e+00 0 0 0 0 7 0 0 0 0 7 0 > PCApply 7 1.0 2.6580e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecMDot 6 1.0 7.3443e-02 2.2 4.13e+08 2.2 0.0e+00 0.0e > +00 6.0e+00 0 24 0 0 20 0 24 0 0 20 741 > VecNorm 7 1.0 2.5193e-01 1.1 1.94e+07 1.1 0.0e+00 0.0e > +00 7.0e+00 0 8 0 0 23 0 8 0 0 23 72 > VecScale 7 1.0 6.6319e-03 2.8 9.64e+08 2.8 0.0e+00 0.0e > +00 0.0e+00 0 4 0 0 0 0 4 0 0 0 1368 > VecCopy 1 1.0 2.3100e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 9 1.0 1.4173e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 1 1.0 2.9502e-03 1.7 3.72e+08 1.7 0.0e+00 0.0e > +00 0.0e+00 0 1 0 0 0 0 1 0 0 0 879 > VecMAXPY 7 1.0 4.9046e-02 1.4 5.09e+08 1.4 0.0e+00 0.0e > +00 0.0e+00 0 31 0 0 0 0 31 0 0 0 1427 > VecAssemblyBegin 2 1.0 4.3297e-04 3.1 0.00e+00 0.0 0.0e+00 0.0e > +00 6.0e+00 0 0 0 0 20 0 0 0 0 20 0 > VecAssemblyEnd 2 1.0 5.2452e-06 1.4 
0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecScatterBegin 6 1.0 6.9666e-04 6.3 0.00e+00 0.0 3.6e+01 9.6e > +03 0.0e+00 0 0 86 92 0 0 0 86 92 0 0 > VecScatterEnd 6 1.0 1.4806e-02102.6 0.00e+00 0.0 0.0e+00 > 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecNormalize 7 1.0 2.5431e-01 1.1 2.86e+07 1.1 0.0e+00 0.0e > +00 7.0e+00 0 12 0 0 23 0 12 0 0 23 107 > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' > Mem. > > --- Event Stage 0: Main Stage > > Matrix 3 3 49252812 0 > Krylov Solver 1 1 17216 0 > Preconditioner 1 1 272 0 > Index Set 2 2 5488 0 > Vec 21 21 49273624 0 > Vec Scatter 1 1 0 0 > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > ====================================================================== > Average time to get PetscTime(): 1.90735e-07 > Average time for MPI_Barrier(): 5.62668e-06 > Average time for zero size MPI_Send(): 6.73532e-06 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > > >>>> >>>> >>> >> >> >> From recrusader at gmail.com Fri Apr 18 20:40:04 2008 From: recrusader at gmail.com (Yujie) Date: Fri, 18 Apr 2008 18:40:04 -0700 Subject: how to combine several matrice into one matrix Message-ID: <7ff0ee010804181840i2195c9e1wf6757ce6faff5a72@mail.gmail.com> Hi, everyone Assuming there are A1(M*N) A2(M*N) A3(M*N), I want to get A1 A=A2 A3 My method is MatGetArray(A1,&a1); MatSetValues(A,a1); MatGetArray(A2,&a2); MatSetValues(A,a2); MatGetArray(A3,&a3); MatSetValues(A,a3); Is there any better methods for it? The above codes are slow. thanks a lot. Regards, Yujie -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at mcs.anl.gov Fri Apr 18 21:12:00 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 18 Apr 2008 21:12:00 -0500 Subject: how to combine several matrice into one matrix In-Reply-To: <7ff0ee010804181840i2195c9e1wf6757ce6faff5a72@mail.gmail.com> References: <7ff0ee010804181840i2195c9e1wf6757ce6faff5a72@mail.gmail.com> Message-ID: <23BC215B-E85A-401B-A1BE-96A4CACF183F@mcs.anl.gov> For dense matrices only. You can call MatGetArray() on A and then do direct copies of the arrays. Barry On Apr 18, 2008, at 8:40 PM, Yujie wrote: > Hi, everyone > > Assuming there are A1(M*N) A2(M*N) A3(M*N), I want to get > A1 > A=A2 > A3 > > My method is > > MatGetArray(A1,&a1); > MatSetValues(A,a1); > MatGetArray(A2,&a2); > MatSetValues(A,a2); > MatGetArray(A3,&a3); > MatSetValues(A,a3); > > Is there any better methods for it? The above codes are slow. thanks > a lot. 
> > Regards, > Yujie > > From zonexo at gmail.com Fri Apr 18 23:11:34 2008 From: zonexo at gmail.com (Ben Tay) Date: Sat, 19 Apr 2008 12:11:34 +0800 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <48060CD5.1010308@ethz.ch> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> <480602AF.5060802@gmail.com> <2035258F-8269-4231-9110-DE9188AE5FC8@mcs.anl.gov> <48060CD5.1010308@ethz.ch> Message-ID: <480970F6.5060007@gmail.com> An HTML attachment was scrubbed... URL: From balay at mcs.anl.gov Sat Apr 19 08:52:51 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Sat, 19 Apr 2008 08:52:51 -0500 (CDT) Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <480970F6.5060007@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> <480602AF.5060802@gmail.com> <2035258F-8269-4231-9110-DE9188AE5FC8@mcs.anl.gov> <48060CD5.1010308@ethz.ch> <480970F6.5060007@gmail.com> Message-ID: On Sat, 19 Apr 2008, Ben Tay wrote: > Btw, I'm not able to try the latest mpich2 because I do not have the > administrator rights. I was told that some special configuration is > required. You don't need admin rights to install/use MPICH with the options I mentioned. I was sugesting just running in SMP mode on a single machine [from 1-8 procs on Quad-Core Intel Xeon X5355, to compare with my SMP runs] with: ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker > Btw, should there be any different in speed whether I use mpiuni and > ifort or mpi and mpif90? I tried on ex2f (below) and there's only a > small difference. If there is a large difference (mpi being slower), > then it mean there's something wrong in the code? For one - you are not using MPIUNI. You are using --with-mpi-dir=/lsftmp/g0306332/mpich2. However - if compilers are the same & compiler options are the same, I would expect the same performance in both the cases. Do you get such different times for different runs of the same binary? MatMult 384 vs 423 What if you run both of the binaries on the same machine? [as a single job?]. If you are using pbs scheduler - sugest doing: - squb -I [to get interactive access to thenodes] - login to each node - to check no one else is using the scheduled nodes. - run multiple jobs during this single allocation for comparision. These are general tips to help you debug performance on your cluster. BTW: I get: ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 397 You get: log.1:MatMult???????????? 1879 1.0 2.8137e+01 1.0 3.84e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 11? 0? 0? 0? 12 11? 0? 0? 0?? 384 There is a difference in number of iterations. Are you sure you are using the same ex2f with -m 600 -n 600 options? 
Satish From zonexo at gmail.com Sat Apr 19 10:18:49 2008 From: zonexo at gmail.com (Ben Tay) Date: Sat, 19 Apr 2008 23:18:49 +0800 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> <480602AF.5060802@gmail.com> <2035258F-8269-4231-9110-DE9188AE5FC8@mcs.anl.gov> <48060CD5.1010308@ethz.ch> <480970F6.5060007@gmail.com> Message-ID: <480A0D59.9050804@gmail.com> Hi Satish, 1st of all, I forgot to inform u that I've changed the m and n to 800. I would like to see if the larger value can make the scaling better. If req, I can redo the test with m,n=600. I can install MPICH but I don't think I can choose to run on a single machine using from 1 to 8 procs. In order to run the code, I usually have to use the command bsub -o log -q linux64 ./a.out for single procs bsub -o log -q mcore_parallel -n $ -a mvapich mpirun.lsf ./a.out where $=no. of procs. for multiple procs After that, when the job is running, I'll be given the server which my job runs on e.g. atlas3-c10 (1 procs) or 2*atlas3-c10 + 2*atlas3-c12 (4 procs) or 2*atlas3-c10 + 2*atlas3-c12 +2*atlas3-c11 + 2*atlas3-c13 (8 procs). I was told that 2*atlas3-c10 doesn't mean that it is running on a dual core single cpu. Btw, are you saying that I should 1st install the latest MPICH2 build with the option : ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker And then install PETSc with the MPICH2? So after that do you know how to do what you've suggest for my servers? I don't really understand what you mean. May I supposed to run 4 jobs on 1 quadcore? Or 1 job using 4 cores on 1 quadcore? Well, I do know that atlas3-c00 to c03 are the location of the quad cores. I can force to use them by bsub -o log -q mcore_parallel -n $ -m quadcore -a mvapich mpirun.lsf ./a.out Lastly, I make a mistake in the different times reported by the same compiler. Sorry abt that. Thank you very much. Satish Balay wrote: > On Sat, 19 Apr 2008, Ben Tay wrote: > > >> Btw, I'm not able to try the latest mpich2 because I do not have the >> administrator rights. I was told that some special configuration is >> required. >> > > You don't need admin rights to install/use MPICH with the options I > mentioned. I was sugesting just running in SMP mode on a single > machine [from 1-8 procs on Quad-Core Intel Xeon X5355, to compare with > my SMP runs] with: > > ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker > > >> Btw, should there be any different in speed whether I use mpiuni and >> ifort or mpi and mpif90? I tried on ex2f (below) and there's only a >> small difference. If there is a large difference (mpi being slower), >> then it mean there's something wrong in the code? >> > > For one - you are not using MPIUNI. You are using > --with-mpi-dir=/lsftmp/g0306332/mpich2. However - if compilers are the > same & compiler options are the same, I would expect the same > performance in both the cases. Do you get such different times for > different runs of the same binary? > > MatMult 384 vs 423 > > What if you run both of the binaries on the same machine? [as a single > job?]. > > If you are using pbs scheduler - sugest doing: > - squb -I [to get interactive access to thenodes] > - login to each node - to check no one else is using the scheduled nodes. 
> - run multiple jobs during this single allocation for comparision. > > These are general tips to help you debug performance on your cluster. > > BTW: I get: > ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 397 > > You get: > log.1:MatMult 1879 1.0 2.8137e+01 1.0 3.84e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 11 0 0 0 12 11 0 0 0 384 > > > There is a difference in number of iterations. Are you sure you are > using the same ex2f with -m 600 -n 600 options? > > Satish From balay at mcs.anl.gov Sat Apr 19 13:19:34 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Sat, 19 Apr 2008 13:19:34 -0500 (CDT) Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <480A0D59.9050804@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> <480602AF.5060802@gmail.com> <2035258F-8269-4231-9110-DE9188AE5FC8@mcs.anl.gov> <48060CD5.1010308@ethz.ch> <480970F6.5060007@gmail.com> <480A0D59.9050804@gmail.com> Message-ID: Ben, This conversation is getting long and winding. And we are are getting into your cluster adminstration - which is not PETSc related. I'll sugest you figureout about using the cluster from your system admin and how to use bsub. http://www.vub.ac.be/BFUCC/LSF/man/bsub.1.html However I'll point out the following things. - I'll sugest learning about scheduling an interactive job on your cluster. This will help you with running multiple jobs on the same machine. - When making comparisions, have minimum changes between thing you compare runs. * For eg: you are comparing runs between different queues '-q linux64' '-q mcore_parallel'. There might be differences here that can result in different performance. * If you are getting part of the machine [for -n 1 jobs] - verify if you are sharing the other part with some other job. Without this verification - your numbers are not meaningful. [depending upon how the queue is configured - it can either allocate part of the node or full node] * you should be able to request 4procs [i.e 1 complete machine] but be able to run either -np 1, 2 or 4 on the allocation. [This is easier to do in interactive mode]. This ensures nobody else is using the machine. And you can run your code multiple times - to see if you are getting consistant results. Regarding the primary issue you've had - with performance debugging your PETSc appliation in *SMP-mode*, we've observed performance anamolies in your log_summary for both your code, and ex2.f.F This could be due one or more of the following: - issues in your code - issues with MPI you are using - isues with the cluster you are using. To narrow down - the comparisions I sugest: - compare my ex2f.F with the *exact* same runs on your machine [You've claimed that you also hav access to a 2-quad-core Intel Xeon X5355 machine]. So you should be able to reproduce the exact same experiment as me - and compare the results. This should keep both software same - and show differences in system software etc.. >>>>> ? No of Nodes Processors Qty per node Total cores per node Memory per node ? ? 4 Quad-Core Intel Xeon X5355 2 8 16 GB ? ^^^ ? 
60 Dual-Core Intel Xeon 5160 2 4 8 GB <<<<< i.e configure latest mpich2 with [default compilers gcc/gfortran]: ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker Build PETSc with this MPI [and same compilers] ./config/configure.py --with-mpi-dir= --with-debugging=0 And run ex2f.F 600x600 on 1, 2, 4, 8 procs on a *single* X5355 machine. [it might have a different queue name] - Now compare ex2f.F performance wtih MPICH [as built above] and the current MPI you are using. This should identify the performance differences between MPI implemenations within the box [within the SMP box] - Now compare runs between ex2f.F and your application. At each of the above steps of comparision - we are hoping to identify the reason for differences and rectify. Perhaps this is not possible on your cluster and you can't improve on what you already have.. If you can't debug the SMP performance issues, you can avoid SMP completely, and use 1 MPI task per machine [or 1 MPI task per memory bank => 2 per machine]. But you'll still have to do similar analysis to make sure there are no performance anamolies in the tool chain. [i.e hardware, system software, MPI, application] If you are willing to do the above steps, we can help with the comparisions. As mentioned - this is getting long and windy. If you have futher questions in this regard - we should contiune it at petsc-maint at mcs.anl.gov Satish On Sat, 19 Apr 2008, Ben Tay wrote: > Hi Satish, > > 1st of all, I forgot to inform u that I've changed the m and n to 800. I would > like to see if the larger value can make the scaling better. If req, I can > redo the test with m,n=600. > > I can install MPICH but I don't think I can choose to run on a single machine > using from 1 to 8 procs. In order to run the code, I usually have to use the > command > > bsub -o log -q linux64 ./a.out for single procs > > bsub -o log -q mcore_parallel -n $ -a mvapich mpirun.lsf ./a.out where $=no. > of procs. for multiple procs > > After that, when the job is running, I'll be given the server which my job > runs on e.g. atlas3-c10 (1 procs) or 2*atlas3-c10 + 2*atlas3-c12 (4 procs) or > 2*atlas3-c10 + 2*atlas3-c12 +2*atlas3-c11 + 2*atlas3-c13 (8 procs). I was told > that 2*atlas3-c10 doesn't mean that it is running on a dual core single cpu. > > Btw, are you saying that I should 1st install the latest MPICH2 build with the > option : > > ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker And then install > PETSc with the MPICH2? > > So after that do you know how to do what you've suggest for my servers? I > don't really understand what you mean. May I supposed to run 4 jobs on 1 > quadcore? Or 1 job using 4 cores on 1 quadcore? Well, I do know that > atlas3-c00 to c03 are the location of the quad cores. I can force to use them > by > > bsub -o log -q mcore_parallel -n $ -m quadcore -a mvapich mpirun.lsf ./a.out > > Lastly, I make a mistake in the different times reported by the same compiler. > Sorry abt that. > > Thank you very much. 
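The next message returns to the earlier question of stacking three dense matrices A1, A2, A3 into one matrix A. As context for Barry's suggestion there to copy the raw arrays directly, here is a minimal sketch of what such a copy could look like. It is not part of the original exchange: sequential (SeqDense) matrices are assumed, the function name and the M/N arguments are made up, and the calls are the PETSc 2.3.x-era ones used elsewhere in these threads. PETSc stores dense matrices column by column, which is what the index arithmetic below relies on.

#include "petscmat.h"

/* Sketch only: stack three M x N SeqDense matrices into one (3M) x N SeqDense
   matrix by copying the raw arrays obtained with MatGetArray(). */
PetscErrorCode StackThreeDense(Mat A1, Mat A2, Mat A3, PetscInt M, PetscInt N, Mat *A)
{
  PetscErrorCode ierr;
  PetscScalar    *a, *a1, *a2, *a3;
  PetscInt       i, j;

  PetscFunctionBegin;
  ierr = MatCreateSeqDense(PETSC_COMM_SELF, 3*M, N, PETSC_NULL, A);CHKERRQ(ierr);
  ierr = MatGetArray(*A, &a);CHKERRQ(ierr);
  ierr = MatGetArray(A1, &a1);CHKERRQ(ierr);
  ierr = MatGetArray(A2, &a2);CHKERRQ(ierr);
  ierr = MatGetArray(A3, &a3);CHKERRQ(ierr);
  for (j = 0; j < N; j++) {               /* column j of the stacked matrix */
    for (i = 0; i < M; i++) {
      a[j*3*M + i]       = a1[j*M + i];   /* rows 0  .. M-1  come from A1 */
      a[j*3*M + M + i]   = a2[j*M + i];   /* rows M  .. 2M-1 come from A2 */
      a[j*3*M + 2*M + i] = a3[j*M + i];   /* rows 2M .. 3M-1 come from A3 */
    }
  }
  ierr = MatRestoreArray(A1, &a1);CHKERRQ(ierr);
  ierr = MatRestoreArray(A2, &a2);CHKERRQ(ierr);
  ierr = MatRestoreArray(A3, &a3);CHKERRQ(ierr);
  ierr = MatRestoreArray(*A, &a);CHKERRQ(ierr);
  ierr = MatAssemblyBegin(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

A copy of this kind replaces the per-entry MatSetValues() calls of the original code, which is presumably where the time was going.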
From recrusader at gmail.com Sat Apr 19 18:08:50 2008 From: recrusader at gmail.com (Yujie) Date: Sat, 19 Apr 2008 16:08:50 -0700 Subject: how to combine several matrice into one matrix In-Reply-To: <23BC215B-E85A-401B-A1BE-96A4CACF183F@mcs.anl.gov> References: <7ff0ee010804181840i2195c9e1wf6757ce6faff5a72@mail.gmail.com> <23BC215B-E85A-401B-A1BE-96A4CACF183F@mcs.anl.gov> Message-ID: <7ff0ee010804191608t120e5fa2hbafbaf243b22440b@mail.gmail.com> Dear Barry: Regarding my method, On 4/18/08, Barry Smith wrote: > > > For dense matrices only. > > You can call MatGetArray() on A and then do direct copies of the arrays. > > Barry > > On Apr 18, 2008, at 8:40 PM, Yujie wrote: > > Hi, everyone > > > > Assuming there are A1(M*N) A2(M*N) A3(M*N), I want to get > > A1 > > A=A2 > > A3 > > > > My method is > > > > MatGetArray(A1,&a1); > > MatSetValues(A,a1); > > MatGetArray(A2,&a2); > > MatSetValues(A,a2); > > MatGetArray(A3,&a3); > > MatSetValues(A,a3); > > > > Is there any better methods for it? The above codes are slow. thanks a > > lot. > > > > Regards, > > Yujie > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From w_subber at yahoo.com Sun Apr 20 11:43:25 2008 From: w_subber at yahoo.com (Waad Subber) Date: Sun, 20 Apr 2008 09:43:25 -0700 (PDT) Subject: MatMatMult Message-ID: <3922.29302.qm@web38202.mail.mud.yahoo.com> Hi I want to multiply two sparse sequential matrices. In order to do that I think I should use MatMatMult ; however, I need the expected fill ratio which I don't know in advance. For matrix A and matrix B I might get the nnz(A) and nnz(B) from MatGetInfo. What about nnz(C) ? Thanks Waad --------------------------------- Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at mcs.anl.gov Sun Apr 20 12:59:29 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Sun, 20 Apr 2008 12:59:29 -0500 Subject: MatMatMult In-Reply-To: <3922.29302.qm@web38202.mail.mud.yahoo.com> References: <3922.29302.qm@web38202.mail.mud.yahoo.com> Message-ID: <4838A11C-49C9-4B40-9A3C-3A96C06696C0@mcs.anl.gov> Waad, There is no way to compute this in advance. Use PETSC_DEFAULT to use the default estimate. You can run the program with -info and search for "Fill ratio" this will give you needed value which will give you an idea of what to use in the future. I have added this info to the manual page. Barry Note that the needed ratio does depend on the matrix size so you may need to adjust the value for larger matrices. On Apr 20, 2008, at 11:43 AM, Waad Subber wrote: > Hi > > I want to multiply two sparse sequential matrices. In order to do > that I think I should use MatMatMult ; however, I need the expected > fill ratio which I don't know in advance. > > For matrix A and matrix B I might get the nnz(A) and nnz(B) from > MatGetInfo. What about nnz(C) ? > > Thanks > > Waad > > > > Be a better friend, newshound, and know-it-all with Yahoo! Mobile. > Try it now. 
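As a quick illustration of the answer above, a call with the default fill estimate could look like the following sketch; it is not from the original thread, the function name is made up, and A and B are assumed to be already assembled sequential sparse matrices.

#include "petscmat.h"

/* Sketch only: form C = A*B letting PETSc estimate nnz(C). */
PetscErrorCode MultiplySketch(Mat A, Mat B, Mat *C)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  /* PETSC_DEFAULT uses the default fill estimate; running with -info and
     searching for "Fill ratio" shows the value that was actually needed,
     which can then be passed here instead of PETSC_DEFAULT. */
  ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, C);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}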
From amjad11 at gmail.com Mon Apr 21 00:34:07 2008 From: amjad11 at gmail.com (amjad ali) Date: Mon, 21 Apr 2008 10:34:07 +0500 Subject: general question on speed using quad core Xeons In-Reply-To: <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> Message-ID: <428810f20804202234x3dc79246j88e3afccff7bec5b@mail.gmail.com> Hello Petsc team (especially Satish and Barry). YOU SAID: FOR Better performance (1) high per-CPU memory performance. Each CPU (core in dual core systems) needs to have its own memory bandwith of roughly 2 or more gigabytes. (2) MEMORY BANDWDITH PER CORE, the higher that is the better performance you get. >From these points I started to look for RAM Sticks with higher MHz rates (and obviously CPUs and motherboards supporting this speed). But you also reflected to: http://www.intel.com/performance/server/xeon/hpc_ansys.htm http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm On these pages you pointed out that: systems with CPUs of 20% higher FSB speed are performing 20% better. But you see also RAM speed is 20% higher for the better performing system (i.e 800MHz vs 667 MHz). So my question is that which is the actual indicator of "memory bandwidth"per core? Whether it is (1) CPU's FSB speed (2) RAM speed (3) Motherboard's System Bus Speed. How we could ensure "memory bandwith of roughly 2 or more gigabytes" per CPU core? (Higher CPU's FSB speed, or RAM speed or Motherboard's System Bus Speed). With best regards, Amjad Ali. On 4/16/08, Barry Smith wrote: > > > Cool. The pages to look at are > > http://www.intel.com/performance/server/xeon/hpc_ansys.htm > http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm > > these are the two benchmarks that reflect the bottlenecks of memory > bandwidth. > When going from dual to quad they get 1.2 times the performance, when one > would > like 2 times the performance. > > Barry > > > On Apr 16, 2008, at 9:27 AM, Satish Balay wrote: > > > Just a note: > > > > Intel does publish benchmarks for their chips. > > > > http://www.intel.com/performance/server/xeon/hpcapp.htm > > > > Satish > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tribur at vision.ee.ethz.ch Mon Apr 21 05:54:55 2008 From: tribur at vision.ee.ethz.ch (tribur at vision.ee.ethz.ch) Date: Mon, 21 Apr 2008 12:54:55 +0200 Subject: Schur system + MatShell Message-ID: <20080421125455.sieixiv1j40w4kg8@email.ee.ethz.ch> Dear all, Sorry for switching from Schur to Hypre and back, but I'm trying two approaches at the same time to find the optimal solution for our convection-diffusion/Stokes problems: a) solving the global stiffness matrix directly and in parallel using Petsc and a suitable preconditioner (???) and b) applying first non-overlapping domain decomposition and than solving the Schur complement system. Being concerned with b in the moment, I managed to set up and solve the global Schur system using MATDENSE. The solving works well with, e.g., gmres+jacobi, but the assembling of the global Schur matrix takes too long. Therefore, I'm trying to use the matrix in unassembled form using MatShell. 
Not very successfully, however: 1) When I use KSPGMRES, I got the error [1]PETSC ERROR: MatMult() line 1632 in src/mat/interface/matrix.c [1]PETSC ERROR: PCApplyBAorAB() line 584 in src/ksp/pc/interface/precon.c [1]PETSC ERROR: GMREScycle() line 159 in src/ksp/ksp/impls/gmres/gmres.c [1]PETSC ERROR: KSPSolve_GMRES() line 241 in src/ksp/ksp/impls/gmres/gmres.c [1]PETSC ERROR: KSPSolve() line 379 in src/ksp/ksp/interface/itfunc.c 2) Using KSPBICG, it iterates without error message, but the result is wrong (norm of residual 1.42768 instead of something like 1.0e-10), although my Mat-functions PETSC_SchurMatMult and PETSC_SchurMatMultTranspose seem to be correct. I tested the latter comparing the vectors y1 and y2 computed by, e.g., MatMult(S,x,y1) and PETS_SchurMatMult(S,x,y2). Norm(y1-y2) was < e-15 for both functions. Could you please have a look at my code snippet below? Thank you very much! Kathrin PS: My Code: Vec gtot, x; ... Mat Stot; IS is; ISCreateGeneral(PETSC_COMM_SELF, NPb, &uBId_global[0], &is); localData ctx; ctx.NPb = NPb; //size of local Schur system S ctx.Sloc = &S[0]; ctx.is = is; MatCreateShell(PETSC_COMM_WORLD,m,n,NPb_tot,NPb_tot,&ctx,&Stot); MatShellSetOperation(Stot,MATOP_MULT,(void(*)(void)) PETSC_SchurMatMult); MatShellSetOperation(Stot,MATOP_MULT_TRANSPOSE,(void(*)(void))PETSC_SchurMatMultTranspose); KSP ksp; KSPCreate(PETSC_COMM_WORLD,&ksp); PC prec; KSPSetOperators(ksp,Stot,Stot,DIFFERENT_NONZERO_PATTERN); KSPGetPC(ksp,&prec); PCSetType(prec, PCNONE); KSPSetType(ksp, KSPBICG); KSPSetTolerances(ksp, 1.e-10, 1.e-50,PETSC_DEFAULT,PETSC_DEFAULT); KSPSolve(ksp,gtot,x); ... From petsc-maint at mcs.anl.gov Mon Apr 21 09:18:47 2008 From: petsc-maint at mcs.anl.gov (Satish Balay) Date: Mon, 21 Apr 2008 09:18:47 -0500 (CDT) Subject: general question on speed using quad core Xeons In-Reply-To: <428810f20804202234x3dc79246j88e3afccff7bec5b@mail.gmail.com> References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> <428810f20804202234x3dc79246j88e3afccff7bec5b@mail.gmail.com> Message-ID: On Mon, 21 Apr 2008, amjad ali wrote: > Hello Petsc team (especially Satish and Barry). > > YOU SAID: FOR Better performance > > (1) high per-CPU memory performance. Each CPU (core in dual core systems) > needs to have its own memory bandwith of roughly 2 or more gigabytes. This 2GB/core number is a rabbit out of the hat. We just put some reference point out - a few years back for SMP machines [when the age of multi-core chips hasn't yet begun]. Now Intel has chipsets that can give 25GB/s. They now put 4 cores or 8 cores on this machine. [i.e 6Gb/s for 4core and 3Gb/s for the 8core machine] But the trend now is to cram more and more cores - so expect the number of cores to increase faster than the chipset memory-bandwidth. [i.e badwidth per core is likely to get smaller and smaller] > > (2) MEMORY BANDWDITH PER CORE, the higher that is the better performance you > get. > > From these points I started to look for RAM Sticks with higher MHz rates > (and obviously CPUs and motherboards supporting this speed). > > But you also reflected to: > > http://www.intel.com/performance/server/xeon/hpc_ansys.htm > http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm > > On these pages you pointed out that: systems with CPUs of 20% higher FSB > speed are performing 20% better. But you see also RAM speed is 20% higher > for the better performing system (i.e 800MHz vs 667 MHz). 
> > So my question is that which is the actual indicator of "memory > bandwidth"per core? > Whether it is > (1) CPU's FSB speed > (2) RAM speed > (3) Motherboard's System Bus Speed. The answer is a bit complicated here. It depends upon the system architure. CPU Chip[s] <-----> chipset <-----> memory [banks] - Is the bandwidth on the CPU-Chip side is same as on the memory side? [there are machines where this is different, but most macines use *synchronous* buses - so that the 'memory chipset' does not have to do translation/buffering] For eg - On intel Xeon machine with DDR2-800 - you have [othe memory bus side]: bandwidth = 2(banks)* 2(ddr)* 8(bytes bus) * 800 MHz/sec * = 25.6GByte/sec The othe CPU side - its balanced by FSB1600 => Bandwidth = 1600MHz * 8(bytes bus)* 2(CPU-chips) = 25.6GByte/se So generally all the 3 things you've listed has to *match* correctly. [Some CPUs and chipsets support multiple FSB frequencies - so have to check what freq is set for the machine you are buying.] This choice can have *cost* implications.. Is it worth it to spend 20% more to get 20%more bandwidth? Perhaps yes for sparse-matrix appliations - but not for others.. > How we could ensure "memory bandwith of roughly 2 or more gigabytes" per CPU > core? (Higher CPU's FSB speed, or RAM speed or Motherboard's System Bus > Speed). As mentioned 2GB/core is a approximate nubmer we thought off a few years back - when there were no multi-core machine [just SMP chipsets]. All we can do is eavalue the memorybandwidth number for a given machine. We can't *ensure* it - as this is a choice made by and other chip designers.[intel, amd, ibm etc..] The choice for the currently available products was probably made a few years back. There is another component to this memory bandwidth debate. Which of the following do we want? 1. best scalability chip? [when comparing the performance from 1-N cores] 2. overall best performance on 1-core. or N cores [i.e node]. And from the system architecture issues - mentioned above - there are a couple of other issues that influcene this. - are the CPU-Chips sharing bandwidth or spliting bandwidth? - within the CPU-Chip [multi-core] is the memory bus shared or split? The first one can achieved by the hardware spliting up 1/Nth total available bandwidth per core. So it shows scalable results. But the 1-core performance can be low. The second choice could happen by not spliting - but sharing at the core level. For eg: Intel machines - memory bandwidth is divided at the CPU-chip level. For the example case MatMult from ex2 on 8-core intel machine had the following performance on 1,2,4,8 cores: 397, 632, 724, 749 [MFlop/s] To me - its not clear which architecture is better. For publishing scalability results - the above numbers don't look good. [but it could be the best performance you can squeze out any sequential job - or out of any 8-core architecture] Satish From jed at 59A2.org Mon Apr 21 09:53:30 2008 From: jed at 59A2.org (Jed Brown) Date: Mon, 21 Apr 2008 16:53:30 +0200 Subject: flexible block matrix Message-ID: <20080421145330.GA1994@brakk.ethz.ch> I am solving a Stokes problem with nonlinear slip boundary conditions. I don't think I can take advantage of block structure since the normal component of velocity has a Dirichlet constraint and this must be built into the velocity space in order to preserve conditioning. 
An alternative formulation involves a Lagrange multiplier for the constraint, but even with clever preconditioning, this system is still more expensive to solve according to [1]. In solving the (velocity-pressure) saddle point problem, many approximate solves with the velocity system is needed in the preconditioner, hence I need a strong preconditioner for the velocity system. Currently, I am using algebraic multigrid on a low-order discretization which works fairly well. Since Hypre and ML only take AIJ matrices, perhaps I shouldn't worry about blocking after all. Is there a way to use MATBAIJ when some nodes have fewer degrees of freedom? Should I bother? Note that my method (currently just a single element) uses a high order discretization on some elements and low order on others. The global matrix for the low order elements is assembled, but it is applied locally for the high order elements taking advantage of the tensor product basis. For the preconditioner, a low order discretization on the nodes of the high order elements is globally assembled and added to the global matrix from the low-order elements. Experiments with a single element (spectral rather than spectral/hp element) show this to be effective, converging in a constant number of iterations independent of polynomial order when using a V-cycle of AMG as a preconditioner. Thanks. Jed [1] B?nsch, H?hn 2000, `Numerical treatment of the Navier-Stokes equations with slip boundary conditions', SIAM J. Sci. Comput. From balay at mcs.anl.gov Mon Apr 21 10:33:39 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 21 Apr 2008 10:33:39 -0500 (CDT) Subject: flexible block matrix In-Reply-To: <20080421145330.GA1994@brakk.ethz.ch> References: <20080421145330.GA1994@brakk.ethz.ch> Message-ID: On Mon, 21 Apr 2008, Jed Brown wrote: > I am solving a Stokes problem with nonlinear slip boundary conditions. I don't > think I can take advantage of block structure since the normal component of > velocity has a Dirichlet constraint and this must be built into the velocity > space in order to preserve conditioning. An alternative formulation involves a > Lagrange multiplier for the constraint, but even with clever preconditioning, > this system is still more expensive to solve according to [1]. > > In solving the (velocity-pressure) saddle point problem, many approximate solves > with the velocity system is needed in the preconditioner, hence I need a strong > preconditioner for the velocity system. Currently, I am using algebraic > multigrid on a low-order discretization which works fairly well. Since Hypre > and ML only take AIJ matrices, perhaps I shouldn't worry about blocking after > all. Is there a way to use MATBAIJ when some nodes have fewer degrees of > freedom? Should I bother? I'll say - don't bother. BAIJ can't support varing block size. The code that supports it is INODE code - which is already part of AIJ type - and is the default for AIJ. You can run your code with -mat_no_inode to see the performance difference between basic AIJ and INODE-AIJ. [The primary thing to look for in -log_summary is MatMult()] Inode code looks for consequitive *rows* with same column indices, and marks them as a single inode. For each inode - [i.e say 5 rows] the column indices are loaded only once, and used for all 5 rows - thus improving the performance. A matrix can have an inode structure of [2,2,3,3,1,3] etc.. i.e 14x14 matrix. 
Satish

> Note that my method (currently just a single element) uses a high order
> discretization on some elements and low order on others. The global matrix for
> the low order elements is assembled, but it is applied locally for the high order
> elements taking advantage of the tensor product basis. For the preconditioner,
> a low order discretization on the nodes of the high order elements is globally
> assembled and added to the global matrix from the low-order elements.
> Experiments with a single element (spectral rather than spectral/hp element)
> show this to be effective, converging in a constant number of iterations
> independent of polynomial order when using a V-cycle of AMG as a preconditioner.
>
> Thanks.
>
> Jed
>
>
> [1] B?nsch, H?hn 2000, `Numerical treatment of the Navier-Stokes equations with
> slip boundary conditions', SIAM J. Sci. Comput.
>
>
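To make the inode explanation above concrete, here is a small sketch that is not part of the original thread: the function name, sizes and values are invented, but it builds the kind of AIJ matrix the inode code benefits from, namely one where the two rows belonging to each point carry identical column indices. Running such a code with and without -mat_no_inode and comparing the MatMult line of -log_summary shows the difference being described.

#include "petscmat.h"

/* Sketch only: 2 interleaved unknowns per point, so rows 2p and 2p+1 have
   the same column pattern and are grouped into a single inode of size 2. */
PetscErrorCode BuildInterleavedAIJ(PetscInt npoints, Mat *A)
{
  PetscErrorCode ierr;
  PetscInt       p, rows[2], cols[2];
  PetscScalar    v[4] = {4.0, -1.0, -1.0, 4.0};   /* illustrative 2x2 block */

  PetscFunctionBegin;
  ierr = MatCreateSeqAIJ(PETSC_COMM_SELF, 2*npoints, 2*npoints, 2, PETSC_NULL, A);CHKERRQ(ierr);
  for (p = 0; p < npoints; p++) {
    rows[0] = cols[0] = 2*p;
    rows[1] = cols[1] = 2*p + 1;
    /* both rows of this point get the same column indices */
    ierr = MatSetValues(*A, 2, rows, 2, cols, v, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}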
From balay at mcs.anl.gov Mon Apr 21 10:55:21 2008
From: balay at mcs.anl.gov (Satish Balay)
Date: Mon, 21 Apr 2008 10:55:21 -0500 (CDT)
Subject: Schur system + MatShell
In-Reply-To: <20080421125455.sieixiv1j40w4kg8@email.ee.ethz.ch>
References: <20080421125455.sieixiv1j40w4kg8@email.ee.ethz.ch>
Message-ID: 

On Mon, 21 Apr 2008, tribur at vision.ee.ethz.ch wrote:

> Dear all,
>
> Sorry for switching from Schur to Hypre and back, but I'm trying two
> approaches at the same time to find the optimal solution for our
> convection-diffusion/Stokes problems: a) solving the global stiffness matrix
> directly and in parallel using Petsc and a suitable preconditioner (???) and
> b) applying first non-overlapping domain decomposition and than solving the
> Schur complement system.
>
> Being concerned with b in the moment, I managed to set up and solve the global
> Schur system using MATDENSE. The solving works well with, e.g., gmres+jacobi,
> but the assembling of the global Schur matrix takes too long.
Hmm - with dense - if you have some other efficient way of assembling the matrix - you can specify this directly to MatCreateMPIDense() - [or use MatGetArray() - and set the values directly into this array] > Therefore, I'm > trying to use the matrix in unassembled form using MatShell. Not very > successfully, however: > > 1) When I use KSPGMRES, I got the error > [1]PETSC ERROR: MatMult() line 1632 in src/mat/interface/matrix.c > [1]PETSC ERROR: PCApplyBAorAB() line 584 in src/ksp/pc/interface/precon.c > [1]PETSC ERROR: GMREScycle() line 159 in src/ksp/ksp/impls/gmres/gmres.c > [1]PETSC ERROR: KSPSolve_GMRES() line 241 in src/ksp/ksp/impls/gmres/gmres.c > [1]PETSC ERROR: KSPSolve() line 379 in src/ksp/ksp/interface/itfunc.c Which version of PETSc is this? I can't place the line numbers correctly with latest petsc-2.3.3. [Can you send the complete error trace?] > 2) Using KSPBICG, it iterates without error message, but the result is wrong > (norm of residual 1.42768 instead of something like 1.0e-10), although my > Mat-functions PETSC_SchurMatMult and PETSC_SchurMatMultTranspose seem to be > correct. I tested the latter comparing the vectors y1 and y2 computed by, > e.g., MatMult(S,x,y1) and PETS_SchurMatMult(S,x,y2). Norm(y1-y2) was < e-15 > for both functions. Not sure what the problem could be. Can you confirm that the code is valgrind clean? It could explain the issue 1 aswell. With mpich2 you can do the following on linux: mpiexec -np 2 valgrind --tool=memcheck ./executable Satish > > > Could you please have a look at my code snippet below? > > Thank you very much! > Kathrin > > > > PS: My Code: > > Vec gtot, x; > ... > Mat Stot; IS is; > ISCreateGeneral(PETSC_COMM_SELF, NPb, &uBId_global[0], &is); > localData ctx; > ctx.NPb = NPb; //size of local Schur system S > ctx.Sloc = &S[0]; > ctx.is = is; > MatCreateShell(PETSC_COMM_WORLD,m,n,NPb_tot,NPb_tot,&ctx,&Stot); > MatShellSetOperation(Stot,MATOP_MULT,(void(*)(void)) PETSC_SchurMatMult); > MatShellSetOperation(Stot,MATOP_MULT_TRANSPOSE,(void(*)(void))PETSC_SchurMatMultTranspose); > KSP ksp; > KSPCreate(PETSC_COMM_WORLD,&ksp); > PC prec; > KSPSetOperators(ksp,Stot,Stot,DIFFERENT_NONZERO_PATTERN); > KSPGetPC(ksp,&prec); > PCSetType(prec, PCNONE); > KSPSetType(ksp, KSPBICG); > KSPSetTolerances(ksp, 1.e-10, 1.e-50,PETSC_DEFAULT,PETSC_DEFAULT); > KSPSolve(ksp,gtot,x); > ... > > > From bsmith at mcs.anl.gov Mon Apr 21 11:43:10 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Mon, 21 Apr 2008 11:43:10 -0500 Subject: Schur system + MatShell In-Reply-To: <20080421125455.sieixiv1j40w4kg8@email.ee.ethz.ch> References: <20080421125455.sieixiv1j40w4kg8@email.ee.ethz.ch> Message-ID: <92B048D2-BCC4-4463-9EE0-9353959C6B11@mcs.anl.gov> On Apr 21, 2008, at 5:54 AM, tribur at vision.ee.ethz.ch wrote: > Dear all, > > Sorry for switching from Schur to Hypre and back, but I'm trying two > approaches at the same time to find the optimal solution for our > convection-diffusion/Stokes problems: a) solving the global > stiffness matrix directly and in parallel using Petsc and a suitable > preconditioner (???) and b) applying first non-overlapping domain > decomposition and than solving the Schur complement system. > > Being concerned with b in the moment, I managed to set up and solve > the global Schur system using MATDENSE. The solving works well with, > e.g., gmres+jacobi, but the assembling of the global Schur matrix > takes too long. 
Even if GMRES+Jacobi works reasonably well, GMRES without Jacobi can be much worse, this is a danger of matrix free without some kind of preconditioner. > Therefore, I'm trying to use the matrix in unassembled form using > MatShell. Not very successfully, however: > > 1) When I use KSPGMRES, I got the error > [1]PETSC ERROR: MatMult() line 1632 in src/mat/interface/matrix.c > [1]PETSC ERROR: PCApplyBAorAB() line 584 in src/ksp/pc/interface/ > precon.c > [1]PETSC ERROR: GMREScycle() line 159 in src/ksp/ksp/impls/gmres/ > gmres.c > [1]PETSC ERROR: KSPSolve_GMRES() line 241 in src/ksp/ksp/impls/gmres/ > gmres.c > [1]PETSC ERROR: KSPSolve() line 379 in src/ksp/ksp/interface/itfunc.c > > 2) Using KSPBICG, it iterates without error message, but the result > is wrong (norm of residual 1.42768 instead of something like > 1.0e-10), although my Mat-functions PETSC_SchurMatMult and > PETSC_SchurMatMultTranspose seem to be correct. I tested the latter > comparing the vectors y1 and y2 computed by, e.g., MatMult(S,x,y1) > and PETS_SchurMatMult(S,x,y2). Norm(y1-y2) was < e-15 for both > functions. Run it with -ksp_monitor_true_residual (-ksp_truemonitor) for PETSc pre 2.3.3) and -ksp_converged_reason to see what is happening. Note that KSPSolve() does NOT generate an error if it fails to converge, you need to check with KSPGetConvergedReason() or -ksp_converged_reason after the solve to see if KSP thinks it has converged or why it did not converge. Barry > > > > Could you please have a look at my code snippet below? > > Thank you very much! > Kathrin > > > > PS: My Code: > > Vec gtot, x; > ... > Mat Stot; IS is; > ISCreateGeneral(PETSC_COMM_SELF, NPb, &uBId_global[0], &is); > localData ctx; > ctx.NPb = NPb; //size of local Schur system S > ctx.Sloc = &S[0]; > ctx.is = is; > MatCreateShell(PETSC_COMM_WORLD,m,n,NPb_tot,NPb_tot,&ctx,&Stot); > MatShellSetOperation(Stot,MATOP_MULT,(void(*)(void)) > PETSC_SchurMatMult); MatShellSetOperation(Stot,MATOP_MULT_TRANSPOSE, > (void(*)(void))PETSC_SchurMatMultTranspose); > KSP ksp; > KSPCreate(PETSC_COMM_WORLD,&ksp); > PC prec; > KSPSetOperators(ksp,Stot,Stot,DIFFERENT_NONZERO_PATTERN); > KSPGetPC(ksp,&prec); > PCSetType(prec, PCNONE); > KSPSetType(ksp, KSPBICG); > KSPSetTolerances(ksp, 1.e-10, 1.e-50,PETSC_DEFAULT,PETSC_DEFAULT); > KSPSolve(ksp,gtot,x); > ... > > From bsmith at mcs.anl.gov Mon Apr 21 11:47:26 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Mon, 21 Apr 2008 11:47:26 -0500 Subject: flexible block matrix In-Reply-To: <20080421145330.GA1994@brakk.ethz.ch> References: <20080421145330.GA1994@brakk.ethz.ch> Message-ID: <553D810D-E87C-4084-A9C4-F4F82B690D6B@mcs.anl.gov> I concur with Satish, AIJ with inodes is essentially variable block size so trying to force BAIJ when it is not appropriate is unnecessary. Barry On Apr 21, 2008, at 9:53 AM, Jed Brown wrote: > I am solving a Stokes problem with nonlinear slip boundary > conditions. I don't > think I can take advantage of block structure since the normal > component of > velocity has a Dirichlet constraint and this must be built into the > velocity > space in order to preserve conditioning. An alternative formulation > involves a > Lagrange multiplier for the constraint, but even with clever > preconditioning, > this system is still more expensive to solve according to [1]. > > In solving the (velocity-pressure) saddle point problem, many > approximate solves > with the velocity system is needed in the preconditioner, hence I > need a strong > preconditioner for the velocity system. 
Currently, I am using > algebraic > multigrid on a low-order discretization which works fairly well. > Since Hypre > and ML only take AIJ matrices, perhaps I shouldn't worry about > blocking after > all. Is there a way to use MATBAIJ when some nodes have fewer > degrees of > freedom? Should I bother? > > Note that my method (currently just a single element) uses a high > order > discretization on some elements and low order on others. The global > matrix for > the low order elements is assembled, but it is applied locally for > the high order > elements taking advantage of the tensor product basis. For the > preconditioner, > a low order discretization on the nodes of the high order elements > is globally > assembled and added to the global matrix from the low-order elements. > Experiments with a single element (spectral rather than spectral/hp > element) > show this to be effective, converging in a constant number of > iterations > independent of polynomial order when using a V-cycle of AMG as a > preconditioner. > > Thanks. > > Jed > > > [1] B?nsch, H?hn 2000, `Numerical treatment of the Navier-Stokes > equations with > slip boundary conditions', SIAM J. Sci. Comput. > From amjad11 at gmail.com Tue Apr 22 00:45:42 2008 From: amjad11 at gmail.com (amjad ali) Date: Tue, 22 Apr 2008 10:45:42 +0500 Subject: general question on speed using quad core Xeons In-Reply-To: References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> <428810f20804202234x3dc79246j88e3afccff7bec5b@mail.gmail.com> Message-ID: <428810f20804212245y27fab8bfh336aa5a26ff98528@mail.gmail.com> Hello Dr. Satish, Thanks for your intellectual reply. > > The othe CPU side - its balanced by FSB1600 => > Bandwidth = 1600MHz * 8(bytes bus)* 2(CPU-chips) = 25.6GByte/se > > So generally all the 3 things you've listed has to *match* correctly. > [Some CPUs and chipsets support multiple FSB frequencies - so have to > check what freq is set for the machine you are buying.] Currently I am making a gigabit ethernet cluster of 4 compute nodes (totaling 8 cores), with each node having One Processor: Intel Core2Duo E6750 2.66 GHz Processor, FSB 1333MHz, 4MB L2. Motherboard: Intel Entry Server Board Intel S3200SHV with intel 3200 Chipset supporting 1333/1066/800 MHz FSB . RAM: 2GB DDR2 800MHz ECC System Memory. What memory-bandwidth/CPU-core will be there for this system? Any other comment/remark? My area work deals in sparse matrices. I near future I would like to add 12 similar compute nodes in the cluster. On such a cluster what if I relapce "C2D 2.66 GHz FSB1333 processor" with "Intel Xeon 3070/3075 2.66 GHz FSB1066/1333 processor"? Would there be any significant improvement in performance? with best regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tribur at vision.ee.ethz.ch Tue Apr 22 07:06:12 2008 From: tribur at vision.ee.ethz.ch (tribur at vision.ee.ethz.ch) Date: Tue, 22 Apr 2008 14:06:12 +0200 Subject: Schur system + MatShell Message-ID: <20080422140612.f924l740asc0g8sk@email.ee.ethz.ch> Dear Satish, dear Barry, dear rest, Thank you for your response. >> Being concerned with b in the moment, I managed to set up and solve >> the global >> Schur system using MATDENSE. The solving works well with, e.g., >> gmres+jacobi, >> but the assembling of the global Schur matrix takes too long. 
> > Hmm - with dense - if you have some other efficient way of assembling > the matrix - you can specify this directly to MatCreateMPIDense() - [or > use MatGetArray() - and set the values directly into this array] I don't see an alternative, as the partitioning of PETSc has nothing to do with my partitioning (unstructured mesh, partitioned with Metis). Moreover, in case of 2 Subdomains, e.g., the local Schur complements S1 and S2 have the same size as the global one, S=S1+S2, and there is no matrix format in PETSc supporting this, isn't it? >> 2) Using KSPBICG, it iterates without error message, but the result is wrong >> (norm of residual 1.42768 instead of something like 1.0e-10), although my >> Mat-functions PETSC_SchurMatMult and PETSC_SchurMatMultTranspose seem to be >> correct. I tested the latter comparing the vectors y1 and y2 computed by, >> e.g., MatMult(S,x,y1) and PETS_SchurMatMult(S,x,y2). Norm(y1-y2) was < e-15 >> for both functions. > > Not sure what the problem could be. Can you confirm that the code is > valgrind clean? It could explain the issue 1 aswell. Valgrind didn't find an error in my PETSC_SchurMatMult, but PETSc gave me now also an error message when running with KSPBICG (same MatShell-code as in my previous e-mail): [1]PETSC ERROR: MatMult() line 1632 in src/mat/interface/matrix.c [1]PETSC ERROR: KSPSolve_BiCG() line 95 in src/ksp/ksp/impls/bicg/bicg.c [1]PETSC ERROR: KSPSolve() line 379 in src/ksp/ksp/interface/itfunc.c The error seems to occurr at the second call of PETSC_SchurMatMult. I attached the related source files (petsc version petsc-2.3.3-p8, downloaded about 4 months ago), and below you find additionally my MatMult-function PETSC_SchurMatMult (maybe there is problem with ctx?). I'm stuck and I'll be very grateful for any help, Kathrin PS My user defined MatMult-function: typedef struct { int NPb; IS is; //int *uBId_global; double *Sloc; } localData; void PETSC_SchurMatMult(Mat Stot, Vec xtot, Vec ytot){ localData * ctx; MatShellGetContext(Stot, (void**) &ctx); int NPb = ctx->NPb; IS is = ctx->is; double *Sloc = ctx->Sloc; //extracting local vector xloc Vec xloc; VecCreateSeq(PETSC_COMM_SELF, NPb, &xloc); VecScatter ctx2; VecScatterCreate(xtot,is,xloc,PETSC_NULL, &ctx2); VecScatterBegin(ctx2,xtot,xloc,INSERT_VALUES,SCATTER_FORWARD); VecScatterEnd(ctx2,xtot,xloc,INSERT_VALUES,SCATTER_FORWARD); VecScatterDestroy(ctx2); //local matrix multiplication vector yloc_array(NPb,0); PetscScalar *xloc_array; VecGetArray(xloc, &xloc_array); for(int k=0; k References: <20080422140612.f924l740asc0g8sk@email.ee.ethz.ch> Message-ID: On Apr 22, 2008, at 7:06 AM, tribur at vision.ee.ethz.ch wrote: > Dear Satish, dear Barry, dear rest, > > Thank you for your response. > >>> Being concerned with b in the moment, I managed to set up and >>> solve the global >>> Schur system using MATDENSE. The solving works well with, e.g., >>> gmres+jacobi, >>> but the assembling of the global Schur matrix takes too long. >> >> Hmm - with dense - if you have some other efficient way of assembling >> the matrix - you can specify this directly to MatCreateMPIDense() - >> [or >> use MatGetArray() - and set the values directly into this array] > > I don't see an alternative, as the partitioning of PETSc has nothing > to do with my partitioning (unstructured mesh, partitioned with > Metis). Moreover, in case of 2 Subdomains, e.g., the local Schur > complements S1 and S2 have the same size as the global one, S=S1+S2, > and there is no matrix format in PETSc supporting this, isn't it? 
> > >>> 2) Using KSPBICG, it iterates without error message, but the >>> result is wrong >>> (norm of residual 1.42768 instead of something like 1.0e-10), >>> although my >>> Mat-functions PETSC_SchurMatMult and PETSC_SchurMatMultTranspose >>> seem to be >>> correct. I tested the latter comparing the vectors y1 and y2 >>> computed by, >>> e.g., MatMult(S,x,y1) and PETS_SchurMatMult(S,x,y2). Norm(y1-y2) >>> was < e-15 >>> for both functions. >> >> Not sure what the problem could be. Can you confirm that the code is >> valgrind clean? It could explain the issue 1 aswell. > > Valgrind didn't find an error in my PETSC_SchurMatMult, but PETSc > gave me now also an error message when running with KSPBICG (same > MatShell-code as in my previous e-mail): > > [1]PETSC ERROR: MatMult() line 1632 in src/mat/interface/matrix.c > [1]PETSC ERROR: KSPSolve_BiCG() line 95 in src/ksp/ksp/impls/bicg/ > bicg.c > [1]PETSC ERROR: KSPSolve() line 379 in src/ksp/ksp/interface/itfunc.c > What is the error message? This just tells you a problem in MatMult(). You need to send EVERYTHING that was printed when the program stopped with an error. Send it to petsc-maint at mcs.anl.gov Barry > The error seems to occurr at the second call of PETSC_SchurMatMult. > I attached the related source files (petsc version petsc-2.3.3-p8, > downloaded about 4 months ago), and below you find additionally my > MatMult-function PETSC_SchurMatMult (maybe there is problem with > ctx?). > > I'm stuck and I'll be very grateful for any help, > Kathrin > > > PS My user defined MatMult-function: > > typedef struct { > int NPb; > IS is; //int *uBId_global; > double *Sloc; > } localData; > > > void PETSC_SchurMatMult(Mat Stot, Vec xtot, Vec ytot){ > localData * ctx; > MatShellGetContext(Stot, (void**) &ctx); > int NPb = ctx->NPb; IS is = ctx->is; > double *Sloc = ctx->Sloc; > > //extracting local vector xloc > Vec xloc; > VecCreateSeq(PETSC_COMM_SELF, NPb, &xloc); > VecScatter ctx2; > VecScatterCreate(xtot,is,xloc,PETSC_NULL, &ctx2); > VecScatterBegin(ctx2,xtot,xloc,INSERT_VALUES,SCATTER_FORWARD); > VecScatterEnd(ctx2,xtot,xloc,INSERT_VALUES,SCATTER_FORWARD); > VecScatterDestroy(ctx2); > > //local matrix multiplication > vector yloc_array(NPb,0); > PetscScalar *xloc_array; > VecGetArray(xloc, &xloc_array); > for(int k=0; k for(int l=0; l yloc_array[k] += Sloc[k*NPb+l] * xloc_array[l]; > VecRestoreArray(xloc, &xloc_array); > VecDestroy(xloc); > > //scatter yloc to ytot > Vec yloc; > VecCreateSeqWithArray(PETSC_COMM_SELF, NPb, PETSC_NULL, &yloc); > VecPlaceArray(yloc,&yloc_array[0]); > VecScatter ctx3; > VecScatterCreate(yloc, PETSC_NULL, ytot, is, &ctx3); > VecScatterBegin(ctx3, yloc, ytot, ADD_VALUES, SCATTER_FORWARD); > VecScatterEnd(ctx3, yloc, ytot, ADD_VALUES, SCATTER_FORWARD); > VecScatterDestroy(ctx3); > VecDestroy(yloc); > > } > > > > > > > > From knepley at gmail.com Tue Apr 22 07:16:23 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 22 Apr 2008 07:16:23 -0500 Subject: Schur system + MatShell In-Reply-To: References: <20080422140612.f924l740asc0g8sk@email.ee.ethz.ch> Message-ID: On Tue, Apr 22, 2008 at 7:11 AM, Barry Smith wrote: > > On Apr 22, 2008, at 7:06 AM, tribur at vision.ee.ethz.ch wrote: > > > > Dear Satish, dear Barry, dear rest, > > > > Thank you for your response. > > > > > > > > > > > Being concerned with b in the moment, I managed to set up and solve > the global > > > > Schur system using MATDENSE. 
The solving works well with, e.g., > gmres+jacobi, > > > > but the assembling of the global Schur matrix takes too long. > > > > > > > > > > Hmm - with dense - if you have some other efficient way of assembling > > > the matrix - you can specify this directly to MatCreateMPIDense() - [or > > > use MatGetArray() - and set the values directly into this array] > > > > > > > I don't see an alternative, as the partitioning of PETSc has nothing to do > with my partitioning (unstructured mesh, partitioned with Metis). Moreover, > in case of 2 Subdomains, e.g., the local Schur complements S1 and S2 have > the same size as the global one, S=S1+S2, and there is no matrix format in > PETSc supporting this, isn't it? This does not make sense to me. You decide how PETSc partitions things (if you want), And, I really do not understand what you want in parallel. If you mean that you solve the local Schur complements independently, then use a local matrix for each one. The important thing is to work out the linear algebra prior to coding. Then wrapping it with PETSc Mat/Vec is easy. Matt > > > > > > > > > > > > > 2) Using KSPBICG, it iterates without error message, but the result is > wrong > > > > (norm of residual 1.42768 instead of something like 1.0e-10), although > my > > > > Mat-functions PETSC_SchurMatMult and PETSC_SchurMatMultTranspose seem > to be > > > > correct. I tested the latter comparing the vectors y1 and y2 computed > by, > > > > e.g., MatMult(S,x,y1) and PETS_SchurMatMult(S,x,y2). Norm(y1-y2) was < > e-15 > > > > for both functions. > > > > > > > > > > Not sure what the problem could be. Can you confirm that the code is > > > valgrind clean? It could explain the issue 1 aswell. > > > > > > > Valgrind didn't find an error in my PETSC_SchurMatMult, but PETSc gave me > now also an error message when running with KSPBICG (same MatShell-code as > in my previous e-mail): > > > > [1]PETSC ERROR: MatMult() line 1632 in src/mat/interface/matrix.c > > [1]PETSC ERROR: KSPSolve_BiCG() line 95 in src/ksp/ksp/impls/bicg/bicg.c > > [1]PETSC ERROR: KSPSolve() line 379 in src/ksp/ksp/interface/itfunc.c > > > > > > What is the error message? This just tells you a problem in MatMult(). You > need to send EVERYTHING that was > printed when the program stopped with an error. Send it to > petsc-maint at mcs.anl.gov > > > Barry > > > > > > The error seems to occurr at the second call of PETSC_SchurMatMult. > > I attached the related source files (petsc version petsc-2.3.3-p8, > downloaded about 4 months ago), and below you find additionally my > MatMult-function PETSC_SchurMatMult (maybe there is problem with ctx?). 
> > > > I'm stuck and I'll be very grateful for any help, > > Kathrin > > > > > > PS My user defined MatMult-function: > > > > typedef struct { > > int NPb; > > IS is; //int *uBId_global; > > double *Sloc; > > } localData; > > > > > > void PETSC_SchurMatMult(Mat Stot, Vec xtot, Vec ytot){ > > localData * ctx; > > MatShellGetContext(Stot, (void**) &ctx); > > int NPb = ctx->NPb; IS is = ctx->is; > > double *Sloc = ctx->Sloc; > > > > //extracting local vector xloc > > Vec xloc; > > VecCreateSeq(PETSC_COMM_SELF, NPb, &xloc); > > VecScatter ctx2; > > VecScatterCreate(xtot,is,xloc,PETSC_NULL, &ctx2); > > VecScatterBegin(ctx2,xtot,xloc,INSERT_VALUES,SCATTER_FORWARD); > > VecScatterEnd(ctx2,xtot,xloc,INSERT_VALUES,SCATTER_FORWARD); > > VecScatterDestroy(ctx2); > > > > //local matrix multiplication > > vector yloc_array(NPb,0); > > PetscScalar *xloc_array; > > VecGetArray(xloc, &xloc_array); > > for(int k=0; k > for(int l=0; l > yloc_array[k] += Sloc[k*NPb+l] * xloc_array[l]; > > VecRestoreArray(xloc, &xloc_array); > > VecDestroy(xloc); > > > > //scatter yloc to ytot > > Vec yloc; > > VecCreateSeqWithArray(PETSC_COMM_SELF, NPb, PETSC_NULL, &yloc); > > VecPlaceArray(yloc,&yloc_array[0]); > > VecScatter ctx3; > > VecScatterCreate(yloc, PETSC_NULL, ytot, is, &ctx3); > > VecScatterBegin(ctx3, yloc, ytot, ADD_VALUES, SCATTER_FORWARD); > > VecScatterEnd(ctx3, yloc, ytot, ADD_VALUES, SCATTER_FORWARD); > > VecScatterDestroy(ctx3); > > VecDestroy(yloc); > > > > } > > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From amjad11 at gmail.com Tue Apr 22 08:43:29 2008 From: amjad11 at gmail.com (amjad ali) Date: Tue, 22 Apr 2008 14:43:29 +0100 Subject: Selection between C2D and Xeon 3000 for PETSc Sparse solvers Message-ID: <428810f20804220643r618753dayb3cae42b9f92b7e7@mail.gmail.com> Hello, Please help me out in selecting any one choice of the following: (Currently I am making a gigabit ethernet cluster of 4 compute nodes (totaling 8 cores), with each node having) (Choice 1) One Processor: Intel Core2Duo E6750 2.66 GHz Processor, FSB 1333MHz, 4MB L2. Motherboard: Intel Entry Server Board Intel S3200SHV with intel 3200 Chipset supporting 1333/1066/800 MHz FSB . RAM: 2GB DDR2 800MHz ECC System Memory. (Choice 2) One Processor: Intel Xeon 3075 2.66 GHz FSB1333 4MBL2. Motherboard: Intel Entry Server Board Intel S3200SHV with intel 3200 Chipset supporting 1333/1066/800 MHz FSB . RAM: 2GB DDR2 800MHz ECC System Memory. Which one system has larger memory-bandwidth/CPU-core? Any other comment/remark? My area work deals in sparse matrices. I near future I would like to add 12 similar compute nodes in the cluster. with best regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From petsc-maint at mcs.anl.gov Tue Apr 22 09:08:22 2008 From: petsc-maint at mcs.anl.gov (Satish Balay) Date: Tue, 22 Apr 2008 09:08:22 -0500 (CDT) Subject: general question on speed using quad core Xeons In-Reply-To: <428810f20804212245y27fab8bfh336aa5a26ff98528@mail.gmail.com> References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> <428810f20804202234x3dc79246j88e3afccff7bec5b@mail.gmail.com> <428810f20804212245y27fab8bfh336aa5a26ff98528@mail.gmail.com> Message-ID: On Tue, 22 Apr 2008, amjad ali wrote: > > The othe CPU side - its balanced by FSB1600 => > > Bandwidth = 1600MHz * 8(bytes bus)* 2(CPU-chips) = 25.6GByte/se > > > > So generally all the 3 things you've listed has to *match* correctly. > > [Some CPUs and chipsets support multiple FSB frequencies - so have to > > check what freq is set for the machine you are buying.] > > Currently I am making a gigabit ethernet cluster of 4 compute nodes > (totaling 8 cores), with each node having > One Processor: Intel Core2Duo E6750 2.66 GHz Processor, FSB 1333MHz, 4MB L2. > Motherboard: Intel Entry Server Board Intel S3200SHV with intel 3200 Chipset > supporting 1333/1066/800 MHz FSB . > RAM: 2GB DDR2 800MHz ECC System Memory. > > What memory-bandwidth/CPU-core will be there for this system? > Any other comment/remark? > My area work deals in sparse matrices. > I near future I would like to add 12 similar compute nodes in the cluster. http://www.intel.com/cd/products/services/emea/eng/chipsets/374398.htm It says 12.8 GB/s for DDR2-800. I think the CPU with 1333 => 10.7GB/s It would be unbalanced - and I don't know how this will affect things.. [Perhaps it will perform better than DDR2-677 RAM] > On such a cluster what if I relapce "C2D 2.66 GHz FSB1333 processor" with > "Intel Xeon 3070/3075 2.66 GHz FSB1066/1333 processor"? Would there be any > significant improvement in performance? I doubt it will make a difference. But this is unproven speculation. Satish From Amit.Itagi at seagate.com Tue Apr 22 09:14:51 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Tue, 22 Apr 2008 10:14:51 -0400 Subject: Multiple versions of PetSc Message-ID: Hi, I have a naive question. I have a program that uses a C++, complex version of PetSc. I need to run a second program that uses a C, real version of PetSc. For the first program, I have defined the PETSC_ARCH and PETSC_DIR variables in my .tcshrc . In order to get the second program working, do I need to install a second version of PetSc ? How do I separate the environment variables ? Thanks Rgds, Amit From tribur at vision.ee.ethz.ch Tue Apr 22 09:25:59 2008 From: tribur at vision.ee.ethz.ch (tribur at vision.ee.ethz.ch) Date: Tue, 22 Apr 2008 16:25:59 +0200 Subject: Schur system + MatShell In-Reply-To: References: <20080422140612.f924l740asc0g8sk@email.ee.ethz.ch> Message-ID: <20080422162559.w8uw27kw2044koo8@email.ee.ethz.ch> Dear Matt, > This does not make sense to me. You decide how PETSc partitions things (if > you want), And, I really do not understand what you want in parallel. > If you mean > that you solve the local Schur complements independently, then use a local > matrix for each one. The important thing is to work out the linear > algebra prior > to coding. Then wrapping it with PETSc Mat/Vec is easy. The linear algebra is completely clear. 
Again: I have the local Schur systems given (and NOT the solution of the local Schur systems), and I would like to solve the global Schur complement system in parallel. The global Schur complement system is theoretically constructed by putting and adding elements of the local systems in certain locations of a global matrix. Wrapping this with PETSc Mat/Vec, without the time-intensive assembling, is not easy for me as a PETSc-beginner. But I'm curious of the solution you propose... From knepley at gmail.com Tue Apr 22 09:37:07 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 22 Apr 2008 10:37:07 -0400 Subject: Schur system + MatShell In-Reply-To: <20080422162559.w8uw27kw2044koo8@email.ee.ethz.ch> References: <20080422140612.f924l740asc0g8sk@email.ee.ethz.ch> <20080422162559.w8uw27kw2044koo8@email.ee.ethz.ch> Message-ID: On 4/22/08, tribur at vision.ee.ethz.ch wrote: > Dear Matt, > > > This does not make sense to me. You decide how PETSc partitions things (if > > you want), And, I really do not understand what you want in parallel. > > If you mean > > that you solve the local Schur complements independently, then use a local > > matrix for each one. The important thing is to work out the linear algebra > prior > > to coding. Then wrapping it with PETSc Mat/Vec is easy. > > > > The linear algebra is completely clear. Again: I have the local Schur > systems given (and NOT the solution of the local Schur systems), and I would > like to solve the global Schur complement system in parallel. The global > Schur complement system is theoretically constructed by putting and adding > elements of the local systems in certain locations of a global matrix. > Wrapping this with PETSc Mat/Vec, without the time-intensive assembling, is > not easy for me as a PETSc-beginner. But I'm curious of the solution you > propose... Did you verify that the Schur complement matrix was properly preallocated before assembly? This is the likely source of time. You can run with -info and search for "malloc" in the output. Matt -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From balay at mcs.anl.gov Tue Apr 22 09:41:38 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Tue, 22 Apr 2008 09:41:38 -0500 (CDT) Subject: Multiple versions of PetSc In-Reply-To: References: Message-ID: On Tue, 22 Apr 2008, Amit.Itagi at seagate.com wrote: > > Hi, > > I have a naive question. I have a program that uses a C++, complex version > of PetSc. I need to run a second program that uses a C, real version of > PetSc. For the first program, I have defined the PETSC_ARCH and PETSC_DIR > variables in my .tcshrc . In order to get the second program working, do I > need to install a second version of PetSc ? How do I separate the > environment variables ? You would just install with a different PETSC_ARCH value. Now at compile time - you can use the correct PETSC_ARCH value with make. for eg: ./config/configure.py PETSC_ARCH=linux-complex --with-clanguage=cxx --with-scalar-type=complex make PETSC_ARCH=linux-complex all test make PETSC_ARCH=linux-complex mycode ./config/configure.py PETSC_ARCH=linux-real make PETSC_ARCH=linux-real all test make PETSC_ARCH=linux-real mycode You can set a default PETSC_ARCH in your .cshrc - but to use the other build - you change it at command-line to make [as indicated above] Note: both version can coexist in the same PETSC_DIR. 
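For reference, a user makefile that cooperates with this scheme normally just
pulls in PETSc's common make rules and leaves PETSC_ARCH unset, so whatever
value is given on the make command line (or in the environment) selects the
build. A minimal sketch, assuming the petsc-2.3.3-style layout; the include
path and the example target name "mycode" are illustrative only and may differ
(newer trees have moved these files, e.g. under conf/):

# PETSC_DIR and PETSC_ARCH are taken from the environment or the make command line
include ${PETSC_DIR}/bmake/common/base

mycode: mycode.o chkopts
	-${CLINKER} -o mycode mycode.o ${PETSC_LIB}
	${RM} mycode.o

With something like this in place, "make PETSC_ARCH=linux-complex mycode" and
"make PETSC_ARCH=linux-real mycode" link against the corresponding build
without editing the makefile.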
Satish From knepley at gmail.com Tue Apr 22 09:42:51 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 22 Apr 2008 10:42:51 -0400 Subject: Multiple versions of PetSc In-Reply-To: References: Message-ID: To build a different configuration of PETSc: 1) cd $PETSC_DIR 2) configure with new options, including --PETSC_ARCH= 3) make PETS_ARCH= 4) Build your code with PETSC_ARCH= Matt On 4/22/08, Amit.Itagi at seagate.com wrote: > > Hi, > > I have a naive question. I have a program that uses a C++, complex version > of PetSc. I need to run a second program that uses a C, real version of > PetSc. For the first program, I have defined the PETSC_ARCH and PETSC_DIR > variables in my .tcshrc . In order to get the second program working, do I > need to install a second version of PetSc ? How do I separate the > environment variables ? > > Thanks > > Rgds, > Amit > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From balay at mcs.anl.gov Tue Apr 22 09:52:07 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Tue, 22 Apr 2008 09:52:07 -0500 (CDT) Subject: Schur system + MatShell In-Reply-To: References: <20080422140612.f924l740asc0g8sk@email.ee.ethz.ch> <20080422162559.w8uw27kw2044koo8@email.ee.ethz.ch> Message-ID: On Tue, 22 Apr 2008, Matthew Knepley wrote: > On 4/22/08, tribur at vision.ee.ethz.ch wrote: > > Dear Matt, > > > > > This does not make sense to me. You decide how PETSc partitions things (if > > > you want), And, I really do not understand what you want in parallel. > > > If you mean > > > that you solve the local Schur complements independently, then use a local > > > matrix for each one. The important thing is to work out the linear algebra > > prior > > > to coding. Then wrapping it with PETSc Mat/Vec is easy. > > > > > > > The linear algebra is completely clear. Again: I have the local Schur > > systems given (and NOT the solution of the local Schur systems), and I would > > like to solve the global Schur complement system in parallel. The global > > Schur complement system is theoretically constructed by putting and adding > > elements of the local systems in certain locations of a global matrix. > > Wrapping this with PETSc Mat/Vec, without the time-intensive assembling, is > > not easy for me as a PETSc-beginner. But I'm curious of the solution you > > propose... > > Did you verify that the Schur complement matrix was properly preallocated before > assembly? This is the likely source of time. You can run with -info and search > for "malloc" in the output. Isn't this using MATDENSE? If that the case - then I think the problem is due to wrong partitioning - causing communiation during MatAssembly(). -info should clearly show the communication part aswell. The fix would be to specify the local partition sizes for this matrix - and not use PETSC_DECIDE. Satish From Amit.Itagi at seagate.com Tue Apr 22 10:06:34 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Tue, 22 Apr 2008 11:06:34 -0400 Subject: Multiple versions of PetSc In-Reply-To: Message-ID: Thanks, Satish and Matt. 
Rgds, Amit "Matthew Knepley" To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: Multiple versions of PetSc 04/22/2008 10:42 AM Please respond to petsc-users at mcs.a nl.gov To build a different configuration of PETSc: 1) cd $PETSC_DIR 2) configure with new options, including --PETSC_ARCH= 3) make PETS_ARCH= 4) Build your code with PETSC_ARCH= Matt On 4/22/08, Amit.Itagi at seagate.com wrote: > > Hi, > > I have a naive question. I have a program that uses a C++, complex version > of PetSc. I need to run a second program that uses a C, real version of > PetSc. For the first program, I have defined the PETSC_ARCH and PETSC_DIR > variables in my .tcshrc . In order to get the second program working, do I > need to install a second version of PetSc ? How do I separate the > environment variables ? > > Thanks > > Rgds, > Amit > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From petsc-maint at mcs.anl.gov Tue Apr 22 10:25:06 2008 From: petsc-maint at mcs.anl.gov (Satish Balay) Date: Tue, 22 Apr 2008 10:25:06 -0500 (CDT) Subject: Selection between C2D and Xeon 3000 for PETSc Sparse solvers In-Reply-To: <428810f20804220643r618753dayb3cae42b9f92b7e7@mail.gmail.com> References: <428810f20804220643r618753dayb3cae42b9f92b7e7@mail.gmail.com> Message-ID: On Tue, 22 Apr 2008, amjad ali wrote: > Hello, > > Please help me out in selecting any one choice of the following: > (Currently I am making a gigabit ethernet cluster of 4 compute nodes > (totaling 8 cores), with each node having) > > (Choice 1) > One Processor: Intel Core2Duo E6750 2.66 GHz Processor, FSB 1333MHz, 4MB L2. > Motherboard: Intel Entry Server Board Intel S3200SHV with intel 3200 Chipset > supporting 1333/1066/800 MHz FSB . > RAM: 2GB DDR2 800MHz ECC System Memory. > > (Choice 2) > One Processor: Intel Xeon 3075 2.66 GHz FSB1333 4MBL2. > Motherboard: Intel Entry Server Board Intel S3200SHV with intel 3200 > Chipset supporting 1333/1066/800 MHz FSB . > RAM: 2GB DDR2 800MHz ECC System Memory. > > Which one system has larger memory-bandwidth/CPU-core? > Any other comment/remark? > My area work deals in sparse matrices. > I near future I would like to add 12 similar compute nodes in the cluster. Based on the above numbers - the memory bandwidth numbers should be the same. And I expect the performance to be the same in both cases. Ideally you would have access to both machines [perhaps from the vendor] - and run streams benchmark on each - to see if there is any difference. Satish From recrusader at gmail.com Tue Apr 22 15:16:32 2008 From: recrusader at gmail.com (Yujie) Date: Tue, 22 Apr 2008 13:16:32 -0700 Subject: about MatMult() Message-ID: <7ff0ee010804221316oa73a9c2s101b225fc3b760bf@mail.gmail.com> the following is about MatMult() in manual. " The parallel matrix can multiply a vector with n local entries, returning a vector with m local entries. That is, to form the product MatMult(Mat A,Vec x,Vec y); the vectors x and y should be generated with VecCreateMPI(MPI Comm comm,n,N,&x); VecCreateMPI(MPI Comm comm,m,M,&y); " I am wondering whether I must create Vector "y" before I call MatMult() regardless of parrellel and sequentail modes? thanks a lot. Regards, Yujie -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bsmith at mcs.anl.gov Tue Apr 22 15:48:47 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 22 Apr 2008 15:48:47 -0500 Subject: about MatMult() In-Reply-To: <7ff0ee010804221316oa73a9c2s101b225fc3b760bf@mail.gmail.com> References: <7ff0ee010804221316oa73a9c2s101b225fc3b760bf@mail.gmail.com> Message-ID: <9940D583-C343-4879-AD0A-7DFD1C73C489@mcs.anl.gov> On Apr 22, 2008, at 3:16 PM, Yujie wrote: > the following is about MatMult() in manual. > " > The parallel matrix can multiply a vector with n local entries, > returning a vector with m local entries. > That is, to form the product > MatMult(Mat A,Vec x,Vec y); > the vectors x and y should be generated with > VecCreateMPI(MPI Comm comm,n,N,&x); > VecCreateMPI(MPI Comm comm,m,M,&y); > " > I am wondering whether I must create Vector "y" before I call > MatMult() regardless of parrellel and sequentail modes? y is the location where the product of A*x is stored; if it is not created before the call to MatMult() the program will crash (or more likely generate an error message). Barry > > thanks a lot. > > Regards, > Yujie From Amit.Itagi at seagate.com Tue Apr 22 20:45:03 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Tue, 22 Apr 2008 21:45:03 -0400 Subject: Multilevel solver Message-ID: Hi, I am trying to implement a multilevel method for an EM problem. The reference is : "Comparison of hierarchical basis functions for efficient multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, IET Sci. Meas. Technol. 2007, 1(1), pp 48-52. Here is the summary: The matrix equation Ax=b is solved using GMRES with a multilevel pre-conditioner. A has a block structure. A11 A12 * x1 = b1 A21 A22 x2 b2 A11 is mxm and A33 is nxn, where m is not equal to n. Step 1 : Solve A11 * e1 = b1 (parallel LU using superLU or MUMPS) Step 2: Solve A22 * e2 =b2-A21*e1 (might either user a SOR solver or a parallel LU) Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) This gives the approximate solution to A11 A12 * e1 = b1 A21 A22 e2 b2 and is used as the pre-conditioner for the GMRES. Which PetSc method can implement this pre-conditioner ? I tried a PCSHELL type PC. With Hong's help, I also got the parallel LU to work withSuperLU/MUMPS. My program runs successfully on multiple processes on a single machine. But when I submit the program over multiple machines, I get a crash in the PCApply routine after several GMRES iterations. I think this has to do with using PCSHELL with GMRES (which is not a good idea). Is there a different way to implement this ? Does this resemble the usage pattern of one of the AMG preconditioners ? Thanks Rgds, Amit From bsmith at mcs.anl.gov Tue Apr 22 21:08:04 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 22 Apr 2008 21:08:04 -0500 Subject: Multilevel solver In-Reply-To: References: Message-ID: Amit, Using a a PCSHELL should be fine (it can be used with GMRES), my guess is there is a memory corruption error somewhere that is causing the crash. 
This could be tracked down with www.valgrind.com Another way to you could implement this is with some very recent additions I made to PCFIELDSPLIT that are in petsc-dev (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) With this you would chose PCSetType(pc,PCFIELDSPLIT PCFieldSplitSetIS(pc,is1 PCFieldSplitSetIS(pc,is2 PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE to use LU on A11 use the command line options -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly and SOR on A22 -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - fieldsplit_1_pc_sor_lits where is the number of iterations you want to use block A22 is1 is the IS that contains the indices for all the vector entries in the 1 block while is2 is all indices in the vector for the 2 block. You can use ISCreateGeneral() to create these. Probably it is easiest just to try this out. Barry On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > Hi, > > I am trying to implement a multilevel method for an EM problem. The > reference is : "Comparison of hierarchical basis functions for > efficient > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > IET > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > Here is the summary: > > The matrix equation Ax=b is solved using GMRES with a multilevel > pre-conditioner. A has a block structure. > > A11 A12 * x1 = b1 > A21 A22 x2 b2 > > A11 is mxm and A33 is nxn, where m is not equal to n. > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > superLU or > MUMPS) > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > a SOR > solver or a parallel LU) > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > This gives the approximate solution to > > A11 A12 * e1 = b1 > A21 A22 e2 b2 > > and is used as the pre-conditioner for the GMRES. > > > Which PetSc method can implement this pre-conditioner ? I tried a > PCSHELL > type PC. With Hong's help, I also got the parallel LU to work > withSuperLU/MUMPS. My program runs successfully on multiple > processes on a > single machine. But when I submit the program over multiple > machines, I get > a crash in the PCApply routine after several GMRES iterations. I > think this > has to do with using PCSHELL with GMRES (which is not a good idea). Is > there a different way to implement this ? Does this resemble the usage > pattern of one of the AMG preconditioners ? > > > Thanks > > Rgds, > Amit > From berend at chalmers.se Wed Apr 23 06:30:36 2008 From: berend at chalmers.se (Berend van Wachem) Date: Wed, 23 Apr 2008 13:30:36 +0200 Subject: valgrind error Message-ID: <480F1DDC.2040000@chalmers.se> Dear Petsc-Team, My program based upon PETSc seems to work fine, but I get a long list of errors with valgrind, see below. Does anyone have an idea what is going wrong? 
==19756== Conditional jump or move depends on uninitialised value(s) ==19756== at 0x83EDC11: MatLUFactorNumeric_SeqAIJ (aijfact.c:529) ==19756== by 0x8376CEB: MatLUFactorNumeric (matrix.c:2227) ==19756== by 0x826EFFE: PCSetUp_ILU (ilu.c:564) ==19756== by 0x82EAAF2: PCSetUp (precon.c:787) ==19756== by 0x8283262: KSPSetUp (itfunc.c:234) ==19756== by 0x8283F63: KSPSolve (itfunc.c:347) ==19756== by 0x80A7235: SolveMatrix (solvematrix.c:61) ==19756== by 0x814F120: main (main.c:280) ==19756== ==19756== Conditional jump or move depends on uninitialised value(s) ==19756== at 0x83EDC5F: MatLUFactorNumeric_SeqAIJ (aijfact.c:529) ==19756== by 0x8376CEB: MatLUFactorNumeric (matrix.c:2227) ==19756== by 0x826EFFE: PCSetUp_ILU (ilu.c:564) ==19756== by 0x82EAAF2: PCSetUp (precon.c:787) ==19756== by 0x8283262: KSPSetUp (itfunc.c:234) ==19756== by 0x8283F63: KSPSolve (itfunc.c:347) ==19756== by 0x80A7235: SolveMatrix (solvematrix.c:61) ==19756== by 0x814F120: main (main.c:280) ==19756== ==19756== Conditional jump or move depends on uninitialised value(s) ==19756== at 0x83ED8E0: MatLUFactorNumeric_SeqAIJ (aijfact.c:523) ==19756== by 0x8376CEB: MatLUFactorNumeric (matrix.c:2227) ==19756== by 0x826EFFE: PCSetUp_ILU (ilu.c:564) ==19756== by 0x82EAAF2: PCSetUp (precon.c:787) ==19756== by 0x8283262: KSPSetUp (itfunc.c:234) ==19756== by 0x8283F63: KSPSolve (itfunc.c:347) ==19756== by 0x80A7235: SolveMatrix (solvematrix.c:61) ==19756== by 0x814F120: main (main.c:280) ==19756== ==19756== Conditional jump or move depends on uninitialised value(s) ==19756== at 0x83ED674: MatLUFactorNumeric_SeqAIJ (aijfact.c:504) ==19756== by 0x8376CEB: MatLUFactorNumeric (matrix.c:2227) ==19756== by 0x826EFFE: PCSetUp_ILU (ilu.c:564) ==19756== by 0x82EAAF2: PCSetUp (precon.c:787) ==19756== by 0x8283262: KSPSetUp (itfunc.c:234) ==19756== by 0x8283F63: KSPSolve (itfunc.c:347) ==19756== by 0x80A7235: SolveMatrix (solvematrix.c:61) ==19756== by 0x814F120: main (main.c:280) ==19756== ==19756== Conditional jump or move depends on uninitialised value(s) ==19756== at 0x88D8214: dnrm2_ (dnrm2.f:58) ==19756== by 0x8695C91: VecNorm_MPI (pvec2.c:79) ==19756== by 0x866B95F: VecNorm (rvector.c:162) ==19756== by 0x829C1B6: KSPSolve_BCGS (bcgs.c:45) ==19756== by 0x8284523: KSPSolve (itfunc.c:379) ==19756== by 0x80A7235: SolveMatrix (solvematrix.c:61) ==19756== by 0x814F120: main (main.c:280) From Amit.Itagi at seagate.com Wed Apr 23 08:11:21 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 23 Apr 2008 09:11:21 -0400 Subject: Multilevel solver In-Reply-To: Message-ID: Barry, This looks interesting. I will give it a shot. Thanks Rgds, Amit Barry Smith To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: Multilevel solver 04/22/2008 10:08 PM Please respond to petsc-users at mcs.a nl.gov Amit, Using a a PCSHELL should be fine (it can be used with GMRES), my guess is there is a memory corruption error somewhere that is causing the crash. 
This could be tracked down with www.valgrind.com Another way to you could implement this is with some very recent additions I made to PCFIELDSPLIT that are in petsc-dev (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) With this you would chose PCSetType(pc,PCFIELDSPLIT PCFieldSplitSetIS(pc,is1 PCFieldSplitSetIS(pc,is2 PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE to use LU on A11 use the command line options -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly and SOR on A22 -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - fieldsplit_1_pc_sor_lits where is the number of iterations you want to use block A22 is1 is the IS that contains the indices for all the vector entries in the 1 block while is2 is all indices in the vector for the 2 block. You can use ISCreateGeneral() to create these. Probably it is easiest just to try this out. Barry On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > Hi, > > I am trying to implement a multilevel method for an EM problem. The > reference is : "Comparison of hierarchical basis functions for > efficient > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > IET > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > Here is the summary: > > The matrix equation Ax=b is solved using GMRES with a multilevel > pre-conditioner. A has a block structure. > > A11 A12 * x1 = b1 > A21 A22 x2 b2 > > A11 is mxm and A33 is nxn, where m is not equal to n. > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > superLU or > MUMPS) > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > a SOR > solver or a parallel LU) > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > This gives the approximate solution to > > A11 A12 * e1 = b1 > A21 A22 e2 b2 > > and is used as the pre-conditioner for the GMRES. > > > Which PetSc method can implement this pre-conditioner ? I tried a > PCSHELL > type PC. With Hong's help, I also got the parallel LU to work > withSuperLU/MUMPS. My program runs successfully on multiple > processes on a > single machine. But when I submit the program over multiple > machines, I get > a crash in the PCApply routine after several GMRES iterations. I > think this > has to do with using PCSHELL with GMRES (which is not a good idea). Is > there a different way to implement this ? Does this resemble the usage > pattern of one of the AMG preconditioners ? > > > Thanks > > Rgds, > Amit > From petsc-maint at mcs.anl.gov Wed Apr 23 08:54:32 2008 From: petsc-maint at mcs.anl.gov (Satish Balay) Date: Wed, 23 Apr 2008 08:54:32 -0500 (CDT) Subject: [PETSC #17608] Re: general question on speed using quad core Xeons In-Reply-To: References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> <428810f20804202234x3dc79246j88e3afccff7bec5b@mail.gmail.com> Message-ID: On Mon, 21 Apr 2008, Satish Balay wrote: > For eg - On intel Xeon machine with DDR2-800 - you have [othe memory bus side]: > bandwidth = 2(banks)* 2(ddr)* 8(bytes bus) * 800 MHz/sec * = 25.6GByte/sec My math was incorrect here.. DDR2-800 = 6.4Gb/s [its 2(ddr)* 400MHz/sec * 8bytes ] So this machine has 4 memory banks. i.e the above is: > bandwidth = 4(banks)* 2(ddr)* 400 MHz/sec* 8(bytes bus) * = 25.6GByte/sec Satish From Amit.Itagi at seagate.com Wed Apr 23 09:07:23 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 23 Apr 2008 10:07:23 -0400 Subject: Multilevel solver Message-ID: An HTML attachment was scrubbed... 
URL: From knepley at gmail.com Wed Apr 23 09:23:09 2008 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 23 Apr 2008 09:23:09 -0500 Subject: Multilevel solver In-Reply-To: References: Message-ID: On Wed, Apr 23, 2008 at 9:07 AM, wrote: > Barry, > > This is what valgrind gives me. Any idea ? What is confusing me is that I > get the crash after several GMRES iterations. 1) Always start with the simplest case, meaning serial 2) When you run valgrind in parallel, you need --trace-children=yes, since MPI usually spawns other processes 3) It is possible to corrupt memory so badly that valgrind crashes like this, but it is hard. Matt > [2]PETSC ERROR: > ------------------------------------------------------------------------ > [3]PETSC ERROR: Caught signal number 15 Terminate: Somet process (or the > batch system) has told this process to end > [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger > ------------------------------------------------------------------------ > [3]PETSC ERROR: or see > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[3]PETSC > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > find memory corruption errors > [2]PETSC ERROR: Caught signal number 1 Hang up: Some other process (or the > batch system) has told this process to end > [2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger > [2]PETSC ERROR: or see > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[2]PETSC > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > find memory corruption errors > [0]PETSC ERROR: > ------------------------------------------------------------------------ > [0]PETSC ERROR: Caught signal number 15 Terminate: Somet process (or the > batch system) has told this process to end > [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger > [0]PETSC ERROR: or see > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[0]PETSC > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > find memory corruption errors > [3]PETSC ERROR: likely location of problem given in stack below > [3]PETSC ERROR: --------------------- Stack Frames > ------------------------------------ > ------------------------------------------------------------------------ > [1]PETSC ERROR: [2]PETSC ERROR: Caught signal number 15 Terminate: Somet > process (or the batch system) has told this process to end > likely location of problem given in stack below > [1]PETSC ERROR: [2]PETSC ERROR: Try option -start_in_debugger or > -on_error_attach_debugger > --------------------- Stack Frames ------------------------------------ > [1]PETSC ERROR: or see > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[1]PETSC > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > find memory corruption errors > [0]PETSC ERROR: likely location of problem given in stack below > [0]PETSC ERROR: --------------------- Stack Frames > ------------------------------------ > [1]PETSC ERROR: likely location of problem given in stack below > [1]PETSC ERROR: --------------------- Stack Frames > ------------------------------------ > [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available, > [3]PETSC ERROR: INSTEAD the line number of the start of the function > [3]PETSC ERROR: is given. 
> [2]PETSC ERROR: Note: The EXACT line numbers in the stack are not available, > [3]PETSC ERROR: [2]PETSC ERROR: INSTEAD the line number of the start > of the function > [2]PETSC ERROR: is given. > [3] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > [3]PETSC ERROR: [3] PCApply line 346 src/ksp/pc/interface/precon.c > [3]PETSC ERROR: [3] PCApplyBAorAB line 539 src/ksp/pc/interface/precon.c > [3]PETSC ERROR: [3] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > [2]PETSC ERROR: [0]PETSC ERROR: Note: The EXACT line numbers in the stack > are not available, > [2] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > [0]PETSC ERROR: INSTEAD the line number of the start of the function > [2]PETSC ERROR: [2] PCApply line 346 src/ksp/pc/interface/precon.c > [0]PETSC ERROR: is given. > [2]PETSC ERROR: [2] PCApplyBAorAB line 539 src/ksp/pc/interface/precon.c > [2]PETSC ERROR: [2] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > [1]PETSC ERROR: [0]PETSC ERROR: Note: The EXACT line numbers in the stack > are not available, > [0] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > [1]PETSC ERROR: INSTEAD the line number of the start of the function > [0]PETSC ERROR: [0] PCApply line 346 src/ksp/pc/interface/precon.c > [0]PETSC ERROR: [1]PETSC ERROR: is given. > [0] PCApplyBAorAB line 539 src/ksp/pc/interface/precon.c > [0]PETSC ERROR: [0] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > [1]PETSC ERROR: [1] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > [1]PETSC ERROR: [1] PCApply line 346 src/ksp/pc/interface/precon.c > [1]PETSC ERROR: [3]PETSC ERROR: [1] PCApplyBAorAB line 539 > src/ksp/pc/interface/precon.c > --------------------- Error Message ------------------------------------ > [1]PETSC ERROR: [1] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > [2]PETSC ERROR: --------------------- Error Message > ------------------------------------ > [0]PETSC ERROR: --------------------- Error Message > ------------------------------------ > [3]PETSC ERROR: Signal received! > [3]PETSC ERROR: > ------------------------------------------------------------------------ > [3]PETSC ERROR: Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 > CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b > [3]PETSC ERROR: See docs/changes/index.html for recent updates. > [3]PETSC ERROR: See docs/faq.html for hints about trouble shooting. > > Thanks > > Rgds, > Amit > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From balay at mcs.anl.gov Wed Apr 23 09:40:39 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 23 Apr 2008 09:40:39 -0500 (CDT) Subject: Multilevel solver In-Reply-To: References: Message-ID: If using valgrind - I sugest using MPICH2 [installed with options --enable-g=meminit --enable-fast] And valgrind can be invoked with: mpiexec -np 2 valgrind --tool=memcheck -q ./executable -exectuable-options Satish On Wed, 23 Apr 2008, Matthew Knepley wrote: > On Wed, Apr 23, 2008 at 9:07 AM, wrote: > > Barry, > > > > This is what valgrind gives me. Any idea ? What is confusing me is that I > > get the crash after several GMRES iterations. > > 1) Always start with the simplest case, meaning serial > > 2) When you run valgrind in parallel, you need --trace-children=yes, since > MPI usually spawns other processes > > 3) It is possible to corrupt memory so badly that valgrind crashes > like this, but it is hard. 
> > Matt > > > [2]PETSC ERROR: > > ------------------------------------------------------------------------ > > [3]PETSC ERROR: Caught signal number 15 Terminate: Somet process (or the > > batch system) has told this process to end > > [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger > > ------------------------------------------------------------------------ > > [3]PETSC ERROR: or see > > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[3]PETSC > > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > > find memory corruption errors > > [2]PETSC ERROR: Caught signal number 1 Hang up: Some other process (or the > > batch system) has told this process to end > > [2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger > > [2]PETSC ERROR: or see > > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[2]PETSC > > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > > find memory corruption errors > > [0]PETSC ERROR: > > ------------------------------------------------------------------------ > > [0]PETSC ERROR: Caught signal number 15 Terminate: Somet process (or the > > batch system) has told this process to end > > [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger > > [0]PETSC ERROR: or see > > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[0]PETSC > > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > > find memory corruption errors > > [3]PETSC ERROR: likely location of problem given in stack below > > [3]PETSC ERROR: --------------------- Stack Frames > > ------------------------------------ > > ------------------------------------------------------------------------ > > [1]PETSC ERROR: [2]PETSC ERROR: Caught signal number 15 Terminate: Somet > > process (or the batch system) has told this process to end > > likely location of problem given in stack below > > [1]PETSC ERROR: [2]PETSC ERROR: Try option -start_in_debugger or > > -on_error_attach_debugger > > --------------------- Stack Frames ------------------------------------ > > [1]PETSC ERROR: or see > > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[1]PETSC > > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > > find memory corruption errors > > [0]PETSC ERROR: likely location of problem given in stack below > > [0]PETSC ERROR: --------------------- Stack Frames > > ------------------------------------ > > [1]PETSC ERROR: likely location of problem given in stack below > > [1]PETSC ERROR: --------------------- Stack Frames > > ------------------------------------ > > [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available, > > [3]PETSC ERROR: INSTEAD the line number of the start of the function > > [3]PETSC ERROR: is given. > > [2]PETSC ERROR: Note: The EXACT line numbers in the stack are not available, > > [3]PETSC ERROR: [2]PETSC ERROR: INSTEAD the line number of the start > > of the function > > [2]PETSC ERROR: is given. 
> > [3] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > > [3]PETSC ERROR: [3] PCApply line 346 src/ksp/pc/interface/precon.c > > [3]PETSC ERROR: [3] PCApplyBAorAB line 539 src/ksp/pc/interface/precon.c > > [3]PETSC ERROR: [3] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > > [2]PETSC ERROR: [0]PETSC ERROR: Note: The EXACT line numbers in the stack > > are not available, > > [2] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > > [0]PETSC ERROR: INSTEAD the line number of the start of the function > > [2]PETSC ERROR: [2] PCApply line 346 src/ksp/pc/interface/precon.c > > [0]PETSC ERROR: is given. > > [2]PETSC ERROR: [2] PCApplyBAorAB line 539 src/ksp/pc/interface/precon.c > > [2]PETSC ERROR: [2] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > > [1]PETSC ERROR: [0]PETSC ERROR: Note: The EXACT line numbers in the stack > > are not available, > > [0] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > > [1]PETSC ERROR: INSTEAD the line number of the start of the function > > [0]PETSC ERROR: [0] PCApply line 346 src/ksp/pc/interface/precon.c > > [0]PETSC ERROR: [1]PETSC ERROR: is given. > > [0] PCApplyBAorAB line 539 src/ksp/pc/interface/precon.c > > [0]PETSC ERROR: [0] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > > [1]PETSC ERROR: [1] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > > [1]PETSC ERROR: [1] PCApply line 346 src/ksp/pc/interface/precon.c > > [1]PETSC ERROR: [3]PETSC ERROR: [1] PCApplyBAorAB line 539 > > src/ksp/pc/interface/precon.c > > --------------------- Error Message ------------------------------------ > > [1]PETSC ERROR: [1] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > > [2]PETSC ERROR: --------------------- Error Message > > ------------------------------------ > > [0]PETSC ERROR: --------------------- Error Message > > ------------------------------------ > > [3]PETSC ERROR: Signal received! > > [3]PETSC ERROR: > > ------------------------------------------------------------------------ > > [3]PETSC ERROR: Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 > > CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b > > [3]PETSC ERROR: See docs/changes/index.html for recent updates. > > [3]PETSC ERROR: See docs/faq.html for hints about trouble shooting. > > > > Thanks > > > > Rgds, > > Amit > > > > > > From Amit.Itagi at seagate.com Wed Apr 23 13:32:11 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 23 Apr 2008 14:32:11 -0400 Subject: Multilevel solver In-Reply-To: Message-ID: Barry, Is the installation of petsc-dev different from the installation of the 2.3.3 release ? I ran the config. But the folder tree seems to be different. Hence, make is giving problems. Amit Barry Smith To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: Multilevel solver 04/22/2008 10:08 PM Please respond to petsc-users at mcs.a nl.gov Amit, Using a a PCSHELL should be fine (it can be used with GMRES), my guess is there is a memory corruption error somewhere that is causing the crash. 
This could be tracked down with www.valgrind.com Another way to you could implement this is with some very recent additions I made to PCFIELDSPLIT that are in petsc-dev (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) With this you would chose PCSetType(pc,PCFIELDSPLIT PCFieldSplitSetIS(pc,is1 PCFieldSplitSetIS(pc,is2 PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE to use LU on A11 use the command line options -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly and SOR on A22 -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - fieldsplit_1_pc_sor_lits where is the number of iterations you want to use block A22 is1 is the IS that contains the indices for all the vector entries in the 1 block while is2 is all indices in the vector for the 2 block. You can use ISCreateGeneral() to create these. Probably it is easiest just to try this out. Barry On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > Hi, > > I am trying to implement a multilevel method for an EM problem. The > reference is : "Comparison of hierarchical basis functions for > efficient > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > IET > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > Here is the summary: > > The matrix equation Ax=b is solved using GMRES with a multilevel > pre-conditioner. A has a block structure. > > A11 A12 * x1 = b1 > A21 A22 x2 b2 > > A11 is mxm and A33 is nxn, where m is not equal to n. > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > superLU or > MUMPS) > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > a SOR > solver or a parallel LU) > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > This gives the approximate solution to > > A11 A12 * e1 = b1 > A21 A22 e2 b2 > > and is used as the pre-conditioner for the GMRES. > > > Which PetSc method can implement this pre-conditioner ? I tried a > PCSHELL > type PC. With Hong's help, I also got the parallel LU to work > withSuperLU/MUMPS. My program runs successfully on multiple > processes on a > single machine. But when I submit the program over multiple > machines, I get > a crash in the PCApply routine after several GMRES iterations. I > think this > has to do with using PCSHELL with GMRES (which is not a good idea). Is > there a different way to implement this ? Does this resemble the usage > pattern of one of the AMG preconditioners ? > > > Thanks > > Rgds, > Amit > From knepley at gmail.com Wed Apr 23 13:43:11 2008 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 23 Apr 2008 13:43:11 -0500 Subject: Multilevel solver In-Reply-To: References: Message-ID: On Wed, Apr 23, 2008 at 1:32 PM, wrote: > Barry, > > Is the installation of petsc-dev different from the installation of the > 2.3.3 release ? I ran the config. But the folder tree seems to be > different. Hence, make is giving problems. 1) Always always send the error log. I cannot tell anything from the description "problems". 2) Some things have moved, but of course, make will work with the new organization. Matt > Amit > > Barry Smith > ov> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: Multilevel solver > > > 04/22/2008 10:08 > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > Amit, > > Using a a PCSHELL should be fine (it can be used with GMRES), > my guess is there is a memory corruption error somewhere that is > causing the crash. 
This could be tracked down with www.valgrind.com > > Another way to you could implement this is with some very recent > additions I made to PCFIELDSPLIT that are in petsc-dev > (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) > With this you would chose > PCSetType(pc,PCFIELDSPLIT > PCFieldSplitSetIS(pc,is1 > PCFieldSplitSetIS(pc,is2 > PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > to use LU on A11 use the command line options > -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly > and SOR on A22 > -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - > fieldsplit_1_pc_sor_lits where > is the number of iterations you want to use block A22 > > is1 is the IS that contains the indices for all the vector entries in > the 1 block while is2 is all indices in the > vector for the 2 block. You can use ISCreateGeneral() to create these. > > Probably it is easiest just to try this out. > > Barry > > > On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > > > > Hi, > > > > I am trying to implement a multilevel method for an EM problem. The > > reference is : "Comparison of hierarchical basis functions for > > efficient > > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > > IET > > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > > > Here is the summary: > > > > The matrix equation Ax=b is solved using GMRES with a multilevel > > pre-conditioner. A has a block structure. > > > > A11 A12 * x1 = b1 > > A21 A22 x2 b2 > > > > A11 is mxm and A33 is nxn, where m is not equal to n. > > > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > > superLU or > > MUMPS) > > > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > > a SOR > > solver or a parallel LU) > > > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > > > This gives the approximate solution to > > > > A11 A12 * e1 = b1 > > A21 A22 e2 b2 > > > > and is used as the pre-conditioner for the GMRES. > > > > > > Which PetSc method can implement this pre-conditioner ? I tried a > > PCSHELL > > type PC. With Hong's help, I also got the parallel LU to work > > withSuperLU/MUMPS. My program runs successfully on multiple > > processes on a > > single machine. But when I submit the program over multiple > > machines, I get > > a crash in the PCApply routine after several GMRES iterations. I > > think this > > has to do with using PCSHELL with GMRES (which is not a good idea). Is > > there a different way to implement this ? Does this resemble the usage > > pattern of one of the AMG preconditioners ? > > > > > > Thanks > > > > Rgds, > > Amit > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From Amit.Itagi at seagate.com Wed Apr 23 15:05:04 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 23 Apr 2008 16:05:04 -0400 Subject: Multilevel solver In-Reply-To: Message-ID: Here is my make log. ========================================== See documentation/faq.html and documentation/bugreporting.html for help with installation problems. 
Please send EVERYTHING printed out below when reporting problems To subscribe to the PETSc announcement list, send mail to majordomo at mcs.anl.gov with the message: subscribe petsc-announce To subscribe to the PETSc users mailing list, send mail to majordomo at mcs.anl.gov with the message: subscribe petsc-users ========================================== On Wed Apr 23 15:37:17 EDT 2008 on tabla Machine characteristics: Linux tabla 2.6.22-14-generic #1 SMP Tue Feb 12 07:42:25 UTC 2008 i686 GNU/Linux ----------------------------------------- Using PETSc directory: /home/amit/programs/ParEM/petsc-dev Using PETSc arch: linux-gnu-c-debug ----------------------------------------- PETSC_VERSION_RELEASE 0 PETSC_VERSION_MAJOR 2 PETSC_VERSION_MINOR 3 PETSC_VERSION_SUBMINOR 3 PETSC_VERSION_PATCH 12 PETSC_VERSION_DATE "May, 23, 2007" PETSC_VERSION_PATCH_DATE "unknown" PETSC_VERSION_HG "unknown" ----------------------------------------- Using configure Options: --PETSC_ARCH=linux-gnu-c-debug --with-scalar-type=complex --with-debugging=yes --with-clanguage=cxx --with-mpi=1 --download-mpich=1 --with-metis=1 --download-metis=1 --with-parmetis=1 --download-parmetis=1 --with-superlu_dist=1 --download-superlu_dist=1 --with-mumps=1 --download-blacs=1 --download-scalapack=1 --download-mumps=1 COPTFLAGS="-O3 -march=p4 -mtune=p4 -ffast-math -malign-double -funroll-loops -pipe -fomit-frame-pointer -finline-functions -msse2" CXXOPTFLAGS="-O3 -march=p4 -mtune=p4 -ffast-math -malign-double -funroll-loops -pipe -fomit-frame-pointer -finline-functions -msse2" --with-shared=0 Using configuration flags: #define INCLUDED_PETSCCONF_H #define IS_COLORING_MAX 65535 #define STDC_HEADERS 1 #define MPIU_COLORING_VALUE MPI_UNSIGNED_SHORT #define PETSC_HAVE_SUPERLU_DIST 1 #define PETSC_STATIC_INLINE static inline #define PETSC_HAVE_BLACS 1 #define PETSC_HAVE_MUMPS 1 #define PETSC_DIR_SEPARATOR '/' #define PETSC_HAVE_BLASLAPACK 1 #define PETSC_PATH_SEPARATOR ':' #define PETSC_REPLACE_DIR_SEPARATOR '\\' #define PETSC_HAVE_DOUBLE_ALIGN_MALLOC 1 #define PETSC_RESTRICT __restrict__ #define PETSC_HAVE_X11 1 #define PETSC_HAVE_SOWING 1 #define PETSC_HAVE_SCALAPACK 1 #define PETSC_HAVE_MPI 1 #define PETSC_USE_SOCKET_VIEWER 1 #define PETSC_HAVE_PARMETIS 1 #define PETSC_HAVE_C2HTML 1 #define PETSC_HAVE_FORTRAN 1 #define PETSC_HAVE_STRING_H 1 #define PETSC_HAVE_SYS_TYPES_H 1 #define PETSC_HAVE_ENDIAN_H 1 #define PETSC_HAVE_SYS_PROCFS_H 1 #define PETSC_HAVE_LINUX_KERNEL_H 1 #define PETSC_HAVE_TIME_H 1 #define PETSC_HAVE_MATH_H 1 #define PETSC_HAVE_STDLIB_H 1 #define PETSC_HAVE_SYS_PARAM_H 1 #define PETSC_HAVE_SYS_SOCKET_H 1 #define PETSC_HAVE_UNISTD_H 1 #define PETSC_HAVE_SYS_WAIT_H 1 #define PETSC_HAVE_LIMITS_H 1 #define PETSC_HAVE_SEARCH_H 1 #define PETSC_HAVE_NETINET_IN_H 1 #define PETSC_HAVE_FLOAT_H 1 #define PETSC_HAVE_SYS_SYSINFO_H 1 #define PETSC_HAVE_SYS_RESOURCE_H 1 #define PETSC_HAVE_SYS_TIMES_H 1 #define PETSC_HAVE_NETDB_H 1 #define PETSC_HAVE_MALLOC_H 1 #define PETSC_HAVE_PWD_H 1 #define PETSC_HAVE_FCNTL_H 1 #define PETSC_HAVE_STRINGS_H 1 #define PETSC_HAVE_MEMORY_H 1 #define PETSC_TIME_WITH_SYS_TIME 1 #define PETSC_HAVE_SYS_TIME_H 1 #define PETSC_HAVE_SYS_UTSNAME_H 1 #define PETSC_USING_F90 1 #define PETSC_PRINTF_FORMAT_CHECK(A,B) __attribute__((format (printf, A, B))) #define PETSC_C_STATIC_INLINE static inline #define PETSC_HAVE_FORTRAN_UNDERSCORE 1 #define PETSC_HAVE_CXX_NAMESPACE 1 #define PETSC_C_RESTRICT __restrict__ #define PETSC_USE_F90_SRC_IMPL 1 #define PETSC_CXX_RESTRICT __restrict__ #define PETSC_CXX_STATIC_INLINE 
static inline #define PETSC_HAVE_LIBBLAS 1 #define PETSC_HAVE_LIBDMUMPS 1 #define PETSC_HAVE_LIBZMUMPS 1 #define PETSC_HAVE_LIBSCALAPACK 1 #define PETSC_HAVE_LIBM 1 #define PETSC_HAVE_LIBMETIS 1 #define PETSC_HAVE_LIBLAPACK 1 #define PETSC_HAVE_LIBCMUMPS 1 #define PETSC_HAVE_LIBSMUMPS 1 #define PETSC_HAVE_LIBGCC_S 1 #define PETSC_HAVE_LIBPORD 1 #define PETSC_HAVE_LIBGFORTRANBEGIN 1 #define PETSC_HAVE_ERF 1 #define PETSC_HAVE_LIBSUPERLU_DIST_2 1 #define PETSC_HAVE_LIBBLACS 1 #define PETSC_HAVE_LIBPARMETIS 1 #define PETSC_HAVE_LIBGFORTRAN 1 #define PETSC_ARCH_NAME "linux-gnu-c-debug" #define PETSC_ARCH linux #define PETSC_DIR /home/amit/programs/ParEM/petsc-dev #define PETSC_CLANGUAGE_CXX 1 #define PETSC_USE_ERRORCHECKING 1 #define PETSC_MISSING_DREAL 1 #define PETSC_SIZEOF_MPI_COMM 4 #define PETSC_BITS_PER_BYTE 8 #define PETSC_SIZEOF_MPI_FINT 4 #define PETSC_SIZEOF_VOID_P 4 #define PETSC_RETSIGTYPE void #define PETSC_HAVE_CXX_COMPLEX 1 #define PETSC_SIZEOF_LONG 4 #define PETSC_USE_FORTRANKIND 1 #define PETSC_SIZEOF_SIZE_T 4 #define PETSC_SIZEOF_CHAR 1 #define PETSC_SIZEOF_DOUBLE 8 #define PETSC_SIZEOF_FLOAT 4 #define PETSC_HAVE_C99_COMPLEX 1 #define PETSC_SIZEOF_INT 4 #define PETSC_SIZEOF_LONG_LONG 8 #define PETSC_SIZEOF_SHORT 2 #define PETSC_HAVE_STRCASECMP 1 #define PETSC_HAVE_ISNAN 1 #define PETSC_HAVE_POPEN 1 #define PETSC_HAVE_SIGSET 1 #define PETSC_HAVE_GETWD 1 #define PETSC_HAVE_TIMES 1 #define PETSC_HAVE_SNPRINTF 1 #define PETSC_HAVE_GETPWUID 1 #define PETSC_HAVE_ISINF 1 #define PETSC_HAVE_GETHOSTBYNAME 1 #define PETSC_HAVE_SLEEP 1 #define PETSC_HAVE_FORK 1 #define PETSC_HAVE_RAND 1 #define PETSC_HAVE_GETTIMEOFDAY 1 #define PETSC_HAVE_UNAME 1 #define PETSC_HAVE_GETHOSTNAME 1 #define PETSC_HAVE_MKSTEMP 1 #define PETSC_HAVE_SIGACTION 1 #define PETSC_HAVE_DRAND48 1 #define PETSC_HAVE_VA_COPY 1 #define PETSC_HAVE_CLOCK 1 #define PETSC_HAVE_ACCESS 1 #define PETSC_HAVE_SIGNAL 1 #define PETSC_HAVE_GETRUSAGE 1 #define PETSC_HAVE_MEMALIGN 1 #define PETSC_HAVE_GETDOMAINNAME 1 #define PETSC_HAVE_TIME 1 #define PETSC_HAVE_LSEEK 1 #define PETSC_HAVE_SOCKET 1 #define PETSC_HAVE_SYSINFO 1 #define PETSC_HAVE_READLINK 1 #define PETSC_HAVE_REALPATH 1 #define PETSC_HAVE_MEMMOVE 1 #define PETSC_HAVE__GFORTRAN_IARGC 1 #define PETSC_SIGNAL_CAST #define PETSC_HAVE_GETCWD 1 #define PETSC_HAVE_VPRINTF 1 #define PETSC_HAVE_BZERO 1 #define PETSC_HAVE_GETPAGESIZE 1 #define PETSC_USE_COMPLEX 1 #define PETSC_USE_GDB_DEBUGGER 1 #define PETSC_HAVE_GFORTRAN_IARGC 1 #define PETSC_USE_DEBUG 1 #define PETSC_USE_INFO 1 #define PETSC_USE_LOG 1 #define PETSC_IS_COLOR_VALUE_TYPE short #define PETSC_USE_CTABLE 1 #define PETSC_USE_PROC_FOR_SIZE 1 #define PETSC_HAVE_MPI_COMM_C2F 1 #define PETSC_HAVE_MPI_COMM_F2C 1 #define PETSC_HAVE_MPI_FINT 1 #define PETSC_HAVE_MPI_F90MODULE 1 #define PETSC_HAVE_MPI_ALLTOALLW 1 #define PETSC_HAVE_MPI_COMM_SPAWN 1 #define PETSC_HAVE_MPI_WIN_CREATE 1 #define PETSC_HAVE_MPI_FINALIZED 1 #define HAVE_GZIP 1 #define PETSC_BLASLAPACK_UNDERSCORE 1 ----------------------------------------- Using include paths: -I/home/amit/programs/ParEM/petsc-dev -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include -I/home/amit/programs/ParEM/petsc-dev/include -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include ------------------------------------------ Using C/C++ compiler: 
/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/bin/mpicxx C/C++ Compiler version: Using Fortran compiler: gfortran -g Fortran Compiler version: ----------------------------------------- Using C/C++ linker: Using Fortran linker: /home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/bin/mpif90 ----------------------------------------- Using libraries: -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -lX11 -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lsuperlu_dist_2.2 -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lparmetis -lmetis -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lcmumps -ldmumps -lsmumps -lzmumps -lpord -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lscalapack -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lblacs -llapack -lblas -L/usr/lib/gcc/i486-linux-gnu/4.1.3 -L/lib -lgcc_s -lgfortranbegin -lgfortran -lm -L/usr/lib/gcc/i486-linux-gnu/4.2.1 -lm -lstdc++ -lstdc++ -lgcc_s ------------------------------------------ Using mpiexec: /home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/bin/mpiexec ========================================== /bin/rm -f -f /home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib/libpetsc*.* BEGINNING TO COMPILE LIBRARIES IN ALL DIRECTORIES ========================================= libfast in: /home/amit/programs/ParEM/petsc-dev/src libfast in: /home/amit/programs/ParEM/petsc-dev/src/inline libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/socket libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/socket/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/ascii libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/ascii/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/ascii/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/string libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/string/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/draw libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/draw/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/draw/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/vu libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/examples/tutorials libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/sys/viewer/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls/x libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls/x/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls/ps libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/utils/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/error libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/error/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/error/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/dll libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/dll/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/fileio libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/fileio/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/fileio/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/memory libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/memory/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/memory/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/objects libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/objects/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/objects/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/time libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/time/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/utils/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/impls/rand libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/impls/rand48 libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/bag 
libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/bag/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/bag/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/verbose libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/verbose/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/verbose/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-src libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-src/fsrc make[8]: *** No rule to make target `libf'. Stop. libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-mod make[7]: *** No rule to make target `petscmod.o'. Stop. make[6]: *** [buildmod] Error 2 make[5]: [libfast] Error 2 (ignored) libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/utils/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/seq libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/seq/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/seq/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/mpi libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/mpi/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/mpi/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/shared libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/shared/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/general libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/general/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/stride libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/stride/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/block libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/block/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/block/ftn-custom libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/impls/constant libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/impls/string libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/seq libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/seq/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/mpi libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/mpi/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/mpi/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/csrperm libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/crl libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/superlu_dist libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/mumps libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/csrperm libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/crl libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/aij libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/shell libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/shell/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/shell/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/seq libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/seq/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/seq/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/mpi libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/mpi/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/mpi/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/adj libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/adj/mpi libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/adj/mpi/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/maij libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/is libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/is/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/seq libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/seq/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/seq/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/mpi libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/mpi/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/normal libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/normal/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/lrc libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/lrc/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/scatter libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/scatter/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/blockmat libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/blockmat/seq libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/composite libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/composite/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/mffd libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/mffd/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/mffd/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/utils/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/matfd libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/matfd/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/matfd/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/impls/pmetis libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/impls/pmetis/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/order libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/order/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/order/ftn-custom libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/mat/color libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/color/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/jacobi libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/jacobi/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/none libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/sor libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/sor/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/shell libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/shell/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/bjacobi libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/bjacobi/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/bjacobi/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/dmmg libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/dmmg/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/eisens libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/eisens/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asm libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asm/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asm/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/ksp libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/ksp/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/composite libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/composite/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/composite/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/redundant libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/redundant/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/is libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/is/nn libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/pbjacobi libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mat libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/fieldsplit libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/fieldsplit/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/lu libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/lu/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/lu/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/ilu libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/ilu/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/icc libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/cholesky libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/cholesky/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/galerkin libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/galerkin/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/openmp libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asa libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/cp libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface iguess.c: In function ???PetscErrorCode KSPFischerGuessFormGuess_Method1(KSPFischerGuess_Method1*, _p_Vec*, _p_Vec*)???: iguess.c:79: warning: cannot pass objects of non-POD type ???struct std::complex??? through ???...???; call will abort at runtime iguess.c: In function ???PetscErrorCode KSPFischerGuessFormGuess_Method2(KSPFischerGuess_Method2*, _p_Vec*, _p_Vec*)???: iguess.c:198: warning: cannot pass objects of non-POD type ???struct std::complex??? through ???...???; call will abort at runtime libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cr libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bcgs libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bcgsl libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bcgsl/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/cgne libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/gltr libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/gltr/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/nash libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/nash/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/stcg libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/stcg/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cgs libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/lgmres libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/fgmres libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/fgmres/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/fgmres/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cheby libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cheby/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/rich libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/rich/ftn-autolibfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/lsqr libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/preonly libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/tcqmr libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/tfqmr libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/qcg libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/qcg/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bicg libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/minres libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/symmlq libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/lcd libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/interface/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/mf libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/mf/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/ls libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/ls/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/ls/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/tr libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/test libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/picard libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples/tutorials/ex10d libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/utils/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit/euler libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit/rk libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit/rk/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/implicit libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/implicit/beuler libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/ts/impls/implicit/cn libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/pseudo libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/pseudo/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/basic libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/basic/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/mapping libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/mapping/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/mapping/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/contrib libfast in: /home/amit/programs/ParEM/petsc-dev/src/contrib/fun3d libfast in: /home/amit/programs/ParEM/petsc-dev/src/benchmarks libfast in: /home/amit/programs/ParEM/petsc-dev/src/fortran libfast in: /home/amit/programs/ParEM/petsc-dev/src/fortran/fsrc make[7]: *** No rule to make target `libf'. Stop. 
libfast in: /home/amit/programs/ParEM/petsc-dev/src/docs
libfast in: /home/amit/programs/ParEM/petsc-dev/include
libfast in: /home/amit/programs/ParEM/petsc-dev/include/finclude
libfast in: /home/amit/programs/ParEM/petsc-dev/tutorials
libfast in: /home/amit/programs/ParEM/petsc-dev/tutorials/multiphysics
Completed building libraries
=========================================
Shared libraries disabled
********************************************************************
Error during compile, check linux-gnu-c-debug/conf/make.log
Send it and linux-gnu-c-debug/conf/configure.log to petsc-maint at mcs.anl.gov
********************************************************************
make: [all] Error 1 (ignored)
Running test examples to verify correct installation
make[2]: [ex19.PETSc] Error 2 (ignored)
make[2]: [ex5f.PETSc] Error 2 (ignored)
--------------Error detected during compile or link!-----------------------
See http://www.mcs.anl.gov/petsc/petsc-2/documentation/troubleshooting.html
gfortran -I/home/amit/programs/ParEM/petsc-dev/include/finclude -c -o ex5f.o ex5f.F
In file included from ex5f.F:43:
ex5f.h:32: error: include/finclude/petsc.h: No such file or directory
ex5f.h:33: error: include/finclude/petscvec.h: No such file or directory
ex5f.h:34: error: include/finclude/petscda.h: No such file or directory
ex5f.h:35: error: include/finclude/petscis.h: No such file or directory
ex5f.h:36: error: include/finclude/petscmat.h: No such file or directory
ex5f.h:37: error: include/finclude/petscksp.h: No such file or directory
ex5f.h:38: error: include/finclude/petscpc.h: No such file or directory
ex5f.h:39: error: include/finclude/petscsnes.h: No such file or directory
make[3]: *** [ex5f.o] Error 1
Completed test examples

I guess the tests fail because the program looks for
include/finclude/petsc.h in /include/finclude. What about libf ?

"Matthew Knepley"
Sent by: owner-petsc-users at mcs.anl.gov
To: petsc-users at mcs.anl.gov
Subject: Re: Multilevel solver
04/23/2008 02:43 PM
Please respond to petsc-users at mcs.anl.gov

On Wed, Apr 23, 2008 at 1:32 PM,  wrote:
> Barry,
>
> Is the installation of petsc-dev different from the installation of the
> 2.3.3 release ? I ran the config. But the folder tree seems to be
> different. Hence, make is giving problems.

1) Always always send the error log. I cannot tell anything from the
description "problems".

2) Some things have moved, but of course, make will work with the new
organization.

   Matt

> Amit
>
> Barry Smith
> Sent by: owner-petsc-users at mcs.anl.gov
> To: petsc-users at mcs.anl.gov
> Subject: Re: Multilevel solver
> 04/22/2008 10:08 PM
> Please respond to petsc-users at mcs.anl.gov
>
> Amit,
>
> Using a PCSHELL should be fine (it can be used with GMRES),
> my guess is there is a memory corruption error somewhere that is
> causing the crash.
> This could be tracked down with www.valgrind.com
>
> Another way you could implement this is with some very recent
> additions I made to PCFIELDSPLIT that are in petsc-dev
> (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html)
> With this you would choose
> PCSetType(pc,PCFIELDSPLIT);
> PCFieldSplitSetIS(pc,is1);
> PCFieldSplitSetIS(pc,is2);
> PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE);
> To use LU on A11 use the command line options
> -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly
> and SOR on A22
> -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly
> -fieldsplit_1_pc_sor_lits <its>, where <its> is the number of iterations
> you want to use on block A22.
>
> is1 is the IS that contains the indices for all the vector entries in
> the 1 block, while is2 is all indices in the vector for the 2 block.
> You can use ISCreateGeneral() to create these.
>
> Probably it is easiest just to try this out.
>
> Barry
>
> On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote:
>
> > Hi,
> >
> > I am trying to implement a multilevel method for an EM problem. The
> > reference is: "Comparison of hierarchical basis functions for efficient
> > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger,
> > IET Sci. Meas. Technol. 2007, 1(1), pp 48-52.
> >
> > Here is the summary:
> >
> > The matrix equation Ax=b is solved using GMRES with a multilevel
> > pre-conditioner. A has a block structure:
> >
> > [ A11  A12 ] [ x1 ]   [ b1 ]
> > [ A21  A22 ] [ x2 ] = [ b2 ]
> >
> > A11 is mxm and A22 is nxn, where m is not equal to n.
> >
> > Step 1: Solve A11 * e1 = b1           (parallel LU using SuperLU or MUMPS)
> >
> > Step 2: Solve A22 * e2 = b2 - A21*e1  (might either use a SOR solver or
> >                                        a parallel LU)
> >
> > Step 3: Solve A11 * e1 = b1 - A12*e2  (parallel LU)
> >
> > This gives the approximate solution to
> >
> > [ A11  A12 ] [ e1 ]   [ b1 ]
> > [ A21  A22 ] [ e2 ] = [ b2 ]
> >
> > and is used as the pre-conditioner for the GMRES.
> >
> > Which PETSc method can implement this pre-conditioner ? I tried a
> > PCSHELL type PC. With Hong's help, I also got the parallel LU to work
> > with SuperLU/MUMPS. My program runs successfully on multiple processes
> > on a single machine. But when I submit the program over multiple
> > machines, I get a crash in the PCApply routine after several GMRES
> > iterations. I think this has to do with using PCSHELL with GMRES (which
> > is not a good idea). Is there a different way to implement this ? Does
> > this resemble the usage pattern of one of the AMG preconditioners ?
> >
> > Thanks
> >
> > Rgds,
> > Amit

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener
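The following C sketch is added for illustration and is not part of the original
thread. It wires up the PCFIELDSPLIT setup Barry describes above for the 2x2
block system in Amit's message. Assumptions: a KSP named ksp already holds the
assembled operator; the 1 block occupies the first m rows of the vector and the
2 block the next n rows, so ISCreateStride() stands in for the more general
ISCreateGeneral() Barry mentions; the index sets are built as if running on a
single process (in parallel each rank would pass only its locally owned
indices); and the hypothetical helper SetupBlockPC uses the petsc-dev interface
of that period, in which PCFieldSplitSetIS() took (PC,IS) -- later releases also
take a split-name argument. Error cleanup is omitted.

#include "petscksp.h"

/* Hypothetical helper (editor's sketch): set up a two-way fieldsplit
   preconditioner on an existing KSP.  m and n are the sizes of the
   A11 and A22 blocks under the assumed contiguous row layout. */
PetscErrorCode SetupBlockPC(KSP ksp, PetscInt m, PetscInt n)
{
  PC             pc;
  IS             is1, is2;
  PetscErrorCode ierr;

  ierr = KSPGetPC(ksp, &pc); CHKERRQ(ierr);
  ierr = PCSetType(pc, PCFIELDSPLIT); CHKERRQ(ierr);

  /* 1 block = indices 0..m-1, 2 block = indices m..m+n-1 (assumed layout) */
  ierr = ISCreateStride(PETSC_COMM_WORLD, m, 0, 1, &is1); CHKERRQ(ierr);
  ierr = ISCreateStride(PETSC_COMM_WORLD, n, m, 1, &is2); CHKERRQ(ierr);
  ierr = PCFieldSplitSetIS(pc, is1); CHKERRQ(ierr);
  ierr = PCFieldSplitSetIS(pc, is2); CHKERRQ(ierr);

  /* symmetric multiplicative split: A11 solve, A22 solve, A11 solve,
     mirroring Steps 1-3 of the multilevel preconditioner quoted above */
  ierr = PCFieldSplitSetType(pc, PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE); CHKERRQ(ierr);

  /* per-block solvers are then chosen at run time with the options
     Barry lists, e.g.
       -fieldsplit_0_ksp_type preonly -fieldsplit_0_pc_type lu
       -fieldsplit_1_ksp_type preonly -fieldsplit_1_pc_type sor */
  ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);
  return 0;
}

Since the fieldsplit interface was still changing in petsc-dev at the time, the
exact call signatures and option names are worth checking against the manual
pages of whichever release is actually in use.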
From knepley at gmail.com  Wed Apr 23 15:20:06 2008
From: knepley at gmail.com (Matthew Knepley)
Date: Wed, 23 Apr 2008 15:20:06 -0500
Subject: Multilevel solver
In-Reply-To:
References:
Message-ID:

On Wed, Apr 23, 2008 at 3:05 PM,  wrote:
> Here is my make log.

When you clone petsc-dev, you need to run make allfortranstubs before
'make'. The dev docs will be fixed,

   Matt

> ==========================================
>
> See documentation/faq.html and documentation/bugreporting.html
> for help with installation problems.
Please send EVERYTHING > printed out below when reporting problems > > To subscribe to the PETSc announcement list, send mail to > majordomo at mcs.anl.gov with the message: > subscribe petsc-announce > > To subscribe to the PETSc users mailing list, send mail to > majordomo at mcs.anl.gov with the message: > subscribe petsc-users > > ========================================== > On Wed Apr 23 15:37:17 EDT 2008 on tabla > Machine characteristics: Linux tabla 2.6.22-14-generic #1 SMP Tue Feb 12 07:42:25 UTC 2008 i686 GNU/Linux > ----------------------------------------- > Using PETSc directory: /home/amit/programs/ParEM/petsc-dev > Using PETSc arch: linux-gnu-c-debug > ----------------------------------------- > PETSC_VERSION_RELEASE 0 > PETSC_VERSION_MAJOR 2 > PETSC_VERSION_MINOR 3 > PETSC_VERSION_SUBMINOR 3 > PETSC_VERSION_PATCH 12 > PETSC_VERSION_DATE "May, 23, 2007" > PETSC_VERSION_PATCH_DATE "unknown" > PETSC_VERSION_HG "unknown" > ----------------------------------------- > Using configure Options: --PETSC_ARCH=linux-gnu-c-debug --with-scalar-type=complex --with-debugging=yes --with-clanguage=cxx --with-mpi=1 > --download-mpich=1 --with-metis=1 --download-metis=1 --with-parmetis=1 --download-parmetis=1 --with-superlu_dist=1 --download-superlu_dist=1 > --with-mumps=1 --download-blacs=1 --download-scalapack=1 --download-mumps=1 COPTFLAGS="-O3 -march=p4 -mtune=p4 -ffast-math -malign-double > -funroll-loops -pipe -fomit-frame-pointer -finline-functions -msse2" CXXOPTFLAGS="-O3 -march=p4 -mtune=p4 -ffast-math -malign-double -funroll-loops > -pipe -fomit-frame-pointer -finline-functions -msse2" --with-shared=0 > Using configuration flags: > #define INCLUDED_PETSCCONF_H > #define IS_COLORING_MAX 65535 > #define STDC_HEADERS 1 > #define MPIU_COLORING_VALUE MPI_UNSIGNED_SHORT > #define PETSC_HAVE_SUPERLU_DIST 1 > #define PETSC_STATIC_INLINE static inline > #define PETSC_HAVE_BLACS 1 > #define PETSC_HAVE_MUMPS 1 > #define PETSC_DIR_SEPARATOR '/' > #define PETSC_HAVE_BLASLAPACK 1 > #define PETSC_PATH_SEPARATOR ':' > #define PETSC_REPLACE_DIR_SEPARATOR '\\' > #define PETSC_HAVE_DOUBLE_ALIGN_MALLOC 1 > #define PETSC_RESTRICT __restrict__ > #define PETSC_HAVE_X11 1 > #define PETSC_HAVE_SOWING 1 > #define PETSC_HAVE_SCALAPACK 1 > #define PETSC_HAVE_MPI 1 > #define PETSC_USE_SOCKET_VIEWER 1 > #define PETSC_HAVE_PARMETIS 1 > #define PETSC_HAVE_C2HTML 1 > #define PETSC_HAVE_FORTRAN 1 > #define PETSC_HAVE_STRING_H 1 > #define PETSC_HAVE_SYS_TYPES_H 1 > #define PETSC_HAVE_ENDIAN_H 1 > #define PETSC_HAVE_SYS_PROCFS_H 1 > #define PETSC_HAVE_LINUX_KERNEL_H 1 > #define PETSC_HAVE_TIME_H 1 > #define PETSC_HAVE_MATH_H 1 > #define PETSC_HAVE_STDLIB_H 1 > #define PETSC_HAVE_SYS_PARAM_H 1 > #define PETSC_HAVE_SYS_SOCKET_H 1 > #define PETSC_HAVE_UNISTD_H 1 > #define PETSC_HAVE_SYS_WAIT_H 1 > #define PETSC_HAVE_LIMITS_H 1 > #define PETSC_HAVE_SEARCH_H 1 > #define PETSC_HAVE_NETINET_IN_H 1 > #define PETSC_HAVE_FLOAT_H 1 > #define PETSC_HAVE_SYS_SYSINFO_H 1 > #define PETSC_HAVE_SYS_RESOURCE_H 1 > #define PETSC_HAVE_SYS_TIMES_H 1 > #define PETSC_HAVE_NETDB_H 1 > #define PETSC_HAVE_MALLOC_H 1 > #define PETSC_HAVE_PWD_H 1 > #define PETSC_HAVE_FCNTL_H 1 > #define PETSC_HAVE_STRINGS_H 1 > #define PETSC_HAVE_MEMORY_H 1 > #define PETSC_TIME_WITH_SYS_TIME 1 > #define PETSC_HAVE_SYS_TIME_H 1 > #define PETSC_HAVE_SYS_UTSNAME_H 1 > #define PETSC_USING_F90 1 > #define PETSC_PRINTF_FORMAT_CHECK(A,B) __attribute__((format (printf, A, B))) > #define PETSC_C_STATIC_INLINE static inline > #define PETSC_HAVE_FORTRAN_UNDERSCORE 1 > 
#define PETSC_HAVE_CXX_NAMESPACE 1 > #define PETSC_C_RESTRICT __restrict__ > #define PETSC_USE_F90_SRC_IMPL 1 > #define PETSC_CXX_RESTRICT __restrict__ > #define PETSC_CXX_STATIC_INLINE static inline > #define PETSC_HAVE_LIBBLAS 1 > #define PETSC_HAVE_LIBDMUMPS 1 > #define PETSC_HAVE_LIBZMUMPS 1 > #define PETSC_HAVE_LIBSCALAPACK 1 > #define PETSC_HAVE_LIBM 1 > #define PETSC_HAVE_LIBMETIS 1 > #define PETSC_HAVE_LIBLAPACK 1 > #define PETSC_HAVE_LIBCMUMPS 1 > #define PETSC_HAVE_LIBSMUMPS 1 > #define PETSC_HAVE_LIBGCC_S 1 > #define PETSC_HAVE_LIBPORD 1 > #define PETSC_HAVE_LIBGFORTRANBEGIN 1 > #define PETSC_HAVE_ERF 1 > #define PETSC_HAVE_LIBSUPERLU_DIST_2 1 > #define PETSC_HAVE_LIBBLACS 1 > #define PETSC_HAVE_LIBPARMETIS 1 > #define PETSC_HAVE_LIBGFORTRAN 1 > #define PETSC_ARCH_NAME "linux-gnu-c-debug" > #define PETSC_ARCH linux > #define PETSC_DIR /home/amit/programs/ParEM/petsc-dev > #define PETSC_CLANGUAGE_CXX 1 > #define PETSC_USE_ERRORCHECKING 1 > #define PETSC_MISSING_DREAL 1 > #define PETSC_SIZEOF_MPI_COMM 4 > #define PETSC_BITS_PER_BYTE 8 > #define PETSC_SIZEOF_MPI_FINT 4 > #define PETSC_SIZEOF_VOID_P 4 > #define PETSC_RETSIGTYPE void > #define PETSC_HAVE_CXX_COMPLEX 1 > #define PETSC_SIZEOF_LONG 4 > #define PETSC_USE_FORTRANKIND 1 > #define PETSC_SIZEOF_SIZE_T 4 > #define PETSC_SIZEOF_CHAR 1 > #define PETSC_SIZEOF_DOUBLE 8 > #define PETSC_SIZEOF_FLOAT 4 > #define PETSC_HAVE_C99_COMPLEX 1 > #define PETSC_SIZEOF_INT 4 > #define PETSC_SIZEOF_LONG_LONG 8 > #define PETSC_SIZEOF_SHORT 2 > #define PETSC_HAVE_STRCASECMP 1 > #define PETSC_HAVE_ISNAN 1 > #define PETSC_HAVE_POPEN 1 > #define PETSC_HAVE_SIGSET 1 > #define PETSC_HAVE_GETWD 1 > #define PETSC_HAVE_TIMES 1 > #define PETSC_HAVE_SNPRINTF 1 > #define PETSC_HAVE_GETPWUID 1 > #define PETSC_HAVE_ISINF 1 > #define PETSC_HAVE_GETHOSTBYNAME 1 > #define PETSC_HAVE_SLEEP 1 > #define PETSC_HAVE_FORK 1 > #define PETSC_HAVE_RAND 1 > #define PETSC_HAVE_GETTIMEOFDAY 1 > #define PETSC_HAVE_UNAME 1 > #define PETSC_HAVE_GETHOSTNAME 1 > #define PETSC_HAVE_MKSTEMP 1 > #define PETSC_HAVE_SIGACTION 1 > #define PETSC_HAVE_DRAND48 1 > #define PETSC_HAVE_VA_COPY 1 > #define PETSC_HAVE_CLOCK 1 > #define PETSC_HAVE_ACCESS 1 > #define PETSC_HAVE_SIGNAL 1 > #define PETSC_HAVE_GETRUSAGE 1 > #define PETSC_HAVE_MEMALIGN 1 > #define PETSC_HAVE_GETDOMAINNAME 1 > #define PETSC_HAVE_TIME 1 > #define PETSC_HAVE_LSEEK 1 > #define PETSC_HAVE_SOCKET 1 > #define PETSC_HAVE_SYSINFO 1 > #define PETSC_HAVE_READLINK 1 > #define PETSC_HAVE_REALPATH 1 > #define PETSC_HAVE_MEMMOVE 1 > #define PETSC_HAVE__GFORTRAN_IARGC 1 > #define PETSC_SIGNAL_CAST > #define PETSC_HAVE_GETCWD 1 > #define PETSC_HAVE_VPRINTF 1 > #define PETSC_HAVE_BZERO 1 > #define PETSC_HAVE_GETPAGESIZE 1 > #define PETSC_USE_COMPLEX 1 > #define PETSC_USE_GDB_DEBUGGER 1 > #define PETSC_HAVE_GFORTRAN_IARGC 1 > #define PETSC_USE_DEBUG 1 > #define PETSC_USE_INFO 1 > #define PETSC_USE_LOG 1 > #define PETSC_IS_COLOR_VALUE_TYPE short > #define PETSC_USE_CTABLE 1 > #define PETSC_USE_PROC_FOR_SIZE 1 > #define PETSC_HAVE_MPI_COMM_C2F 1 > #define PETSC_HAVE_MPI_COMM_F2C 1 > #define PETSC_HAVE_MPI_FINT 1 > #define PETSC_HAVE_MPI_F90MODULE 1 > #define PETSC_HAVE_MPI_ALLTOALLW 1 > #define PETSC_HAVE_MPI_COMM_SPAWN 1 > #define PETSC_HAVE_MPI_WIN_CREATE 1 > #define PETSC_HAVE_MPI_FINALIZED 1 > #define HAVE_GZIP 1 > #define PETSC_BLASLAPACK_UNDERSCORE 1 > ----------------------------------------- > Using include paths: -I/home/amit/programs/ParEM/petsc-dev -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include > 
-I/home/amit/programs/ParEM/petsc-dev/include -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include > -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include > -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include > ------------------------------------------ > Using C/C++ compiler: /home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/bin/mpicxx > C/C++ Compiler version: > Using Fortran compiler: gfortran -g > Fortran Compiler version: > ----------------------------------------- > Using C/C++ linker: > Using Fortran linker: /home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/bin/mpif90 > ----------------------------------------- > Using libraries: -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib > -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -lX11 > -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lsuperlu_dist_2.2 > -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lparmetis -lmetis > -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lcmumps -ldmumps > -lsmumps -lzmumps -lpord -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib > -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lscalapack -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib > -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lblacs -llapack -lblas -L/usr/lib/gcc/i486-linux-gnu/4.1.3 -L/lib -lgcc_s > -lgfortranbegin -lgfortran -lm -L/usr/lib/gcc/i486-linux-gnu/4.2.1 -lm -lstdc++ -lstdc++ -lgcc_s > ------------------------------------------ > Using mpiexec: /home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/bin/mpiexec > ========================================== > /bin/rm -f -f /home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib/libpetsc*.* > BEGINNING TO COMPILE LIBRARIES IN ALL DIRECTORIES > ========================================= > libfast in: /home/amit/programs/ParEM/petsc-dev/src > libfast in: /home/amit/programs/ParEM/petsc-dev/src/inline > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/socket > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/socket/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/ascii > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/ascii/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/ascii/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/string > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/string/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/draw > 
libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/draw/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/draw/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/vu > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls/x > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls/x/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls/ps > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/utils/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/error > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/error/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/error/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/dll > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/dll/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/fileio > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/fileio/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/fileio/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/memory > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/memory/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/memory/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/objects > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/objects/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/objects/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/time > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/time/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/utils/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random > libfast 
in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/impls/rand > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/impls/rand48 > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/bag > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/bag/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/bag/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/verbose > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/verbose/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/verbose/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-src > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-src/fsrc > make[8]: *** No rule to make target `libf'. Stop. > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-mod > make[7]: *** No rule to make target `petscmod.o'. Stop. > make[6]: *** [buildmod] Error 2 > make[5]: [libfast] Error 2 (ignored) > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/utils/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/seq > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/seq/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/seq/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/mpi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/mpi/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/mpi/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/shared > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/shared/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface/f90-custom > libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/vec/is/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/general > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/general/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/stride > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/stride/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/block > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/block/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/block/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/impls/constant > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/impls/string > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/seq > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/seq/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/mpi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/mpi/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/mpi/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/csrperm > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/crl > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/superlu_dist > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/mumps > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/csrperm > libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/crl > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/aij > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/shell > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/shell/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/shell/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/seq > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/seq/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/seq/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/mpi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/mpi/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/mpi/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/adj > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/adj/mpi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/adj/mpi/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/maij > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/is > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/is/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/seq > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/seq/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/seq/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/mpi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/mpi/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/normal > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/normal/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/lrc > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/lrc/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/scatter > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/scatter/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/blockmat > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/blockmat/seq > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/composite > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/composite/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/mffd > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/mffd/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/mffd/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/utils/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/matfd > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/matfd/ftn-auto > libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/mat/matfd/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/impls/pmetis > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/impls/pmetis/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/order > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/order/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/order/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/color > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/color/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/jacobi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/jacobi/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/none > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/sor > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/sor/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/shell > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/shell/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/bjacobi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/bjacobi/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/bjacobi/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/dmmg > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/dmmg/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/eisens > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/eisens/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asm > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asm/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asm/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/ksp > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/ksp/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/composite > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/composite/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/composite/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/redundant > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/redundant/ftn-auto > libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/is > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/is/nn > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/pbjacobi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mat > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/fieldsplit > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/fieldsplit/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/lu > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/lu/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/lu/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/ilu > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/ilu/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/icc > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/cholesky > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/cholesky/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/galerkin > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/galerkin/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/openmp > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asa > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/cp > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface > iguess.c: In function ???PetscErrorCode KSPFischerGuessFormGuess_Method1(KSPFischerGuess_Method1*, _p_Vec*, _p_Vec*)???: > iguess.c:79: warning: cannot pass objects of non-POD type ???struct std::complex??? through ???...???; call will abort at runtime > iguess.c: In function ???PetscErrorCode KSPFischerGuessFormGuess_Method2(KSPFischerGuess_Method2*, _p_Vec*, _p_Vec*)???: > iguess.c:198: warning: cannot pass objects of non-POD type ???struct std::complex??? 
through ???...???; call will abort at runtime > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cr > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bcgs > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bcgsl > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bcgsl/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/cgne > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/gltr > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/gltr/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/nash > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/nash/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/stcg > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/stcg/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cgs > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/lgmres > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/fgmres > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/fgmres/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/fgmres/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cheby > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cheby/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/rich > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/rich/ftn-autolibfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/lsqr > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/preonly > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/tcqmr > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/tfqmr > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/qcg > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/qcg/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bicg > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/minres > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/symmlq > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/lcd > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/interface/ftn-custom > libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/snes/interface/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/mf > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/mf/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/ls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/ls/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/ls/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/tr > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/test > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/picard > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples/tutorials/ex10d > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/utils/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit/euler > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit/rk > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit/rk/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/implicit > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/implicit/beuler > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/implicit/cn > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/pseudo > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/pseudo/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/basic > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/basic/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/mapping > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/mapping/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/mapping/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/examples/tests > libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/dm/ao/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/contrib > libfast in: /home/amit/programs/ParEM/petsc-dev/src/contrib/fun3d > libfast in: /home/amit/programs/ParEM/petsc-dev/src/benchmarks > libfast in: /home/amit/programs/ParEM/petsc-dev/src/fortran > libfast in: /home/amit/programs/ParEM/petsc-dev/src/fortran/fsrc > make[7]: *** No rule to make target `libf'. Stop. > libfast in: /home/amit/programs/ParEM/petsc-dev/src/docs > libfast in: /home/amit/programs/ParEM/petsc-dev/include > libfast in: /home/amit/programs/ParEM/petsc-dev/include/finclude > libfast in: /home/amit/programs/ParEM/petsc-dev/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/tutorials/multiphysics > Completed building libraries > ========================================= > Shared libraries disabled > ******************************************************************** > Error during compile, check linux-gnu-c-debug/conf/make.log > Send it and linux-gnu-c-debug/conf/configure.log to petsc-maint at mcs.anl.gov > ******************************************************************** > make: [all] Error 1 (ignored) > Running test examples to verify correct installation > make[2]: [ex19.PETSc] Error 2 (ignored) > make[2]: [ex5f.PETSc] Error 2 (ignored) > --------------Error detected during compile or link!----------------------- > See http://www.mcs.anl.gov/petsc/petsc-2/documentation/troubleshooting.html > gfortran -I/home/amit/programs/ParEM/petsc-dev/include/finclude -c -o ex5f.o ex5f.F > In file included from ex5f.F:43: > ex5f.h:32: error: include/finclude/petsc.h: No such file or directory > ex5f.h:33: error: include/finclude/petscvec.h: No such file or directory > ex5f.h:34: error: include/finclude/petscda.h: No such file or directory > ex5f.h:35: error: include/finclude/petscis.h: No such file or directory > ex5f.h:36: error: include/finclude/petscmat.h: No such file or directory > ex5f.h:37: error: include/finclude/petscksp.h: No such file or directory > ex5f.h:38: error: include/finclude/petscpc.h: No such file or directory > ex5f.h:39: error: include/finclude/petscsnes.h: No such file or directory > make[3]: *** [ex5f.o] Error 1 > Completed test examples > > > I guess the tests fail because the program looks for > include/finclude/petsc.h in 
/include/finclude. What about libf ? > > > > > > > "Matthew Knepley" > > m> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: Multilevel solver > > > 04/23/2008 02:43 > > > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > On Wed, Apr 23, 2008 at 1:32 PM, wrote: > > Barry, > > > > Is the installation of petsc-dev different from the installation of the > > 2.3.3 release ? I ran the config. But the folder tree seems to be > > different. Hence, make is giving problems. > > 1) Always always send the error log. I cannot tell anything from the > description "problems". > > 2) Some things have moved, but of course, make will work with the new > organization. > > Matt > > > Amit > > > > Barry Smith > > > ov> > To > > Sent by: petsc-users at mcs.anl.gov > > owner-petsc-users > cc > > @mcs.anl.gov > > No Phone Info > Subject > > Available Re: Multilevel solver > > > > > > 04/22/2008 10:08 > > PM > > > > > > Please respond to > > petsc-users at mcs.a > > nl.gov > > > > > > > > > > > > > > Amit, > > > > Using a a PCSHELL should be fine (it can be used with GMRES), > > my guess is there is a memory corruption error somewhere that is > > causing the crash. This could be tracked down with www.valgrind.com > > > > Another way to you could implement this is with some very recent > > additions I made to PCFIELDSPLIT that are in petsc-dev > > (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) > > With this you would chose > > PCSetType(pc,PCFIELDSPLIT > > PCFieldSplitSetIS(pc,is1 > > PCFieldSplitSetIS(pc,is2 > > PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > > to use LU on A11 use the command line options > > -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly > > and SOR on A22 > > -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - > > fieldsplit_1_pc_sor_lits where > > is the number of iterations you want to use block A22 > > > > is1 is the IS that contains the indices for all the vector entries in > > the 1 block while is2 is all indices in the > > vector for the 2 block. You can use ISCreateGeneral() to create these. > > > > Probably it is easiest just to try this out. > > > > Barry > > > > > > On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > > > > > > > Hi, > > > > > > I am trying to implement a multilevel method for an EM problem. The > > > reference is : "Comparison of hierarchical basis functions for > > > efficient > > > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > > > IET > > > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > > > > > Here is the summary: > > > > > > The matrix equation Ax=b is solved using GMRES with a multilevel > > > pre-conditioner. A has a block structure. > > > > > > A11 A12 * x1 = b1 > > > A21 A22 x2 b2 > > > > > > A11 is mxm and A33 is nxn, where m is not equal to n. > > > > > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > > > superLU or > > > MUMPS) > > > > > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > > > a SOR > > > solver or a parallel LU) > > > > > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > > > > > This gives the approximate solution to > > > > > > A11 A12 * e1 = b1 > > > A21 A22 e2 b2 > > > > > > and is used as the pre-conditioner for the GMRES. > > > > > > > > > Which PetSc method can implement this pre-conditioner ? I tried a > > > PCSHELL > > > type PC. With Hong's help, I also got the parallel LU to work > > > withSuperLU/MUMPS. 
My program runs successfully on multiple > > > processes on a > > > single machine. But when I submit the program over multiple > > > machines, I get > > > a crash in the PCApply routine after several GMRES iterations. I > > > think this > > > has to do with using PCSHELL with GMRES (which is not a good idea). Is > > > there a different way to implement this ? Does this resemble the usage > > > pattern of one of the AMG preconditioners ? > > > > > > > > > Thanks > > > > > > Rgds, > > > Amit > > > > > > > > > > > > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From balay at mcs.anl.gov Wed Apr 23 15:28:46 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 23 Apr 2008 15:28:46 -0500 (CDT) Subject: Multilevel solver In-Reply-To: References: Message-ID: On Wed, 23 Apr 2008, Amit.Itagi at seagate.com wrote: > Using configure Options: --PETSC_ARCH=linux-gnu-c-debug > --with-scalar-type=complex --with-debugging=yes --with-clanguage=cxx > --with-mpi=1 --download-mpich=1 --with-metis=1 --download-metis=1 > --with-parmetis=1 --download-parmetis=1 --with-superlu_dist=1 > --download-superlu_dist=1 --with-mumps=1 --download-blacs=1 > --download-scalapack=1 --download-mumps=1 COPTFLAGS="-O3 -march=p4 > -mtune=p4 -ffast-math -malign-double -funroll-loops -pipe > -fomit-frame-pointer -finline-functions -msse2" CXXOPTFLAGS="-O3 > -march=p4 -mtune=p4 -ffast-math -malign-double -funroll-loops -pipe > -fomit-frame-pointer -finline-functions -msse2" --with-shared=0 For one - when debugging - you should not use optimization flags "-O3 etc.." Can you send the corresponding configure.log to petsc-maint at mcs.anl.gov? Also - what do you have for 'hg status' Satish From balay at mcs.anl.gov Wed Apr 23 15:31:51 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 23 Apr 2008 15:31:51 -0500 (CDT) Subject: Multilevel solver In-Reply-To: References: Message-ID: On Wed, 23 Apr 2008, Matthew Knepley wrote: > On Wed, Apr 23, 2008 at 3:05 PM, wrote: > > Here is my make log. > > When you clone petsc-dev, you need to run > > make allfortranstubs > > before 'make'. The dev docs will be fixed, Matt, Its not fortranstubs issue. For one configure should regenerate them. However > > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-src/fsrc > > make[8]: *** No rule to make target `libf'. Stop. > > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-mod > > make[7]: *** No rule to make target `petscmod.o'. Stop. > > make[6]: *** [buildmod] Error 2 > > make[5]: [libfast] Error 2 (ignored) The locations are not ftn-auto [which would correspond to the stubs]. 
Satish From balay at mcs.anl.gov Wed Apr 23 15:33:31 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 23 Apr 2008 15:33:31 -0500 (CDT) Subject: Multilevel solver In-Reply-To: References: Message-ID: On Wed, 23 Apr 2008, Satish Balay wrote: > On Wed, 23 Apr 2008, Amit.Itagi at seagate.com wrote: > > > Using configure Options: --PETSC_ARCH=linux-gnu-c-debug > > --with-scalar-type=complex --with-debugging=yes --with-clanguage=cxx > > --with-mpi=1 --download-mpich=1 --with-metis=1 --download-metis=1 > > --with-parmetis=1 --download-parmetis=1 --with-superlu_dist=1 > > --download-superlu_dist=1 --with-mumps=1 --download-blacs=1 > > --download-scalapack=1 --download-mumps=1 COPTFLAGS="-O3 -march=p4 > > -mtune=p4 -ffast-math -malign-double -funroll-loops -pipe > > -fomit-frame-pointer -finline-functions -msse2" CXXOPTFLAGS="-O3 > > -march=p4 -mtune=p4 -ffast-math -malign-double -funroll-loops -pipe > > -fomit-frame-pointer -finline-functions -msse2" --with-shared=0 > > > For one - when debugging - you should not use optimization flags "-O3 etc.." > > Can you send the corresponding configure.log to petsc-maint at mcs.anl.gov? > > Also - what do you have for 'hg status' Ah - I think you pulled petsc-dev - but not BuildSystem. If you are using petsc-dev - you should be pulling both. http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html Satish From nliu at fit.edu Wed Apr 23 22:04:01 2008 From: nliu at fit.edu (Ningyu Liu) Date: Wed, 23 Apr 2008 23:04:01 -0400 Subject: Question on TS_EULER Message-ID: <7091BC9B-1920-4E22-A703-354D87BEA388@fit.edu> Hello, Is there a way by which the timestep of the explicit forward Euler method can be modified during the iterations? Looking at the source code of the method, the timestep dt is set when entering the function TSStep_Euler(). The iteration proceeds with this fixed timestep even calling TSSetTimeStep() in a monitoring function. I personally find it's a bit confusing. The actual solution is obtained with fixed timestep. However, the time returned from calling TSGetTime() takes into account any modifications made by the user. Thanks. Regards, Ningyu -------------- next part -------------- An HTML attachment was scrubbed... URL: From zonexo at gmail.com Thu Apr 24 04:11:25 2008 From: zonexo at gmail.com (Ben Tay) Date: Thu, 24 Apr 2008 17:11:25 +0800 Subject: Using PETSc libraries with MS Compute cluster and MS MPI Message-ID: <48104EBD.3080104@gmail.com> Hi, I'm trying to run my mpi code on the MS Compute cluster which my school just installed. Unfortunately, it failed without giving any error msg. I am using just a test example ex2f. I read in the MS website that there is no need to use MS MPI to compile the code or library. Anyway, I also tried to compile PETSc with MS MPI but I'm not able to get pass ./configure. It always complains that there is something wrong with the MS MPI. Is there anyone who has experience in these? Thank you very much. Regards. From knepley at gmail.com Thu Apr 24 08:54:07 2008 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 24 Apr 2008 08:54:07 -0500 Subject: Question on TS_EULER In-Reply-To: <7091BC9B-1920-4E22-A703-354D87BEA388@fit.edu> References: <7091BC9B-1920-4E22-A703-354D87BEA388@fit.edu> Message-ID: That is a bug. I have just fixed it in petsc-dev. 
You can easily fix it in your copy (if you are using the release) by changing line 41 to ierr = VecAXPY(sol,ts->time_step,update);CHKERRQ(ierr); Matt On Wed, Apr 23, 2008 at 10:04 PM, Ningyu Liu wrote: > Hello, > > Is there a way by which the timestep of the explicit forward Euler method > can be modified during the iterations? Looking at the source code of the > method, the timestep dt is set when entering the function TSStep_Euler(). > The iteration proceeds with this fixed timestep even calling TSSetTimeStep() > in a monitoring function. I personally find it's a bit confusing. The actual > solution is obtained with fixed timestep. However, the time returned from > calling TSGetTime() takes into account any modifications made by the user. > Thanks. > > Regards, > > Ningyu -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From Amit.Itagi at seagate.com Thu Apr 24 08:58:48 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Thu, 24 Apr 2008 09:58:48 -0400 Subject: Multilevel solver In-Reply-To: Message-ID: Barry, I have been trying out the PCFIELDSPLIT. I have not yet gotten it to work. I have some follow up questions which might help solve my problem. Consider the simple case of a 4x4 matrix equation being solved on two processes. I have vector elements 0 and 1 belonging to rank 0, and elements 2 and 3 belonging to rank 1. 1) For my example, can the index sets have staggered indices i.e. is1-> 0,2 and is2->1,3 (each is spans across ranks) ? 2) When I provide the -field_split__pc_type option on the command line, is the index in the same order that the PCFieldSplitSetIS function called in ? So if I have PCFieldSplitSetIS(pc,is2) before PCFieldSplitSetIS(pc,is1), will -field_split_0_... correspond to is2 and -field_split_1_... to is1 ? 3) Since I want to set PC type to lu for field 0, and I want to use MUMPS for parallel LU, where do I set the submatrix type to MATAIJMUMPS ? In this case, will a second copy of the submatrix be generated - one of type MUMPS for the PC and the other of the original MATAIJ type for the KSP ? 4) How is the PC applied when I do PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE ? Thanks Rgds, Amit Barry Smith To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: Multilevel solver 04/22/2008 10:08 PM Please respond to petsc-users at mcs.a nl.gov Amit, Using a a PCSHELL should be fine (it can be used with GMRES), my guess is there is a memory corruption error somewhere that is causing the crash. This could be tracked down with www.valgrind.com Another way to you could implement this is with some very recent additions I made to PCFIELDSPLIT that are in petsc-dev (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) With this you would chose PCSetType(pc,PCFIELDSPLIT PCFieldSplitSetIS(pc,is1 PCFieldSplitSetIS(pc,is2 PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE to use LU on A11 use the command line options -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly and SOR on A22 -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - fieldsplit_1_pc_sor_lits where is the number of iterations you want to use block A22 is1 is the IS that contains the indices for all the vector entries in the 1 block while is2 is all indices in the vector for the 2 block. You can use ISCreateGeneral() to create these. Probably it is easiest just to try this out. 
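Written out in full, the sequence of calls in the recipe quoted above is roughly the sketch below. It follows the petsc-dev interface as described in this thread and assumes the PC object pc and the two index sets is1 and is2 are created elsewhere by the calling code.

   PetscErrorCode ierr;
   ierr = PCSetType(pc,PCFIELDSPLIT);CHKERRQ(ierr);
   ierr = PCFieldSplitSetIS(pc,is1);CHKERRQ(ierr);   /* first call registers the "0" split */
   ierr = PCFieldSplitSetIS(pc,is2);CHKERRQ(ierr);   /* second call registers the "1" split */
   ierr = PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE);CHKERRQ(ierr);

The -fieldsplit_0_* and -fieldsplit_1_* options listed above (preonly with lu on the first block, preonly with sor on the second) then pick the solver applied to each block.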
Barry On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > Hi, > > I am trying to implement a multilevel method for an EM problem. The > reference is : "Comparison of hierarchical basis functions for > efficient > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > IET > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > Here is the summary: > > The matrix equation Ax=b is solved using GMRES with a multilevel > pre-conditioner. A has a block structure. > > A11 A12 * x1 = b1 > A21 A22 x2 b2 > > A11 is mxm and A33 is nxn, where m is not equal to n. > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > superLU or > MUMPS) > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > a SOR > solver or a parallel LU) > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > This gives the approximate solution to > > A11 A12 * e1 = b1 > A21 A22 e2 b2 > > and is used as the pre-conditioner for the GMRES. > > > Which PetSc method can implement this pre-conditioner ? I tried a > PCSHELL > type PC. With Hong's help, I also got the parallel LU to work > withSuperLU/MUMPS. My program runs successfully on multiple > processes on a > single machine. But when I submit the program over multiple > machines, I get > a crash in the PCApply routine after several GMRES iterations. I > think this > has to do with using PCSHELL with GMRES (which is not a good idea). Is > there a different way to implement this ? Does this resemble the usage > pattern of one of the AMG preconditioners ? > > > Thanks > > Rgds, > Amit > From knepley at gmail.com Thu Apr 24 09:19:47 2008 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 24 Apr 2008 09:19:47 -0500 Subject: Multilevel solver In-Reply-To: References: Message-ID: On Thu, Apr 24, 2008 at 8:58 AM, wrote: > Barry, > > I have been trying out the PCFIELDSPLIT. I have not yet gotten it to work. > I have some follow up questions which might help solve my problem. > > Consider the simple case of a 4x4 matrix equation being solved on two > processes. I have vector elements 0 and 1 belonging to rank 0, and elements > 2 and 3 belonging to rank 1. > > 1) For my example, can the index sets have staggered indices i.e. is1-> 0,2 > and is2->1,3 (each is spans across ranks) ? Yes. > 2) When I provide the -field_split__pc_type option on the command line, > is the index in the same order that the PCFieldSplitSetIS function > called in ? > So if I have PCFieldSplitSetIS(pc,is2) before > PCFieldSplitSetIS(pc,is1), will -field_split_0_... correspond to is2 and > -field_split_1_... to is1 ? Yes. > 3) Since I want to set PC type to lu for field 0, and I want to use MUMPS > for parallel LU, where do I set the submatrix type to MATAIJMUMPS ? In this > case, will a second copy of the submatrix be generated - one of type MUMPS > for the PC and the other of the original MATAIJ type for the KSP ? I will have to check. However if we are consistent, then it should be -field_split_0_mat_type aijmumps > 4) How is the PC applied when I do PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE ? It is just the composition of the preconditioners, which is what you want here. 
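For the 4x4 layout in the questions above (vector rows 0 and 1 on rank 0, rows 2 and 3 on rank 1, with the staggered split is1 = {0,2} and is2 = {1,3}), a minimal sketch of the index-set creation could look like the following. It is only an illustration of the answers above, not code from the thread: rank is assumed to hold the MPI rank, and having each process pass just the split entries it owns is an assumption, not something settled here.

   PetscErrorCode ierr;
   PetscInt       idx1[1],idx2[1];
   IS             is1,is2;                       /* is1 = {0,2}, is2 = {1,3} globally */
   if (rank == 0) { idx1[0] = 0; idx2[0] = 1; }  /* rank 0 owns vector rows 0 and 1 */
   else           { idx1[0] = 2; idx2[0] = 3; }  /* rank 1 owns vector rows 2 and 3 */
   ierr = ISCreateGeneral(PETSC_COMM_WORLD,1,idx1,&is1);CHKERRQ(ierr);
   ierr = ISCreateGeneral(PETSC_COMM_WORLD,1,idx2,&is2);CHKERRQ(ierr);

Whichever index set is handed to PCFieldSplitSetIS() first becomes the fieldsplit_0 block; the one registered second becomes fieldsplit_1.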
Matt > Thanks > > Rgds, > Amit > > > > > Barry Smith > ov> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: Multilevel solver > > > 04/22/2008 10:08 > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > Amit, > > Using a a PCSHELL should be fine (it can be used with GMRES), > my guess is there is a memory corruption error somewhere that is > causing the crash. This could be tracked down with www.valgrind.com > > Another way to you could implement this is with some very recent > additions I made to PCFIELDSPLIT that are in petsc-dev > (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) > With this you would chose > PCSetType(pc,PCFIELDSPLIT > PCFieldSplitSetIS(pc,is1 > PCFieldSplitSetIS(pc,is2 > PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > to use LU on A11 use the command line options > -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly > and SOR on A22 > -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - > fieldsplit_1_pc_sor_lits where > is the number of iterations you want to use block A22 > > is1 is the IS that contains the indices for all the vector entries in > the 1 block while is2 is all indices in the > vector for the 2 block. You can use ISCreateGeneral() to create these. > > Probably it is easiest just to try this out. > > Barry > > > On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > > > > Hi, > > > > I am trying to implement a multilevel method for an EM problem. The > > reference is : "Comparison of hierarchical basis functions for > > efficient > > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > > IET > > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > > > Here is the summary: > > > > The matrix equation Ax=b is solved using GMRES with a multilevel > > pre-conditioner. A has a block structure. > > > > A11 A12 * x1 = b1 > > A21 A22 x2 b2 > > > > A11 is mxm and A33 is nxn, where m is not equal to n. > > > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > > superLU or > > MUMPS) > > > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > > a SOR > > solver or a parallel LU) > > > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > > > This gives the approximate solution to > > > > A11 A12 * e1 = b1 > > A21 A22 e2 b2 > > > > and is used as the pre-conditioner for the GMRES. > > > > > > Which PetSc method can implement this pre-conditioner ? I tried a > > PCSHELL > > type PC. With Hong's help, I also got the parallel LU to work > > withSuperLU/MUMPS. My program runs successfully on multiple > > processes on a > > single machine. But when I submit the program over multiple > > machines, I get > > a crash in the PCApply routine after several GMRES iterations. I > > think this > > has to do with using PCSHELL with GMRES (which is not a good idea). Is > > there a different way to implement this ? Does this resemble the usage > > pattern of one of the AMG preconditioners ? > > > > > > Thanks > > > > Rgds, > > Amit > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener From Amit.Itagi at seagate.com Thu Apr 24 11:07:08 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Thu, 24 Apr 2008 12:07:08 -0400 Subject: Multilevel solver In-Reply-To: Message-ID: "Matthew Knepley" To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: Multilevel solver 04/24/2008 10:19 AM Please respond to petsc-users at mcs.a nl.gov > 3) Since I want to set PC type to lu for field 0, and I want to use MUMPS > for parallel LU, where do I set the submatrix type to MATAIJMUMPS ? In this > case, will a second copy of the submatrix be generated - one of type MUMPS > for the PC and the other of the original MATAIJ type for the KSP ? I will have to check. However if we are consistent, then it should be -field_split_0_mat_type aijmumps Can these options be set inside the code, instead of on the command line ? > 4) How is the PC applied when I do PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE ? It is just the composition of the preconditioners, which is what you want here. Matt > Thanks > > Rgds, > Amit > > > > > Barry Smith > ov> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: Multilevel solver > > > 04/22/2008 10:08 > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > Amit, > > Using a a PCSHELL should be fine (it can be used with GMRES), > my guess is there is a memory corruption error somewhere that is > causing the crash. This could be tracked down with www.valgrind.com > > Another way to you could implement this is with some very recent > additions I made to PCFIELDSPLIT that are in petsc-dev > (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) > With this you would chose > PCSetType(pc,PCFIELDSPLIT > PCFieldSplitSetIS(pc,is1 > PCFieldSplitSetIS(pc,is2 > PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > to use LU on A11 use the command line options > -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly > and SOR on A22 > -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - > fieldsplit_1_pc_sor_lits where > is the number of iterations you want to use block A22 > > is1 is the IS that contains the indices for all the vector entries in > the 1 block while is2 is all indices in the > vector for the 2 block. You can use ISCreateGeneral() to create these. > > Probably it is easiest just to try this out. > > Barry > > > On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > > > > Hi, > > > > I am trying to implement a multilevel method for an EM problem. The > > reference is : "Comparison of hierarchical basis functions for > > efficient > > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > > IET > > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > > > Here is the summary: > > > > The matrix equation Ax=b is solved using GMRES with a multilevel > > pre-conditioner. A has a block structure. > > > > A11 A12 * x1 = b1 > > A21 A22 x2 b2 > > > > A11 is mxm and A33 is nxn, where m is not equal to n. > > > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > > superLU or > > MUMPS) > > > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > > a SOR > > solver or a parallel LU) > > > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > > > This gives the approximate solution to > > > > A11 A12 * e1 = b1 > > A21 A22 e2 b2 > > > > and is used as the pre-conditioner for the GMRES. > > > > > > Which PetSc method can implement this pre-conditioner ? 
I tried a > > PCSHELL > > type PC. With Hong's help, I also got the parallel LU to work > > withSuperLU/MUMPS. My program runs successfully on multiple > > processes on a > > single machine. But when I submit the program over multiple > > machines, I get > > a crash in the PCApply routine after several GMRES iterations. I > > think this > > has to do with using PCSHELL with GMRES (which is not a good idea). Is > > there a different way to implement this ? Does this resemble the usage > > pattern of one of the AMG preconditioners ? > > > > > > Thanks > > > > Rgds, > > Amit > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From knepley at gmail.com Thu Apr 24 11:28:42 2008 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 24 Apr 2008 11:28:42 -0500 Subject: Multilevel solver In-Reply-To: References: Message-ID: On Thu, Apr 24, 2008 at 11:07 AM, wrote: > > "Matthew Knepley" > > m> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: Multilevel solver > > > 04/24/2008 10:19 > AM > > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > > > > > > 3) Since I want to set PC type to lu for field 0, and I want to use > MUMPS > > for parallel LU, where do I set the submatrix type to MATAIJMUMPS ? In > this > > case, will a second copy of the submatrix be generated - one of type > MUMPS > > for the PC and the other of the original MATAIJ type for the KSP ? > > I will have to check. However if we are consistent, then it should be > > -field_split_0_mat_type aijmumps > > > Can these options be set inside the code, instead of on the command line ? Yes, however it becomes more complicated. You must pull these objects out of the FieldSplitPC (not that hard), but you must also be careful to set the values after options are read in, but before higher level things are initialized (like the outer solver), so it can be somewhat delicate. I would not advise hardcoding things which you are likely to change based upon architecture, problem, etc. However, if you want to, the easiest way is to use PetscSetOption(). Matt > > 4) How is the PC applied when I do PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > ? > > It is just the composition of the preconditioners, which is what you want > here. > > Matt > > > Thanks > > > > Rgds, > > Amit > > > > > > > > > > Barry Smith > > > ov> > To > > Sent by: petsc-users at mcs.anl.gov > > owner-petsc-users > cc > > @mcs.anl.gov > > No Phone Info > Subject > > Available Re: Multilevel solver > > > > > > 04/22/2008 10:08 > > PM > > > > > > Please respond to > > petsc-users at mcs.a > > nl.gov > > > > > > > > > > > > > > Amit, > > > > Using a a PCSHELL should be fine (it can be used with GMRES), > > my guess is there is a memory corruption error somewhere that is > > causing the crash. 
This could be tracked down with www.valgrind.com > > > > Another way to you could implement this is with some very recent > > additions I made to PCFIELDSPLIT that are in petsc-dev > > (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) > > With this you would chose > > PCSetType(pc,PCFIELDSPLIT > > PCFieldSplitSetIS(pc,is1 > > PCFieldSplitSetIS(pc,is2 > > PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > > to use LU on A11 use the command line options > > -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly > > and SOR on A22 > > -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - > > fieldsplit_1_pc_sor_lits where > > is the number of iterations you want to use block A22 > > > > is1 is the IS that contains the indices for all the vector entries in > > the 1 block while is2 is all indices in the > > vector for the 2 block. You can use ISCreateGeneral() to create these. > > > > Probably it is easiest just to try this out. > > > > Barry > > > > > > On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > > > > > > > Hi, > > > > > > I am trying to implement a multilevel method for an EM problem. The > > > reference is : "Comparison of hierarchical basis functions for > > > efficient > > > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > > > IET > > > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > > > > > Here is the summary: > > > > > > The matrix equation Ax=b is solved using GMRES with a multilevel > > > pre-conditioner. A has a block structure. > > > > > > A11 A12 * x1 = b1 > > > A21 A22 x2 b2 > > > > > > A11 is mxm and A33 is nxn, where m is not equal to n. > > > > > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > > > superLU or > > > MUMPS) > > > > > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > > > a SOR > > > solver or a parallel LU) > > > > > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > > > > > This gives the approximate solution to > > > > > > A11 A12 * e1 = b1 > > > A21 A22 e2 b2 > > > > > > and is used as the pre-conditioner for the GMRES. > > > > > > > > > Which PetSc method can implement this pre-conditioner ? I tried a > > > PCSHELL > > > type PC. With Hong's help, I also got the parallel LU to work > > > withSuperLU/MUMPS. My program runs successfully on multiple > > > processes on a > > > single machine. But when I submit the program over multiple > > > machines, I get > > > a crash in the PCApply routine after several GMRES iterations. I > > > think this > > > has to do with using PCSHELL with GMRES (which is not a good idea). Is > > > there a different way to implement this ? Does this resemble the usage > > > pattern of one of the AMG preconditioners ? > > > > > > > > > Thanks > > > > > > Rgds, > > > Amit > > > > > > > > > > > > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From Amit.Itagi at seagate.com Thu Apr 24 12:01:44 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Thu, 24 Apr 2008 13:01:44 -0400 Subject: Multilevel solver In-Reply-To: Message-ID: Matt, So putting eveything together, I wrote this simple 4x4 example (to be run on two processes). 
========================================================================================= #include #include #include #include "petsc.h" #include "petscmat.h" #include "petscvec.h" #include "petscksp.h" using namespace std; int main( int argc, char *argv[] ) { int rank, size; Mat A; PetscErrorCode ierr; int nrow, ncol, loc; Vec x, b; KSP solver; PC prec; IS is1, is2; PetscScalar val; // Matrix dimensions nrow=4; ncol=4; // Number of non-zeros in each row int d_nnz1[2], d_nnz2[2], o_nnz1[2],o_nnz2[2]; d_nnz1[0]=2; o_nnz1[0]=2; d_nnz1[1]=2; o_nnz1[1]=2; d_nnz2[0]=2; o_nnz2[0]=2; d_nnz2[1]=2; o_nnz2[1]=2; ierr=PetscInitialize(&argc,&argv,PETSC_NULL,PETSC_NULL); CHKERRQ(ierr); ierr=MPI_Comm_size(PETSC_COMM_WORLD,&size); CHKERRQ(ierr); ierr=MPI_Comm_rank(PETSC_COMM_WORLD,&rank); CHKERRQ(ierr); // Matrix assembly if(rank==0) { MatCreateMPIAIJ(PETSC_COMM_WORLD,2,2,4,4,0,d_nnz1,0,o_nnz1,&A); val=complex(2.0,3.0); ierr=MatSetValue(A,0,0,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(5.0,-1.0); ierr=MatSetValue(A,0,1,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,2.0); ierr=MatSetValue(A,0,2,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,-1.0); ierr=MatSetValue(A,0,3,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(5.0,-1.0); ierr=MatSetValue(A,1,0,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(2.0,0.0); ierr=MatSetValue(A,1,1,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(3.0,0.0); ierr=MatSetValue(A,1,2,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,0.0); ierr=MatSetValue(A,1,3,val,INSERT_VALUES);CHKERRQ(ierr); } else if(rank==1) { MatCreateMPIAIJ(PETSC_COMM_WORLD,2,2,4,4,0,d_nnz2,0,o_nnz2,&A); val=complex(1.0,2.0); ierr=MatSetValue(A,2,0,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(3.0,0.0); ierr=MatSetValue(A,2,1,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(0.0,2.0); ierr=MatSetValue(A,2,2,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,0.0); ierr=MatSetValue(A,2,3,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,-1.0); ierr=MatSetValue(A,3,0,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,0.0); ierr=MatSetValue(A,3,1,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,0.0); ierr=MatSetValue(A,3,2,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(2.0,0.0); ierr=MatSetValue(A,3,3,val,INSERT_VALUES);CHKERRQ(ierr); } else { MatCreateMPIAIJ(PETSC_COMM_WORLD,0,0,4,4,0,PETSC_NULL,0,PETSC_NULL,&A); } ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY); CHKERRQ(ierr); ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY); CHKERRQ(ierr); ierr=PetscSynchronizedPrintf(PETSC_COMM_WORLD,"Defined matrix\n"); CHKERRQ(ierr); ierr=PetscSynchronizedFlush(PETSC_COMM_WORLD); CHKERRQ(ierr); ierr=MatView(A,PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr); // Vector assembly // Allocate memory for the vectors if(rank==0) { ierr=VecCreateMPI(PETSC_COMM_WORLD,2,4,&x); CHKERRQ(ierr); val=complex(1.0,0.0); loc=0; ierr=VecSetValues(x,1,&loc,&val,INSERT_VALUES);CHKERRQ(ierr); val=complex(-1.0,0.0); loc=1; ierr=VecSetValues(x,1,&loc,&val,INSERT_VALUES);CHKERRQ(ierr); } else if(rank==1) { ierr=VecCreateMPI(PETSC_COMM_WORLD,2,4,&x); CHKERRQ(ierr); val=complex(1.0,1.0); loc=2; ierr=VecSetValues(x,1,&loc,&val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,0.0); loc=3; ierr=VecSetValues(x,1,&loc,&val,INSERT_VALUES);CHKERRQ(ierr); } else { ierr=VecCreateMPI(PETSC_COMM_WORLD,0,4,&x); CHKERRQ(ierr); } ierr=VecAssemblyBegin(x); CHKERRQ(ierr); ierr=VecAssemblyEnd(x); CHKERRQ(ierr); ierr=PetscSynchronizedPrintf(PETSC_COMM_WORLD,"Defined vector\n"); CHKERRQ(ierr); ierr=PetscSynchronizedFlush(PETSC_COMM_WORLD); CHKERRQ(ierr); 
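// (Remainder of the example: view the vector x, build the two index sets, set up
//  GMRES with a PCFIELDSPLIT preconditioner, solve, view the result, and clean up.)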
ierr=VecView(x,PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr); // Storage for the solution VecDuplicate(x,&b); // Create the Field Split index sets PetscInt idxA[2], idxB[2]; idxA[0]=0; idxA[1]=1; idxB[0]=2; idxB[1]=3; ierr=ISCreateGeneral(PETSC_COMM_WORLD,2,idxA,&is1); CHKERRQ(ierr); ierr=ISCreateGeneral(PETSC_COMM_WORLD,2,idxB,&is2); CHKERRQ(ierr); // Krylov Solver ierr=KSPCreate(PETSC_COMM_WORLD,&solver); CHKERRQ(ierr); ierr=KSPSetOperators(solver,A,A,SAME_NONZERO_PATTERN); CHKERRQ(ierr); ierr=KSPSetType(solver,KSPGMRES); CHKERRQ(ierr); // Pre-conditioner ierr=KSPGetPC(solver,&prec); CHKERRQ(ierr); ierr=PCSetType(prec,PCFIELDSPLIT); CHKERRQ(ierr); ierr=PCFieldSplitSetIS(prec,is1); CHKERRQ(ierr); ierr=PCFieldSplitSetIS(prec,is2); CHKERRQ(ierr); ierr=PCFieldSplitSetType(prec,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE); CHKERRQ(ierr); ierr=PCSetFromOptions(prec); CHKERRQ(ierr); ierr=KSPSetFromOptions(solver); CHKERRQ(ierr); ierr=PetscSynchronizedPrintf(PETSC_COMM_WORLD,"Set KSP/PC options\n"); CHKERRQ(ierr); ierr=PetscSynchronizedFlush(PETSC_COMM_WORLD); CHKERRQ(ierr); ierr=PetscSynchronizedPrintf(PETSC_COMM_WORLD,"Solving the equation\n"); CHKERRQ(ierr); ierr=PetscSynchronizedFlush(PETSC_COMM_WORLD); CHKERRQ(ierr); ierr=KSPSolve(solver,x,b); CHKERRQ(ierr); ierr=PetscSynchronizedPrintf(PETSC_COMM_WORLD,"Solving over\n"); CHKERRQ(ierr); ierr=PetscSynchronizedFlush(PETSC_COMM_WORLD); CHKERRQ(ierr); ierr=PetscSynchronizedPrintf(PETSC_COMM_WORLD,"\nThe solution\n"); CHKERRQ(ierr); ierr=PetscSynchronizedFlush(PETSC_COMM_WORLD); CHKERRQ(ierr); ierr=VecView(b,PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr); // Clean up ierr=VecDestroy(x); CHKERRQ(ierr); ierr=VecDestroy(b); CHKERRQ(ierr); ierr=MatDestroy(A); CHKERRQ(ierr); ierr=KSPDestroy(solver); CHKERRQ(ierr); ierr=ISDestroy(is1); CHKERRQ(ierr); ierr=ISDestroy(is2); CHKERRQ(ierr); ierr=PetscFinalize(); CHKERRQ(ierr); // Finalize return 0; } ============================================================================================== I run the program with mpiexec -np 2 ./main -fieldsplit_0_pc_type sor -fieldsplit_0_ksp_type_preonly -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type_preonly > &err In the KSPSolve step, I get an error. Here is the output of my run. ================================================================================================= Defined matrix Defined matrix row 0: (0, 2 + 3 i) (1, 5 - 1 i) (2, 1 + 2 i) (3, 1 - 1 i) row 1: (0, 5 - 1 i) (1, 2) (2, 3) (3, 1) row 2: (0, 1 + 2 i) (1, 3) (2, 0 + 2 i) (3, 1) row 3: (0, 1 - 1 i) (1, 1) (2, 1) (3, 2) Defined vector Defined vector Process [0] 1 -1 Process [1] 1 + 1 i 1 Set KSP/PC options Set KSP/PC options Solving the equation Solving the equation [1]PETSC ERROR: --------------------- Error Message ------------------------------------ [1]PETSC ERROR: Nonconforming object sizes! [1]PETSC ERROR: Local column sizes 0 do not add up to total number of columns 4! [1]PETSC ERROR: ------------------------------------------------------------------------ [1]PETSC ERROR: Petsc Development Version 2.3.3, Patch 12, unknown HG revision: unknown [1]PETSC ERROR: See docs/changes/index.html for recent updates. [1]PETSC ERROR: See docs/faq.html for hints about trouble shooting. [1]PETSC ERROR: See docs/index.html for manual pages. 
[1]PETSC ERROR: ------------------------------------------------------------------------ [1]PETSC ERROR: ./main on a linux-gnu named tabla by amit Thu Apr 24 13:04:57 2008 [1]PETSC ERROR: Libraries linked from /home/amit/programs/ParEM/petsc-dev/lib [1]PETSC ERROR: Configure run at Wed Apr 23 22:02:21 2008 [1]PETSC ERROR: Configure options --with-scalar-type=complex --with-debugging=yes --with-clanguage=cxx --with-mpi=1 --download-mpich=1 --with-metis=1 --download-metis=1 --with-parmetis=1 --download-parmetis=1 --with-superlu_dist=1 --download-superlu_dist=1 --with-mumps=1 --download-blacs=1 --download-scalapack=1 --download-mumps=1 --with-shared=0 [1]PETSC ERROR: ------------------------------------------------------------------------ [1]PETSC ERROR: MatGetSubMatrix_MPIAIJ() line 2974 in src/mat/impls/aij/mpi/mpiaij.c [1]PETSC ERROR: MatGetSubMatrix() line 5956 in src/mat/interface/matrix.c [1]PETSC ERROR: PCSetUp_FieldSplit() line 177 in src/ksp/pc/impls/fieldsplit/fieldsplit.c [1]PETSC ERROR: PCSetUp() line 788 in src/ksp/pc/interface/precon.c [1]PETSC ERROR: KSPSetUp() line 234 in src/ksp/ksp/interface/itfunc.c [1]PETSC ERROR: KSPSolve() line 350 in src/ksp/ksp/interface/itfunc.c [1]PETSC ERROR: User provided function() line 175 in main.cpp [0]PETSC ERROR: ------------------------------------------------------------------------ [0]PETSC ERROR: Caught signal number 13 Broken Pipe: Likely while reading or writing to a socket [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[0]PETSC ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to find memory corruption errors [0]PETSC ERROR: likely location of problem given in stack below [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------ [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available, [0]PETSC ERROR: INSTEAD the line number of the start of the function [0]PETSC ERROR: is given. [0]PETSC ERROR: [0] MatSetType line 45 src/mat/interface/matreg.c [0]PETSC ERROR: [0] PCSetUp line 765 src/ksp/pc/interface/precon.c [0]PETSC ERROR: [0] KSPSetUp line 183 src/ksp/ksp/interface/itfunc.c [0]PETSC ERROR: [0] KSPSolve line 305 src/ksp/ksp/interface/itfunc.c [0]PETSC ERROR: --------------------- Error Message ------------------------------------ [0]PETSC ERROR: Signal received! [0]PETSC ERROR: ------------------------------------------------------------------------ [0]PETSC ERROR: Petsc Development Version 2.3.3, Patch 12, unknown HG revision: unknown [0]PETSC ERROR: See docs/changes/index.html for recent updates. [0]PETSC ERROR: See docs/faq.html for hints about trouble shooting. [0]PETSC ERROR: See docs/index.html for manual pages. 
[0]PETSC ERROR: ------------------------------------------------------------------------ [0]PETSC ERROR: ./main on a linux-gnu named tabla by amit Thu Apr 24 13:04:57 2008 [0]PETSC ERROR: Libraries linked from /home/amit/programs/ParEM/petsc-dev/lib [0]PETSC ERROR: Configure run at Wed Apr 23 22:02:21 2008 [0]PETSC ERROR: Configure options --with-scalar-type=complex --with-debugging=yes --with-clanguage=cxx --with-mpi=1 --download-mpich=1 --with-metis=1 --download-metis=1 --with-parmetis=1 --download-parmetis=1 --with-superlu_dist=1 --download-superlu_dist=1 --with-mumps=1 --download-blacs=1 --download-scalapack=1 --download-mumps=1 --with-shared=0 [0]PETSC ERROR: ------------------------------------------------------------------------ [0]PETSC ERROR: User provided function() line 0 in unknown directory unknown file application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0[cli_0]: aborting job: application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0 What am I doing wrong ? Thanks Rgds, Amit "Matthew Knepley" To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: Multilevel solver 04/24/2008 12:28 PM Please respond to petsc-users at mcs.a nl.gov On Thu, Apr 24, 2008 at 11:07 AM, wrote: > > "Matthew Knepley" > > m> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: Multilevel solver > > > 04/24/2008 10:19 > AM > > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > > > > > > 3) Since I want to set PC type to lu for field 0, and I want to use > MUMPS > > for parallel LU, where do I set the submatrix type to MATAIJMUMPS ? In > this > > case, will a second copy of the submatrix be generated - one of type > MUMPS > > for the PC and the other of the original MATAIJ type for the KSP ? > > I will have to check. However if we are consistent, then it should be > > -field_split_0_mat_type aijmumps > > > Can these options be set inside the code, instead of on the command line ? Yes, however it becomes more complicated. You must pull these objects out of the FieldSplitPC (not that hard), but you must also be careful to set the values after options are read in, but before higher level things are initialized (like the outer solver), so it can be somewhat delicate. I would not advise hardcoding things which you are likely to change based upon architecture, problem, etc. However, if you want to, the easiest way is to use PetscSetOption(). Matt > > 4) How is the PC applied when I do PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > ? > > It is just the composition of the preconditioners, which is what you want > here. > > Matt > > > Thanks > > > > Rgds, > > Amit > > > > > > > > > > Barry Smith > > > ov> > To > > Sent by: petsc-users at mcs.anl.gov > > owner-petsc-users > cc > > @mcs.anl.gov > > No Phone Info > Subject > > Available Re: Multilevel solver > > > > > > 04/22/2008 10:08 > > PM > > > > > > Please respond to > > petsc-users at mcs.a > > nl.gov > > > > > > > > > > > > > > Amit, > > > > Using a a PCSHELL should be fine (it can be used with GMRES), > > my guess is there is a memory corruption error somewhere that is > > causing the crash. 
This could be tracked down with www.valgrind.com > > > > Another way to you could implement this is with some very recent > > additions I made to PCFIELDSPLIT that are in petsc-dev > > (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) > > With this you would chose > > PCSetType(pc,PCFIELDSPLIT > > PCFieldSplitSetIS(pc,is1 > > PCFieldSplitSetIS(pc,is2 > > PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > > to use LU on A11 use the command line options > > -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly > > and SOR on A22 > > -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - > > fieldsplit_1_pc_sor_lits where > > is the number of iterations you want to use block A22 > > > > is1 is the IS that contains the indices for all the vector entries in > > the 1 block while is2 is all indices in the > > vector for the 2 block. You can use ISCreateGeneral() to create these. > > > > Probably it is easiest just to try this out. > > > > Barry > > > > > > On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > > > > > > > Hi, > > > > > > I am trying to implement a multilevel method for an EM problem. The > > > reference is : "Comparison of hierarchical basis functions for > > > efficient > > > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > > > IET > > > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > > > > > Here is the summary: > > > > > > The matrix equation Ax=b is solved using GMRES with a multilevel > > > pre-conditioner. A has a block structure. > > > > > > A11 A12 * x1 = b1 > > > A21 A22 x2 b2 > > > > > > A11 is mxm and A33 is nxn, where m is not equal to n. > > > > > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > > > superLU or > > > MUMPS) > > > > > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > > > a SOR > > > solver or a parallel LU) > > > > > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > > > > > This gives the approximate solution to > > > > > > A11 A12 * e1 = b1 > > > A21 A22 e2 b2 > > > > > > and is used as the pre-conditioner for the GMRES. > > > > > > > > > Which PetSc method can implement this pre-conditioner ? I tried a > > > PCSHELL > > > type PC. With Hong's help, I also got the parallel LU to work > > > withSuperLU/MUMPS. My program runs successfully on multiple > > > processes on a > > > single machine. But when I submit the program over multiple > > > machines, I get > > > a crash in the PCApply routine after several GMRES iterations. I > > > think this > > > has to do with using PCSHELL with GMRES (which is not a good idea). Is > > > there a different way to implement this ? Does this resemble the usage > > > pattern of one of the AMG preconditioners ? > > > > > > > > > Thanks > > > > > > Rgds, > > > Amit > > > > > > > > > > > > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From bsmith at mcs.anl.gov Thu Apr 24 12:13:28 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Thu, 24 Apr 2008 12:13:28 -0500 Subject: Multilevel solver In-Reply-To: References: Message-ID: On Apr 24, 2008, at 8:58 AM, Amit.Itagi at seagate.com wrote: > Barry, > > I have been trying out the PCFIELDSPLIT. I have not yet gotten it to > work. 
> I have some follow up questions which might help solve my problem. > > Consider the simple case of a 4x4 matrix equation being solved on two > processes. I have vector elements 0 and 1 belonging to rank 0, and > elements > 2 and 3 belonging to rank 1. > > 1) For my example, can the index sets have staggered indices i.e. > is1-> 0,2 > and is2->1,3 (each is spans across ranks) ? > > 2) When I provide the -field_split__pc_type option on the command > line, ^^^^^ There is no underscore here because the PC name is fieldsplit and we never split the names into pieces. > > is the index in the same order that the PCFieldSplitSetIS function > called in ? > So if I have PCFieldSplitSetIS(pc,is2) before > PCFieldSplitSetIS(pc,is1), will -field_split_0_... correspond to is2 > and > -field_split_1_... to is1 ? You can list them in any order you want but the order determines how the multiplicative versions are applied. They are applied always started from zero (the first one you put in). > > > 3) Since I want to set PC type to lu for field 0, and I want to use > MUMPS > for parallel LU, where do I set the submatrix type to MATAIJMUMPS ? > In this > case, will a second copy of the submatrix be generated - one of type > MUMPS > for the PC and the other of the original MATAIJ type for the KSP ? This issue will be fixed in a few days. I think you need to start with an entire matrix that is aijmumps and then the subs will also be. Barry > > > 4) How is the PC applied when I do > PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE ? > > Thanks > > Rgds, > Amit > > > > > Barry Smith > > ov> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc- > users cc > @mcs.anl.gov > No Phone Info > Subject > Available Re: Multilevel solver > > > 04/22/2008 10:08 > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > Amit, > > Using a a PCSHELL should be fine (it can be used with GMRES), > my guess is there is a memory corruption error somewhere that is > causing the crash. This could be tracked down with www.valgrind.com > > Another way to you could implement this is with some very recent > additions I made to PCFIELDSPLIT that are in petsc-dev > (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) > With this you would chose > PCSetType(pc,PCFIELDSPLIT > PCFieldSplitSetIS(pc,is1 > PCFieldSplitSetIS(pc,is2 > PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > to use LU on A11 use the command line options > -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly > and SOR on A22 > -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - > fieldsplit_1_pc_sor_lits where > is the number of iterations you want to use block A22 > > is1 is the IS that contains the indices for all the vector entries in > the 1 block while is2 is all indices in the > vector for the 2 block. You can use ISCreateGeneral() to create these. > > Probably it is easiest just to try this out. > > Barry > > > On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > >> >> Hi, >> >> I am trying to implement a multilevel method for an EM problem. The >> reference is : "Comparison of hierarchical basis functions for >> efficient >> multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, >> IET >> Sci. Meas. Technol. 2007, 1(1), pp 48-52. >> >> Here is the summary: >> >> The matrix equation Ax=b is solved using GMRES with a multilevel >> pre-conditioner. A has a block structure. >> >> A11 A12 * x1 = b1 >> A21 A22 x2 b2 >> >> A11 is mxm and A33 is nxn, where m is not equal to n. 
>> >> Step 1 : Solve A11 * e1 = b1 (parallel LU using >> superLU or >> MUMPS) >> >> Step 2: Solve A22 * e2 =b2-A21*e1 (might either user >> a SOR >> solver or a parallel LU) >> >> Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) >> >> This gives the approximate solution to >> >> A11 A12 * e1 = b1 >> A21 A22 e2 b2 >> >> and is used as the pre-conditioner for the GMRES. >> >> >> Which PetSc method can implement this pre-conditioner ? I tried a >> PCSHELL >> type PC. With Hong's help, I also got the parallel LU to work >> withSuperLU/MUMPS. My program runs successfully on multiple >> processes on a >> single machine. But when I submit the program over multiple >> machines, I get >> a crash in the PCApply routine after several GMRES iterations. I >> think this >> has to do with using PCSHELL with GMRES (which is not a good idea). >> Is >> there a different way to implement this ? Does this resemble the >> usage >> pattern of one of the AMG preconditioners ? >> >> >> Thanks >> >> Rgds, >> Amit >> > > > From bsmith at mcs.anl.gov Thu Apr 24 12:19:58 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Thu, 24 Apr 2008 12:19:58 -0500 Subject: Using PETSc libraries with MS Compute cluster and MS MPI In-Reply-To: <48104EBD.3080104@gmail.com> References: <48104EBD.3080104@gmail.com> Message-ID: Ben, This is an error in our config/configure.py model (and autoconf's as well) that does not properly do the library checks under certain uncommon circumstances, it is not so easy for us to fix since we do not have access to the Microsoft cluster environment. Barry On Apr 24, 2008, at 4:11 AM, Ben Tay wrote: > Hi, > > I'm trying to run my mpi code on the MS Compute cluster which my > school just installed. Unfortunately, it failed without giving any > error msg. I am using just a test example ex2f. > > I read in the MS website that there is no need to use MS MPI to > compile the code or library. > > Anyway, I also tried to compile PETSc with MS MPI but I'm not able > to get pass ./configure. It always complains that there is something > wrong with the MS MPI. > > Is there anyone who has experience in these? > > Thank you very much. > > Regards. > From tribur at vision.ee.ethz.ch Thu Apr 24 16:32:08 2008 From: tribur at vision.ee.ethz.ch (tribur at vision.ee.ethz.ch) Date: Thu, 24 Apr 2008 23:32:08 +0200 Subject: Schur system + MatShell Message-ID: <20080424233208.9m3yc35qg40s0kgk@email.ee.ethz.ch> Dear, > On Tue, 22 Apr 2008, Matthew Knepley wrote: >> Did you verify that the Schur complement matrix was properly >> preallocated before >> assembly? This is the likely source of time. You can run with -info >> and search >> for "malloc" in the output. Preallocation doesn't make sense in case of MATDENSE, does it? > Isn't this using MATDENSE? If that the case - then I think the problem > is due to wrong partitioning - causing communiation during > MatAssembly(). > > -info should clearly show the communication part aswell. > > The fix would be to specify the local partition sizes for this matrix > - and not use PETSC_DECIDE. > > Satish Hm, I think communication during MatAssembly() is necessary, because the global Schur complement is obtained by summing up elements of the local ones. This also means that the sum of the sizes of the local complements is greater than the size of the global Schur complement. 
Therefore, I can not specify the local partition sizes according to the real sizes of the local Schur complements, otherwise the global size was an unrealistic number (in PETSc the global size is ALWAYS the sum of the local ones, isn't it?). Do you know what I mean? Is there another possibility of partitioning? Anyway, I got the thing in MATSHELL-format running, and it's really much faster: In an unstructured mesh of 321493 nodes, partitioned into 7 subdomains with 25577 interface nodes (= size of global Schur complement), e.g., the solving of the Schur complement takes now 3 min instead of 38 min for the assembling+solving using MATDENSE. Thank you again for your help and attention, Kathrin From mossaiby at yahoo.com Thu Apr 24 16:29:01 2008 From: mossaiby at yahoo.com (Farshid Mossaiby) Date: Thu, 24 Apr 2008 14:29:01 -0700 (PDT) Subject: Using PETSc libraries with MS Compute cluster and MS MPI In-Reply-To: Message-ID: <292717.24603.qm@web52209.mail.re2.yahoo.com> Ben, Have you used headers and libraries from Compute Cluster SDK for your configure? I doubt you can run programs compiled with MPICH, for example, on WCCS. I am going to try this in the next step of my work, so please share your findings. Regards, Farshid Mossaiby --- Barry Smith wrote: > > Ben, > > This is an error in our config/configure.py > model (and > autoconf's as well) that does not properly > do the library checks under certain uncommon > circumstances, it is not > so easy for us to fix since we > do not have access to the Microsoft cluster > environment. > > Barry > > On Apr 24, 2008, at 4:11 AM, Ben Tay wrote: > > > Hi, > > > > I'm trying to run my mpi code on the MS Compute > cluster which my > > school just installed. Unfortunately, it failed > without giving any > > error msg. I am using just a test example ex2f. > > > > I read in the MS website that there is no need to > use MS MPI to > > compile the code or library. > > > > Anyway, I also tried to compile PETSc with MS MPI > but I'm not able > > to get pass ./configure. It always complains that > there is something > > wrong with the MS MPI. > > > > Is there anyone who has experience in these? > > > > Thank you very much. > > > > Regards. > > > > ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From recrusader at gmail.com Sun Apr 27 17:19:56 2008 From: recrusader at gmail.com (Yujie) Date: Sun, 27 Apr 2008 15:19:56 -0700 Subject: how to get the explicit matrix of the preconditioner Message-ID: <7ff0ee010804271519p1413701ej2509c14c8143228d@mail.gmail.com> Hi, everyone How to get the matrix "M_{L}^{-1}" in "M_{L}^{-1}Ax=M_{L}^{-1}b". I have checked the function "PCGetOperators()". It only gets the matrix "pmat" used to obtain the preconditioning matrix. thanks a lot. Regards, Yujie -------------- next part -------------- An HTML attachment was scrubbed... URL: From mossaiby at yahoo.com Sun Apr 27 05:01:10 2008 From: mossaiby at yahoo.com (Farshid Mossaiby) Date: Sun, 27 Apr 2008 03:01:10 -0700 (PDT) Subject: Compiling PETSc with Visual Studio 2008 In-Reply-To: Message-ID: <30962.99968.qm@web52206.mail.re2.yahoo.com> Hi, Configure says it cannot make ParMetis with the option --download-parmetis. Is this related to Visual Studio 2008 compiler I use, or something else is wrong? 
Here is the message: Error running make on ParMetis: Could not execute 'cd /home/Administrator/petsc-2.3.3-p12/externalpackages/ParMetis-dev; make clean; make lib; make minstall; make clean': make: *** No rule to make target `clean'. Stop. make: *** No rule to make target `lib'. Stop. make: *** No rule to make target `minstall'. Stop. make: *** No rule to make target `clean'. Stop. ********************************************************************************* Regards, Farshid Mossaiby P.S. I hope this is appropriate place to ask this. ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From dave.mayhem23 at gmail.com Mon Apr 28 02:11:48 2008 From: dave.mayhem23 at gmail.com (Dave May) Date: Mon, 28 Apr 2008 17:11:48 +1000 Subject: how to get the explicit matrix of the preconditioner In-Reply-To: <7ff0ee010804271519p1413701ej2509c14c8143228d@mail.gmail.com> References: <7ff0ee010804271519p1413701ej2509c14c8143228d@mail.gmail.com> Message-ID: <956373f0804280011g251ce489l94b6a51fec88d14e@mail.gmail.com> Hi, You can use PetscErrorCode PCComputeExplicitOperator(PC pc,Mat *mat) See, http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/PC/PCComputeExplicitOperator.html On Mon, Apr 28, 2008 at 8:19 AM, Yujie wrote: > Hi, everyone > > How to get the matrix "M_{L}^{-1}" in "M_{L}^{-1}Ax=M_{L}^{-1}b". I have > checked the function "PCGetOperators()". It only gets the matrix "pmat" used > to obtain the preconditioning matrix. thanks a lot. > > Regards, > Yujie > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Mon Apr 28 07:39:23 2008 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 28 Apr 2008 07:39:23 -0500 Subject: Compiling PETSc with Visual Studio 2008 In-Reply-To: <30962.99968.qm@web52206.mail.re2.yahoo.com> References: <30962.99968.qm@web52206.mail.re2.yahoo.com> Message-ID: On Sun, Apr 27, 2008 at 5:01 AM, Farshid Mossaiby wrote: > Hi, > > Configure says it cannot make ParMetis with the option > --download-parmetis. Is this related to Visual Studio > 2008 compiler I use, or something else is wrong? Here > is the message: > > Error running make on ParMetis: Could not execute 'cd > /home/Administrator/petsc-2.3.3-p12/externalpackages/ParMetis-dev; > make clean; make lib; make minstall; make clean': > make: *** No rule to make target `clean'. Stop. > make: *** No rule to make target `lib'. Stop. > make: *** No rule to make target `minstall'. Stop. > make: *** No rule to make target `clean'. Stop. > ********************************************************************************* > > Regards, > Farshid Mossaiby > > P.S. I hope this is appropriate place to ask this. 1) No, this belongs on petsc-maint 2) We cannot tell anything without the configure.log file Matt > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener From recrusader at gmail.com Mon Apr 28 10:48:05 2008 From: recrusader at gmail.com (Yujie) Date: Mon, 28 Apr 2008 08:48:05 -0700 Subject: how to get the explicit matrix of the preconditioner In-Reply-To: <956373f0804280011g251ce489l94b6a51fec88d14e@mail.gmail.com> References: <7ff0ee010804271519p1413701ej2509c14c8143228d@mail.gmail.com> <956373f0804280011g251ce489l94b6a51fec88d14e@mail.gmail.com> Message-ID: <7ff0ee010804280848n39539f8axeab203197d9bb2d4@mail.gmail.com> Thank you, Dave. I am wondering whether this function is ok to external packages, such as Hypre? thanks a lot. Regards, Yujie On 4/28/08, Dave May wrote: > > Hi, > You can use PetscErrorCode PCComputeExplicitOperator(PC pc,Mat *mat) > > See, > > http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/PC/PCComputeExplicitOperator.html > > > > On Mon, Apr 28, 2008 at 8:19 AM, Yujie wrote: > > > Hi, everyone > > > > How to get the matrix "M_{L}^{-1}" in "M_{L}^{-1}Ax=M_{L}^{-1}b". I have > > checked the function "PCGetOperators()". It only gets the matrix "pmat" used > > to obtain the preconditioning matrix. thanks a lot. > > > > Regards, > > Yujie > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Mon Apr 28 11:09:25 2008 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 28 Apr 2008 11:09:25 -0500 Subject: how to get the explicit matrix of the preconditioner In-Reply-To: <7ff0ee010804280848n39539f8axeab203197d9bb2d4@mail.gmail.com> References: <7ff0ee010804271519p1413701ej2509c14c8143228d@mail.gmail.com> <956373f0804280011g251ce489l94b6a51fec88d14e@mail.gmail.com> <7ff0ee010804280848n39539f8axeab203197d9bb2d4@mail.gmail.com> Message-ID: On Mon, Apr 28, 2008 at 10:48 AM, Yujie wrote: > Thank you, Dave. I am wondering whether this function is ok to external > packages, such as Hypre? thanks a lot. Yes, since it just calls PCApply() for each basis vector. Matt > Regards, > Yujie > > On 4/28/08, Dave May wrote: > > Hi, > > You can use PetscErrorCode PCComputeExplicitOperator(PC pc,Mat *mat) > > > > See, > > > http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/PC/PCComputeExplicitOperator.html > > > > > > > > > > > > On Mon, Apr 28, 2008 at 8:19 AM, Yujie wrote: > > > > > Hi, everyone > > > > > > How to get the matrix "M_{L}^{-1}" in "M_{L}^{-1}Ax=M_{L}^{-1}b". I have > checked the function "PCGetOperators()". It only gets the matrix "pmat" used > to obtain the preconditioning matrix. thanks a lot. > > > > > > Regards, > > > Yujie > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From Amit.Itagi at seagate.com Mon Apr 28 13:15:34 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Mon, 28 Apr 2008 14:15:34 -0400 Subject: Shared libraries Message-ID: Hi, I am trying to recompile my PetSc installation to have "--with-shared=1". Also, I am specifying "--with-blas-lapack-dir=...". The library building goes through ok. However, in the last step of generating the shared libraries, I get an error about the lapack library. This step is looking for liblapack in /usr/local/lib and that lib was not compiled with -fPIC. I want the program to use the one that I specified in "--with-blas-lapack-dir=...". Which option (in which file) do I need to tweak ? This is in petsc-2.3.3-p8. 
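For reference, the configure invocation is essentially the following (the
blas/lapack directory shown here is only a placeholder for the actual
install location):

    ./config/configure.py --with-shared=1 \
        --with-blas-lapack-dir=/path/to/blas-lapack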
Thanks Rgds, Amit From balay at mcs.anl.gov Mon Apr 28 13:34:23 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 28 Apr 2008 13:34:23 -0500 (CDT) Subject: Shared libraries In-Reply-To: References: Message-ID: On Mon, 28 Apr 2008, Amit.Itagi at seagate.com wrote: > > Hi, > > I am trying to recompile my PetSc installation to have "--with-shared=1". > Also, I am specifying "--with-blas-lapack-dir=...". The library building > goes through ok. However, in the last step of generating the shared > libraries, I get an error about the lapack library. This step is looking > for liblapack in /usr/local/lib and that lib was not compiled with -fPIC. I > want the program to use the one that I specified in > "--with-blas-lapack-dir=...". Which option (in which file) do I need to > tweak ? So the primary question is: you are specifying --with-blas-lapack-dir - but that version of blas-lapack is not picked up by configure. However its going ahead and using blas from /usr/local/lib - which is not what you want? [because its not compiled with -fPIC] Please send the corresponding configure.log to petsc-maint at mcs.anl.gov Satish From balay at mcs.anl.gov Mon Apr 28 13:48:59 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 28 Apr 2008 13:48:59 -0500 (CDT) Subject: Compiling PETSc with Visual Studio 2008 In-Reply-To: <30962.99968.qm@web52206.mail.re2.yahoo.com> References: <30962.99968.qm@web52206.mail.re2.yahoo.com> Message-ID: On Sun, 27 Apr 2008, Farshid Mossaiby wrote: > Hi, > > Configure says it cannot make ParMetis with the option > --download-parmetis. Is this related to Visual Studio > 2008 compiler I use, or something else is wrong? Here > is the message: Most externalpackages are never tested by their original authors with MS compilers. And we have not tried porting them to them. So most of them won't compile - hence --download-packagename might not work. BTW: Currently my test windows box is down - so I can't check if this is supporsed to work with MS compilers. > Error running make on ParMetis: Could not execute 'cd > /home/Administrator/petsc-2.3.3-p12/externalpackages/ParMetis-dev; > make clean; make lib; make minstall; make clean': > make: *** No rule to make target `clean'. Stop. > make: *** No rule to make target `lib'. Stop. > make: *** No rule to make target `minstall'. Stop. > make: *** No rule to make target `clean'. Stop. > ********************************************************************************* > > Regards, > Farshid Mossaiby > > P.S. I hope this is appropriate place to ask this. The appropriate thing is to send us the relavent log files [configure.log etc] - and these can't be sent to petsc-user list [we don't want to flood petsc-users subscribers with multi-megabyte logfiles]. So the appropriate place is petsc-maint - with the relavent logfiles. Satish From balay at mcs.anl.gov Mon Apr 28 22:13:02 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 28 Apr 2008 22:13:02 -0500 (CDT) Subject: Compiling PETSc with Visual Studio 2008 In-Reply-To: References: <30962.99968.qm@web52206.mail.re2.yahoo.com> Message-ID: On Mon, 28 Apr 2008, Satish Balay wrote: > > On Sun, 27 Apr 2008, Farshid Mossaiby wrote: > > Configure says it cannot make ParMetis with the option > > --download-parmetis. Is this related to Visual Studio 2008 > > compiler I use, or something else is wrong? Here is the message: > > Most externalpackages are never tested by their original authors with > MS compilers. And we have not tried porting them to them. 
> > So most of them won't compile - hence --download-packagename might not > work. > > BTW: Currently my test windows box is down - so I can't check if this > is supporsed to work with MS compilers. Looks like parmetis does compile with MS compilers. Please try try the attached patch. cd petsc-2.3.3 patch -Np1 < parmetis-win.patch rm -rf externalpackage/ParMetis* ./config/configure.py ..... This fix will be in petsc-dev. Satish -------------- next part -------------- diff -r 4041e3152979 python/PETSc/packages/ParMetis.py --- a/python/PETSc/packages/ParMetis.py Mon Apr 21 11:20:42 2008 -0500 +++ b/python/PETSc/packages/ParMetis.py Mon Apr 28 21:15:01 2008 -0500 @@ -5,7 +5,7 @@ class Configure(PETSc.package.Package): def __init__(self, framework): PETSc.package.Package.__init__(self, framework) - self.download = ['hg://petsc.cs.iit.edu/petsc/ParMetis-dev','ftp://ftp.mcs.anl.gov/pub/petsc/externalpackages/ParMetis-dev-p1.tar.gz'] + self.download = ['hg://petsc.cs.iit.edu/petsc/ParMetis-dev','ftp://ftp.mcs.anl.gov/pub/petsc/externalpackages/ParMetis-dev-p2.tar.gz'] self.functions = ['ParMETIS_V3_PartKway'] self.includes = ['parmetis.h'] self.liblist = [['libparmetis.a','libmetis.a']] @@ -27,8 +27,9 @@ installDir = os.path.join(parmetisDir, self.arch.arch) makeinc = os.path.join(parmetisDir,'make.inc') installmakeinc = os.path.join(installDir,'make.inc') - configheader = os.path.join(parmetisDir,'ParMETISLib','configureheader.h') - + metisconfigheader = os.path.join(parmetisDir,'METISLib','configureheader.h') + parmetisconfigheader = os.path.join(parmetisDir,'ParMETISLib','configureheader.h') + # Configure ParMetis if os.path.isfile(makeinc): os.unlink(makeinc) @@ -63,7 +64,8 @@ if not os.path.isfile(installmakeinc) or not (self.getChecksum(installmakeinc) == self.getChecksum(makeinc)): self.framework.log.write('Have to rebuild ParMetis, make.inc != '+installmakeinc+'\n') - self.framework.outputHeader(configheader) + self.framework.outputHeader(metisconfigheader) + self.framework.outputHeader(parmetisconfigheader) try: self.logPrintBox('Compiling & installing Parmetis; this may take several minutes') output = config.base.Configure.executeShellCommand('cd '+parmetisDir+'; make clean; make lib; make minstall; make clean', timeout=2500, log = self.framework.log)[0] From mossaiby at yahoo.com Tue Apr 29 01:59:57 2008 From: mossaiby at yahoo.com (Farshid Mossaiby) Date: Mon, 28 Apr 2008 23:59:57 -0700 (PDT) Subject: Compiling PETSc with Visual Studio 2008 In-Reply-To: Message-ID: <30861.23159.qm@web52208.mail.re2.yahoo.com> Sorry I saw your email after I sent my log. Thanks for your help. Will check and report the results. Best regards, Farshid Mossaiby --- Satish Balay wrote: > On Mon, 28 Apr 2008, Satish Balay wrote: > > > > > On Sun, 27 Apr 2008, Farshid Mossaiby wrote: > > > > Configure says it cannot make ParMetis with the > option > > > --download-parmetis. Is this related to Visual > Studio 2008 > > > compiler I use, or something else is wrong? Here > is the message: > > > > Most externalpackages are never tested by their > original authors with > > MS compilers. And we have not tried porting them > to them. > > > > So most of them won't compile - hence > --download-packagename might not > > work. > > > > BTW: Currently my test windows box is down - so I > can't check if this > > is supporsed to work with MS compilers. > > Looks like parmetis does compile with MS compilers. > Please try try the > attached patch. 
> > cd petsc-2.3.3 > patch -Np1 < parmetis-win.patch > rm -rf externalpackage/ParMetis* > ./config/configure.py ..... > > This fix will be in petsc-dev. > > Satish> diff -r 4041e3152979 > python/PETSc/packages/ParMetis.py > --- a/python/PETSc/packages/ParMetis.py Mon Apr 21 > 11:20:42 2008 -0500 > +++ b/python/PETSc/packages/ParMetis.py Mon Apr 28 > 21:15:01 2008 -0500 > @@ -5,7 +5,7 @@ > class Configure(PETSc.package.Package): > def __init__(self, framework): > PETSc.package.Package.__init__(self, framework) > - self.download = > ['hg://petsc.cs.iit.edu/petsc/ParMetis-dev','ftp://ftp.mcs.anl.gov/pub/petsc/externalpackages/ParMetis-dev-p1.tar.gz'] > + self.download = > ['hg://petsc.cs.iit.edu/petsc/ParMetis-dev','ftp://ftp.mcs.anl.gov/pub/petsc/externalpackages/ParMetis-dev-p2.tar.gz'] > self.functions = ['ParMETIS_V3_PartKway'] > self.includes = ['parmetis.h'] > self.liblist = > [['libparmetis.a','libmetis.a']] > @@ -27,8 +27,9 @@ > installDir = os.path.join(parmetisDir, > self.arch.arch) > makeinc = > os.path.join(parmetisDir,'make.inc') > installmakeinc = > os.path.join(installDir,'make.inc') > - configheader = > os.path.join(parmetisDir,'ParMETISLib','configureheader.h') > - > + metisconfigheader = > os.path.join(parmetisDir,'METISLib','configureheader.h') > + parmetisconfigheader = > os.path.join(parmetisDir,'ParMETISLib','configureheader.h') > + > # Configure ParMetis > if os.path.isfile(makeinc): > os.unlink(makeinc) > @@ -63,7 +64,8 @@ > > if not os.path.isfile(installmakeinc) or not > (self.getChecksum(installmakeinc) == > self.getChecksum(makeinc)): > self.framework.log.write('Have to rebuild > ParMetis, make.inc != '+installmakeinc+'\n') > - self.framework.outputHeader(configheader) > + > self.framework.outputHeader(metisconfigheader) > + > self.framework.outputHeader(parmetisconfigheader) > try: > self.logPrintBox('Compiling & installing > Parmetis; this may take several minutes') > output = > config.base.Configure.executeShellCommand('cd > '+parmetisDir+'; make clean; make lib; make > minstall; make clean', timeout=2500, log = > self.framework.log)[0] > ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From Amit.Itagi at seagate.com Tue Apr 29 08:54:24 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Tue, 29 Apr 2008 09:54:24 -0400 Subject: DA question In-Reply-To: <47FD2297.1010602@gmail.com> Message-ID: Hi, I spent some more time understanding DA's, and how DA's should serve my purpose. Since in the time domain calculation, I will have to scatter from the global vector to the local vector and vice-versa at every iteration step, I have some follow-up questions. 1) Does the scattering involve copying the part stored on the local node as well (i.e. part of the local vector other than the ghost values), or is the local part just accessed by reference ? In the first scenario, this would involve allocating twice the storage for the local part. Also, does the scattering of the local part give a big hit in terms of CPU time ? 2) In the manual, it says "In most cases, several different vectors can share the same communication information (or, in other words, can share a given DA)" and "PETSc currently provides no container for multiple arrays sharing the same distributed array communication; note, however, that the dof parameter handles many cases of interest". I am a bit confused. 
Suppose I have two arrays having the same layout on the regular grid, can I store the first array data on one vector, and the second array data on the second vector (and have a DA with dof=1, instead of a DA with dof=2), and be able to scatter and update the first vector without scattering/updating the second vector ? Thanks Rgds, Amit owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: > Hi Amit, > > Why do you need two staggered grids? I do EM finite difference frequency > domain modeling on a staggered grid using just one DA. Works perfectly fine. > There are some grid points that are not used, but you just set them to zero > and put a 1 on the diagonal of the coefficient matrix. > > > Randy > > > Amit.Itagi at seagate.com wrote: > > Hi Berend, > > > > A detailed explanation of the finite difference scheme is given here : > > > > http://en.wikipedia.org/wiki/Finite-difference_time-domain_method > > > > > > Thanks > > > > Rgds, > > Amit > > > > > > > > > > Berend van Wachem > > > se> To > > Sent by: petsc-users at mcs.anl.gov > > owner-petsc-users cc > > @mcs.anl.gov > > No Phone Info Subject > > Available Re: DA question > > > > > > 04/09/2008 02:59 > > PM > > > > > > Please respond to > > petsc-users at mcs.a > > nl.gov > > > > > > > > > > > > > > Dear Amit, > > > > Could you explain how the two grids are attached? > > I am using multiple DA's for multiple structured grids glued together. > > I've done the gluing with setting up various IS objects. From the > > multiple DA's, one global variable vector is formed. Is that what you > > are looking for? > > > > Best regards, > > > > Berend. > > > > > > Amit.Itagi at seagate.com wrote: > >> Hi, > >> > >> Is it possible to use DA to perform finite differences on two staggered > >> regular grids (as in the electromagnetic finite difference time domain > >> method) ? Surrounding nodes from one grid are used to update the value in > >> the dual grid. In addition, local manipulations need to be done on the > >> nodal values. > >> > >> Thanks > >> > >> Rgds, > >> Amit > >> > > > > > > > From knepley at gmail.com Tue Apr 29 10:54:32 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 29 Apr 2008 10:54:32 -0500 Subject: DA question In-Reply-To: References: <47FD2297.1010602@gmail.com> Message-ID: On Tue, Apr 29, 2008 at 8:54 AM, wrote: > Hi, > > I spent some more time understanding DA's, and how DA's should serve my > purpose. Since in the time domain calculation, I will have to scatter from > the global vector to the local vector and vice-versa at every iteration > step, I have some follow-up questions. > > 1) Does the scattering involve copying the part stored on the local node as > well (i.e. part of the local vector other than the ghost values), or is the > local part just accessed by reference ? In the first scenario, this would No, you get a separate local vector since we reorder to give contiguous access. > involve allocating twice the storage for the local part. Also, does the Yes, however unless you run an explicit code at the limit of memory, this really does not matter. > scattering of the local part give a big hit in terms of CPU time ? Not for these cartesian topologies with small overlap. This is easy to prove. 
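For concreteness, the usual update cycle is just (a sketch; da, the global
vector gvec and the ghosted local vector lvec are assumed to have been
created from the same DA, and error checking is omitted):

    DAGlobalToLocalBegin(da,gvec,INSERT_VALUES,lvec);
    DAGlobalToLocalEnd(da,gvec,INSERT_VALUES,lvec);
    /* stencil work on the ghosted lvec, e.g. through DAVecGetArray() */
    DALocalToGlobal(da,lvec,INSERT_VALUES,gvec);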
> 2) In the manual, it says "In most cases, several different vectors can > share the same communication information (or, in other words, can share a > given DA)" and "PETSc currently provides no container for multiple arrays > sharing the same distributed array communication; note, however, that the > dof parameter handles many cases of interest". I am a bit confused. Suppose > I have two arrays having the same layout on the regular grid, can I store > the first array data on one vector, and the second array data on the second > vector (and have a DA with dof=1, instead of a DA with dof=2), and be able > to scatter and update the first vector without scattering/updating the > second vector ? Yes. You call DAGetGlobalVector() twice, and then when you want one vector updated, call DALocalToGlobal() or DAGlobalToLocal() with that vector. Matt > Thanks > > Rgds, > Amit -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From Amit.Itagi at seagate.com Tue Apr 29 12:10:29 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Tue, 29 Apr 2008 13:10:29 -0400 Subject: DA question In-Reply-To: Message-ID: Thanks, Matt. Rgds, Amit "Matthew Knepley" To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: DA question 04/29/2008 11:54 AM Please respond to petsc-users at mcs.a nl.gov On Tue, Apr 29, 2008 at 8:54 AM, wrote: > Hi, > > I spent some more time understanding DA's, and how DA's should serve my > purpose. Since in the time domain calculation, I will have to scatter from > the global vector to the local vector and vice-versa at every iteration > step, I have some follow-up questions. > > 1) Does the scattering involve copying the part stored on the local node as > well (i.e. part of the local vector other than the ghost values), or is the > local part just accessed by reference ? In the first scenario, this would No, you get a separate local vector since we reorder to give contiguous access. > involve allocating twice the storage for the local part. Also, does the Yes, however unless you run an explicit code at the limit of memory, this really does not matter. > scattering of the local part give a big hit in terms of CPU time ? Not for these cartesian topologies with small overlap. This is easy to prove. > 2) In the manual, it says "In most cases, several different vectors can > share the same communication information (or, in other words, can share a > given DA)" and "PETSc currently provides no container for multiple arrays > sharing the same distributed array communication; note, however, that the > dof parameter handles many cases of interest". I am a bit confused. Suppose > I have two arrays having the same layout on the regular grid, can I store > the first array data on one vector, and the second array data on the second > vector (and have a DA with dof=1, instead of a DA with dof=2), and be able > to scatter and update the first vector without scattering/updating the > second vector ? Yes. You call DAGetGlobalVector() twice, and then when you want one vector updated, call DALocalToGlobal() or DAGlobalToLocal() with that vector. Matt > Thanks > > Rgds, > Amit -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener From bsmith at mcs.anl.gov Tue Apr 29 12:28:17 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 29 Apr 2008 12:28:17 -0500 Subject: DA question In-Reply-To: References: Message-ID: If you are running a true explicit scheme then you have no need to ever have a "global representation" at each time step. In this case you can use DALocalToLocalBegin() then DALocalToLocalEnd() and pass the same vector in both locations. This will update the ghost points but WILL NOT do any copy of the local data since it is already in the correct locations. Barry On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: > Hi, > > I spent some more time understanding DA's, and how DA's should serve > my > purpose. Since in the time domain calculation, I will have to > scatter from > the global vector to the local vector and vice-versa at every > iteration > step, I have some follow-up questions. > > 1) Does the scattering involve copying the part stored on the local > node as > well (i.e. part of the local vector other than the ghost values), or > is the > local part just accessed by reference ? In the first scenario, this > would > involve allocating twice the storage for the local part. Also, does > the > scattering of the local part give a big hit in terms of CPU time ? > > 2) In the manual, it says "In most cases, several different vectors > can > share the same communication information (or, in other words, can > share a > given DA)" and "PETSc currently provides no container for multiple > arrays > sharing the same distributed array communication; note, however, > that the > dof parameter handles many cases of interest". I am a bit confused. > Suppose > I have two arrays having the same layout on the regular grid, can I > store > the first array data on one vector, and the second array data on the > second > vector (and have a DA with dof=1, instead of a DA with dof=2), and > be able > to scatter and update the first vector without scattering/updating the > second vector ? > > Thanks > > Rgds, > Amit > > owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: > >> Hi Amit, >> >> Why do you need two staggered grids? I do EM finite difference >> frequency >> domain modeling on a staggered grid using just one DA. Works >> perfectly > fine. >> There are some grid points that are not used, but you just set them >> to > zero >> and put a 1 on the diagonal of the coefficient matrix. >> >> >> Randy >> >> >> Amit.Itagi at seagate.com wrote: >>> Hi Berend, >>> >>> A detailed explanation of the finite difference scheme is given >>> here : >>> >>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method >>> >>> >>> Thanks >>> >>> Rgds, >>> Amit >>> >>> >>> >>> > >>> Berend van Wachem > >>> >>> se> > To >>> Sent by: petsc-users at mcs.anl.gov > >>> owner-petsc-users > cc >>> @mcs.anl.gov > >>> No Phone Info > Subject >>> Available Re: DA question > >>> > >>> > >>> 04/09/2008 02:59 > >>> PM > >>> > >>> > >>> Please respond to > >>> petsc-users at mcs.a > >>> nl.gov > >>> > >>> > >>> >>> >>> >>> >>> Dear Amit, >>> >>> Could you explain how the two grids are attached? >>> I am using multiple DA's for multiple structured grids glued >>> together. >>> I've done the gluing with setting up various IS objects. From the >>> multiple DA's, one global variable vector is formed. Is that what >>> you >>> are looking for? >>> >>> Best regards, >>> >>> Berend. 
>>> >>> >>> Amit.Itagi at seagate.com wrote: >>>> Hi, >>>> >>>> Is it possible to use DA to perform finite differences on two > staggered >>>> regular grids (as in the electromagnetic finite difference time >>>> domain >>>> method) ? Surrounding nodes from one grid are used to update the >>>> value > in >>>> the dual grid. In addition, local manipulations need to be done >>>> on the >>>> nodal values. >>>> >>>> Thanks >>>> >>>> Rgds, >>>> Amit >>>> >>> >>> >>> >> > From Amit.Itagi at seagate.com Tue Apr 29 14:27:53 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Tue, 29 Apr 2008 15:27:53 -0400 Subject: DA question In-Reply-To: Message-ID: Barry, Can this be achieved using SDA ? I am working with regular arrays, and doing only explicit updates. Thanks Rgds, Amit owner-petsc-users at mcs.anl.gov wrote on 04/29/2008 01:28:17 PM: > > If you are running a true explicit scheme then you have no need > to ever have a "global representation" at each time step. In this > case you can use DALocalToLocalBegin() then DALocalToLocalEnd() > and pass the same vector in both locations. This will update the ghost > points but WILL NOT do any copy of the local data since it is already > in the correct locations. > > Barry > > On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: > > > Hi, > > > > I spent some more time understanding DA's, and how DA's should serve > > my > > purpose. Since in the time domain calculation, I will have to > > scatter from > > the global vector to the local vector and vice-versa at every > > iteration > > step, I have some follow-up questions. > > > > 1) Does the scattering involve copying the part stored on the local > > node as > > well (i.e. part of the local vector other than the ghost values), or > > is the > > local part just accessed by reference ? In the first scenario, this > > would > > involve allocating twice the storage for the local part. Also, does > > the > > scattering of the local part give a big hit in terms of CPU time ? > > > > 2) In the manual, it says "In most cases, several different vectors > > can > > share the same communication information (or, in other words, can > > share a > > given DA)" and "PETSc currently provides no container for multiple > > arrays > > sharing the same distributed array communication; note, however, > > that the > > dof parameter handles many cases of interest". I am a bit confused. > > Suppose > > I have two arrays having the same layout on the regular grid, can I > > store > > the first array data on one vector, and the second array data on the > > second > > vector (and have a DA with dof=1, instead of a DA with dof=2), and > > be able > > to scatter and update the first vector without scattering/updating the > > second vector ? > > > > Thanks > > > > Rgds, > > Amit > > > > owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: > > > >> Hi Amit, > >> > >> Why do you need two staggered grids? I do EM finite difference > >> frequency > >> domain modeling on a staggered grid using just one DA. Works > >> perfectly > > fine. > >> There are some grid points that are not used, but you just set them > >> to > > zero > >> and put a 1 on the diagonal of the coefficient matrix. 
> >> > >> > >> Randy > >> > >> > >> Amit.Itagi at seagate.com wrote: > >>> Hi Berend, > >>> > >>> A detailed explanation of the finite difference scheme is given > >>> here : > >>> > >>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method > >>> > >>> > >>> Thanks > >>> > >>> Rgds, > >>> Amit > >>> > >>> > >>> > >>> > > > >>> Berend van Wachem > > > >>> > > >>> se> > > To > >>> Sent by: petsc-users at mcs.anl.gov > > > >>> owner-petsc-users > > cc > >>> @mcs.anl.gov > > > >>> No Phone Info > > Subject > >>> Available Re: DA question > > > >>> > > > >>> > > > >>> 04/09/2008 02:59 > > > >>> PM > > > >>> > > > >>> > > > >>> Please respond to > > > >>> petsc-users at mcs.a > > > >>> nl.gov > > > >>> > > > >>> > > > >>> > >>> > >>> > >>> > >>> Dear Amit, > >>> > >>> Could you explain how the two grids are attached? > >>> I am using multiple DA's for multiple structured grids glued > >>> together. > >>> I've done the gluing with setting up various IS objects. From the > >>> multiple DA's, one global variable vector is formed. Is that what > >>> you > >>> are looking for? > >>> > >>> Best regards, > >>> > >>> Berend. > >>> > >>> > >>> Amit.Itagi at seagate.com wrote: > >>>> Hi, > >>>> > >>>> Is it possible to use DA to perform finite differences on two > > staggered > >>>> regular grids (as in the electromagnetic finite difference time > >>>> domain > >>>> method) ? Surrounding nodes from one grid are used to update the > >>>> value > > in > >>>> the dual grid. In addition, local manipulations need to be done > >>>> on the > >>>> nodal values. > >>>> > >>>> Thanks > >>>> > >>>> Rgds, > >>>> Amit > >>>> > >>> > >>> > >>> > >> > > > From knepley at gmail.com Tue Apr 29 14:39:19 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 29 Apr 2008 14:39:19 -0500 Subject: DA question In-Reply-To: References: Message-ID: On Tue, Apr 29, 2008 at 2:27 PM, wrote: > Barry, > > Can this be achieved using SDA ? I am working with regular arrays, and > doing only explicit updates. What is SDA? Barry's point is that, if no solve is done (as in your case), no global system or global vectors need to be formed. You can use LocalToLocal calls to keep gohst points synchronized. Matt > Thanks > > Rgds, > Amit > > owner-petsc-users at mcs.anl.gov wrote on 04/29/2008 01:28:17 PM: > > > > > If you are running a true explicit scheme then you have no need > > to ever have a "global representation" at each time step. In this > > case you can use DALocalToLocalBegin() then DALocalToLocalEnd() > > and pass the same vector in both locations. This will update the ghost > > points but WILL NOT do any copy of the local data since it is already > > in the correct locations. > > > > Barry > > > > On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: > > > > > Hi, > > > > > > I spent some more time understanding DA's, and how DA's should serve > > > my > > > purpose. Since in the time domain calculation, I will have to > > > scatter from > > > the global vector to the local vector and vice-versa at every > > > iteration > > > step, I have some follow-up questions. > > > > > > 1) Does the scattering involve copying the part stored on the local > > > node as > > > well (i.e. part of the local vector other than the ghost values), or > > > is the > > > local part just accessed by reference ? In the first scenario, this > > > would > > > involve allocating twice the storage for the local part. Also, does > > > the > > > scattering of the local part give a big hit in terms of CPU time ? 
> > > > > > 2) In the manual, it says "In most cases, several different vectors > > > can > > > share the same communication information (or, in other words, can > > > share a > > > given DA)" and "PETSc currently provides no container for multiple > > > arrays > > > sharing the same distributed array communication; note, however, > > > that the > > > dof parameter handles many cases of interest". I am a bit confused. > > > Suppose > > > I have two arrays having the same layout on the regular grid, can I > > > store > > > the first array data on one vector, and the second array data on the > > > second > > > vector (and have a DA with dof=1, instead of a DA with dof=2), and > > > be able > > > to scatter and update the first vector without scattering/updating the > > > second vector ? > > > > > > Thanks > > > > > > Rgds, > > > Amit > > > > > > owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: > > > > > >> Hi Amit, > > >> > > >> Why do you need two staggered grids? I do EM finite difference > > >> frequency > > >> domain modeling on a staggered grid using just one DA. Works > > >> perfectly > > > fine. > > >> There are some grid points that are not used, but you just set them > > >> to > > > zero > > >> and put a 1 on the diagonal of the coefficient matrix. > > >> > > >> > > >> Randy > > >> > > >> > > >> Amit.Itagi at seagate.com wrote: > > >>> Hi Berend, > > >>> > > >>> A detailed explanation of the finite difference scheme is given > > >>> here : > > >>> > > >>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method > > >>> > > >>> > > >>> Thanks > > >>> > > >>> Rgds, > > >>> Amit > > >>> > > >>> > > >>> > > >>> > > > > > >>> Berend van Wachem > > > > > >>> > > > > >>> se> > > > To > > >>> Sent by: petsc-users at mcs.anl.gov > > > > > >>> owner-petsc-users > > > cc > > >>> @mcs.anl.gov > > > > > >>> No Phone Info > > > Subject > > >>> Available Re: DA question > > > > > >>> > > > > > >>> > > > > > >>> 04/09/2008 02:59 > > > > > >>> PM > > > > > >>> > > > > > >>> > > > > > >>> Please respond to > > > > > >>> petsc-users at mcs.a > > > > > >>> nl.gov > > > > > >>> > > > > > >>> > > > > > >>> > > >>> > > >>> > > >>> > > >>> Dear Amit, > > >>> > > >>> Could you explain how the two grids are attached? > > >>> I am using multiple DA's for multiple structured grids glued > > >>> together. > > >>> I've done the gluing with setting up various IS objects. From the > > >>> multiple DA's, one global variable vector is formed. Is that what > > >>> you > > >>> are looking for? > > >>> > > >>> Best regards, > > >>> > > >>> Berend. > > >>> > > >>> > > >>> Amit.Itagi at seagate.com wrote: > > >>>> Hi, > > >>>> > > >>>> Is it possible to use DA to perform finite differences on two > > > staggered > > >>>> regular grids (as in the electromagnetic finite difference time > > >>>> domain > > >>>> method) ? Surrounding nodes from one grid are used to update the > > >>>> value > > > in > > >>>> the dual grid. In addition, local manipulations need to be done > > >>>> on the > > >>>> nodal values. > > >>>> > > >>>> Thanks > > >>>> > > >>>> Rgds, > > >>>> Amit > > >>>> > > >>> > > >>> > > >>> > > >> > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener From bsmith at mcs.anl.gov Tue Apr 29 14:51:42 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 29 Apr 2008 14:51:42 -0500 Subject: DA question In-Reply-To: References: Message-ID: On Apr 29, 2008, at 2:27 PM, Amit.Itagi at seagate.com wrote: > Barry, > > Can this be achieved using SDA ? I am working with regular arrays, and > doing only explicit updates. Yes. SDA actually uses the DA, it just hides the Vec concept from the user. Barry > > > Thanks > > Rgds, > Amit > > owner-petsc-users at mcs.anl.gov wrote on 04/29/2008 01:28:17 PM: > >> >> If you are running a true explicit scheme then you have no need >> to ever have a "global representation" at each time step. In this >> case you can use DALocalToLocalBegin() then DALocalToLocalEnd() >> and pass the same vector in both locations. This will update the >> ghost >> points but WILL NOT do any copy of the local data since it is already >> in the correct locations. >> >> Barry >> >> On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: >> >>> Hi, >>> >>> I spent some more time understanding DA's, and how DA's should serve >>> my >>> purpose. Since in the time domain calculation, I will have to >>> scatter from >>> the global vector to the local vector and vice-versa at every >>> iteration >>> step, I have some follow-up questions. >>> >>> 1) Does the scattering involve copying the part stored on the local >>> node as >>> well (i.e. part of the local vector other than the ghost values), or >>> is the >>> local part just accessed by reference ? In the first scenario, this >>> would >>> involve allocating twice the storage for the local part. Also, does >>> the >>> scattering of the local part give a big hit in terms of CPU time ? >>> >>> 2) In the manual, it says "In most cases, several different vectors >>> can >>> share the same communication information (or, in other words, can >>> share a >>> given DA)" and "PETSc currently provides no container for multiple >>> arrays >>> sharing the same distributed array communication; note, however, >>> that the >>> dof parameter handles many cases of interest". I am a bit confused. >>> Suppose >>> I have two arrays having the same layout on the regular grid, can I >>> store >>> the first array data on one vector, and the second array data on the >>> second >>> vector (and have a DA with dof=1, instead of a DA with dof=2), and >>> be able >>> to scatter and update the first vector without scattering/updating >>> the >>> second vector ? >>> >>> Thanks >>> >>> Rgds, >>> Amit >>> >>> owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: >>> >>>> Hi Amit, >>>> >>>> Why do you need two staggered grids? I do EM finite difference >>>> frequency >>>> domain modeling on a staggered grid using just one DA. Works >>>> perfectly >>> fine. >>>> There are some grid points that are not used, but you just set them >>>> to >>> zero >>>> and put a 1 on the diagonal of the coefficient matrix. 
>>>> >>>> >>>> Randy >>>> >>>> >>>> Amit.Itagi at seagate.com wrote: >>>>> Hi Berend, >>>>> >>>>> A detailed explanation of the finite difference scheme is given >>>>> here : >>>>> >>>>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method >>>>> >>>>> >>>>> Thanks >>>>> >>>>> Rgds, >>>>> Amit >>>>> >>>>> >>>>> >>>>> >>> >>>>> Berend van Wachem >>> >>>>> >> >>>>> se> >>> To >>>>> Sent by: petsc-users at mcs.anl.gov >>> >>>>> owner-petsc-users >>> cc >>>>> @mcs.anl.gov >>> >>>>> No Phone Info >>> Subject >>>>> Available Re: DA question >>> >>>>> >>> >>>>> >>> >>>>> 04/09/2008 02:59 >>> >>>>> PM >>> >>>>> >>> >>>>> >>> >>>>> Please respond to >>> >>>>> petsc-users at mcs.a >>> >>>>> nl.gov >>> >>>>> >>> >>>>> >>> >>>>> >>>>> >>>>> >>>>> >>>>> Dear Amit, >>>>> >>>>> Could you explain how the two grids are attached? >>>>> I am using multiple DA's for multiple structured grids glued >>>>> together. >>>>> I've done the gluing with setting up various IS objects. From the >>>>> multiple DA's, one global variable vector is formed. Is that what >>>>> you >>>>> are looking for? >>>>> >>>>> Best regards, >>>>> >>>>> Berend. >>>>> >>>>> >>>>> Amit.Itagi at seagate.com wrote: >>>>>> Hi, >>>>>> >>>>>> Is it possible to use DA to perform finite differences on two >>> staggered >>>>>> regular grids (as in the electromagnetic finite difference time >>>>>> domain >>>>>> method) ? Surrounding nodes from one grid are used to update the >>>>>> value >>> in >>>>>> the dual grid. In addition, local manipulations need to be done >>>>>> on the >>>>>> nodal values. >>>>>> >>>>>> Thanks >>>>>> >>>>>> Rgds, >>>>>> Amit >>>>>> >>>>> >>>>> >>>>> >>>> >>> >> > From amjad11 at gmail.com Wed Apr 30 01:24:27 2008 From: amjad11 at gmail.com (amjad ali) Date: Wed, 30 Apr 2008 11:24:27 +0500 Subject: PETSC with MPI-GAMMA ?? Message-ID: <428810f20804292324q25cbedanf6cef98153493460@mail.gmail.com> Hello, I read that: Genoa Active Message MAchine (GAMMA ) is a low-latency replacement for TCP/IP on gigabit and is supported for Intel platforms on modern Linux kernels (both 32 and 64 bit). It completely bypasses the Linux network stack to produce record breaking latency figures. (Please see also http://www.opencfd.co.uk/openfoam/parallel1.4.html that tells how OPEN-FOAM is/will-be using GAMMA). Please comment on that can take benefit of GAMMA (if there is??) with PETSc? For example, installing PETSc with GAMMA? regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Amit.Itagi at seagate.com Wed Apr 30 10:33:11 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 30 Apr 2008 11:33:11 -0400 Subject: DA question In-Reply-To: Message-ID: Barry, I tried this out. This serves my purpose nicely. One question : How compatible is PetSc with Blitz++ ? Can I declare the array to be returned by DAVecGetArray to be a Blitz array ? Thanks Rgds, Amit Barry Smith To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: DA question 04/29/2008 01:28 PM Please respond to petsc-users at mcs.a nl.gov If you are running a true explicit scheme then you have no need to ever have a "global representation" at each time step. In this case you can use DALocalToLocalBegin() then DALocalToLocalEnd() and pass the same vector in both locations. This will update the ghost points but WILL NOT do any copy of the local data since it is already in the correct locations. 
Barry On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: > Hi, > > I spent some more time understanding DA's, and how DA's should serve > my > purpose. Since in the time domain calculation, I will have to > scatter from > the global vector to the local vector and vice-versa at every > iteration > step, I have some follow-up questions. > > 1) Does the scattering involve copying the part stored on the local > node as > well (i.e. part of the local vector other than the ghost values), or > is the > local part just accessed by reference ? In the first scenario, this > would > involve allocating twice the storage for the local part. Also, does > the > scattering of the local part give a big hit in terms of CPU time ? > > 2) In the manual, it says "In most cases, several different vectors > can > share the same communication information (or, in other words, can > share a > given DA)" and "PETSc currently provides no container for multiple > arrays > sharing the same distributed array communication; note, however, > that the > dof parameter handles many cases of interest". I am a bit confused. > Suppose > I have two arrays having the same layout on the regular grid, can I > store > the first array data on one vector, and the second array data on the > second > vector (and have a DA with dof=1, instead of a DA with dof=2), and > be able > to scatter and update the first vector without scattering/updating the > second vector ? > > Thanks > > Rgds, > Amit > > owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: > >> Hi Amit, >> >> Why do you need two staggered grids? I do EM finite difference >> frequency >> domain modeling on a staggered grid using just one DA. Works >> perfectly > fine. >> There are some grid points that are not used, but you just set them >> to > zero >> and put a 1 on the diagonal of the coefficient matrix. >> >> >> Randy >> >> >> Amit.Itagi at seagate.com wrote: >>> Hi Berend, >>> >>> A detailed explanation of the finite difference scheme is given >>> here : >>> >>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method >>> >>> >>> Thanks >>> >>> Rgds, >>> Amit >>> >>> >>> >>> > >>> Berend van Wachem > >>> >>> se> > To >>> Sent by: petsc-users at mcs.anl.gov > >>> owner-petsc-users > cc >>> @mcs.anl.gov > >>> No Phone Info > Subject >>> Available Re: DA question > >>> > >>> > >>> 04/09/2008 02:59 > >>> PM > >>> > >>> > >>> Please respond to > >>> petsc-users at mcs.a > >>> nl.gov > >>> > >>> > >>> >>> >>> >>> >>> Dear Amit, >>> >>> Could you explain how the two grids are attached? >>> I am using multiple DA's for multiple structured grids glued >>> together. >>> I've done the gluing with setting up various IS objects. From the >>> multiple DA's, one global variable vector is formed. Is that what >>> you >>> are looking for? >>> >>> Best regards, >>> >>> Berend. >>> >>> >>> Amit.Itagi at seagate.com wrote: >>>> Hi, >>>> >>>> Is it possible to use DA to perform finite differences on two > staggered >>>> regular grids (as in the electromagnetic finite difference time >>>> domain >>>> method) ? Surrounding nodes from one grid are used to update the >>>> value > in >>>> the dual grid. In addition, local manipulations need to be done >>>> on the >>>> nodal values. 
>>>> >>>> Thanks >>>> >>>> Rgds, >>>> Amit >>>> >>> >>> >>> >> > From bsmith at mcs.anl.gov Wed Apr 30 10:58:30 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 30 Apr 2008 10:58:30 -0500 Subject: DA question In-Reply-To: References: Message-ID: <1D003C34-5E65-4340-98EC-8274AA32BA16@mcs.anl.gov> On Apr 30, 2008, at 10:33 AM, Amit.Itagi at seagate.com wrote: > Barry, > > I tried this out. This serves my purpose nicely. > > One question : How compatible is PetSc with Blitz++ ? Can I declare > the > array to be returned by DAVecGetArray to be a Blitz array ? Likely you would need to use VecGetArray() and then somehow build the Blitz array using the pointer returned and the sizes of the local part of the DA. If you figure out how to do this then maybe we could have a DAVecGetArrayBlitz() Barry > > > Thanks > > Rgds, > Amit > > > > > Barry Smith > > ov> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc- > users cc > @mcs.anl.gov > No Phone Info > Subject > Available Re: DA question > > > 04/29/2008 01:28 > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > > If you are running a true explicit scheme then you have no need > to ever have a "global representation" at each time step. In this > case you can use DALocalToLocalBegin() then DALocalToLocalEnd() > and pass the same vector in both locations. This will update the ghost > points but WILL NOT do any copy of the local data since it is already > in the correct locations. > > Barry > > On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: > >> Hi, >> >> I spent some more time understanding DA's, and how DA's should serve >> my >> purpose. Since in the time domain calculation, I will have to >> scatter from >> the global vector to the local vector and vice-versa at every >> iteration >> step, I have some follow-up questions. >> >> 1) Does the scattering involve copying the part stored on the local >> node as >> well (i.e. part of the local vector other than the ghost values), or >> is the >> local part just accessed by reference ? In the first scenario, this >> would >> involve allocating twice the storage for the local part. Also, does >> the >> scattering of the local part give a big hit in terms of CPU time ? >> >> 2) In the manual, it says "In most cases, several different vectors >> can >> share the same communication information (or, in other words, can >> share a >> given DA)" and "PETSc currently provides no container for multiple >> arrays >> sharing the same distributed array communication; note, however, >> that the >> dof parameter handles many cases of interest". I am a bit confused. >> Suppose >> I have two arrays having the same layout on the regular grid, can I >> store >> the first array data on one vector, and the second array data on the >> second >> vector (and have a DA with dof=1, instead of a DA with dof=2), and >> be able >> to scatter and update the first vector without scattering/updating >> the >> second vector ? >> >> Thanks >> >> Rgds, >> Amit >> >> owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: >> >>> Hi Amit, >>> >>> Why do you need two staggered grids? I do EM finite difference >>> frequency >>> domain modeling on a staggered grid using just one DA. Works >>> perfectly >> fine. >>> There are some grid points that are not used, but you just set them >>> to >> zero >>> and put a 1 on the diagonal of the coefficient matrix. 
>>> >>> >>> Randy >>> >>> >>> Amit.Itagi at seagate.com wrote: >>>> Hi Berend, >>>> >>>> A detailed explanation of the finite difference scheme is given >>>> here : >>>> >>>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method >>>> >>>> >>>> Thanks >>>> >>>> Rgds, >>>> Amit >>>> >>>> >>>> >>>> >> >>>> Berend van Wachem >> >>>> > >>>> se> >> To >>>> Sent by: petsc-users at mcs.anl.gov >> >>>> owner-petsc-users >> cc >>>> @mcs.anl.gov >> >>>> No Phone Info >> Subject >>>> Available Re: DA question >> >>>> >> >>>> >> >>>> 04/09/2008 02:59 >> >>>> PM >> >>>> >> >>>> >> >>>> Please respond to >> >>>> petsc-users at mcs.a >> >>>> nl.gov >> >>>> >> >>>> >> >>>> >>>> >>>> >>>> >>>> Dear Amit, >>>> >>>> Could you explain how the two grids are attached? >>>> I am using multiple DA's for multiple structured grids glued >>>> together. >>>> I've done the gluing with setting up various IS objects. From the >>>> multiple DA's, one global variable vector is formed. Is that what >>>> you >>>> are looking for? >>>> >>>> Best regards, >>>> >>>> Berend. >>>> >>>> >>>> Amit.Itagi at seagate.com wrote: >>>>> Hi, >>>>> >>>>> Is it possible to use DA to perform finite differences on two >> staggered >>>>> regular grids (as in the electromagnetic finite difference time >>>>> domain >>>>> method) ? Surrounding nodes from one grid are used to update the >>>>> value >> in >>>>> the dual grid. In addition, local manipulations need to be done >>>>> on the >>>>> nodal values. >>>>> >>>>> Thanks >>>>> >>>>> Rgds, >>>>> Amit >>>>> >>>> >>>> >>>> >>> >> > > > From Amit.Itagi at seagate.com Wed Apr 30 15:27:57 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 30 Apr 2008 16:27:57 -0400 Subject: DA question In-Reply-To: Message-ID: owner-petsc-users at mcs.anl.gov wrote on 04/29/2008 01:28:17 PM: > > If you are running a true explicit scheme then you have no need > to ever have a "global representation" at each time step. In this > case you can use DALocalToLocalBegin() then DALocalToLocalEnd() > and pass the same vector in both locations. This will update the ghost > points but WILL NOT do any copy of the local data since it is already > in the correct locations. Barry, I implemented an explicit scheme using your suggestion. The scheme seems to work. Now I want to output the data to a file (according to the natural ordering of a 3D array). I guess, I would need a global vector for this. Hence, I did DACreateGlobalVector DALocalToGlocal DAGetAO Thus, I have a global vector and the AO. Now, how do access the vector elements in the AO order ? Eventually, I will use PetscFPrintf for writing. Thanks Rgds, Amit > > Barry > > On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: > > > Hi, > > > > I spent some more time understanding DA's, and how DA's should serve > > my > > purpose. Since in the time domain calculation, I will have to > > scatter from > > the global vector to the local vector and vice-versa at every > > iteration > > step, I have some follow-up questions. > > > > 1) Does the scattering involve copying the part stored on the local > > node as > > well (i.e. part of the local vector other than the ghost values), or > > is the > > local part just accessed by reference ? In the first scenario, this > > would > > involve allocating twice the storage for the local part. Also, does > > the > > scattering of the local part give a big hit in terms of CPU time ? 
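
[For reference, the ghost-refresh-in-place pattern for an explicit scheme described above might be sketched as follows: one ghosted local vector, passed as both source and target. The helper name is illustrative, the DA is assumed periodic (DA_XYZPERIODIC) so every owned point has ghost neighbours, and the update formula is a placeholder rather than an actual FDTD stencil.]

    /* Sketch: time stepping on one ghosted local vector; only the ghost
       points move between processes at each step.                        */
    #include "petscda.h"

    PetscErrorCode TimeStep(DA da, Vec vl, PetscInt nsteps)
    {
      PetscScalar ***u;
      PetscInt       i, j, k, step, xs, ys, zs, xm, ym, zm;

      DAGetCorners(da, &xs, &ys, &zs, &xm, &ym, &zm);
      for (step = 0; step < nsteps; step++) {
        /* refresh ghost values only; no copy of the locally owned part */
        DALocalToLocalBegin(da, vl, INSERT_VALUES, vl);
        DALocalToLocalEnd(da, vl, INSERT_VALUES, vl);

        DAVecGetArray(da, vl, &u);
        for (k = zs; k < zs + zm; k++)
          for (j = ys; j < ys + ym; j++)
            for (i = xs; i < xs + xm; i++)
              u[k][j][i] += 0.1*(u[k][j][i+1] - u[k][j][i-1]); /* placeholder */
        DAVecRestoreArray(da, vl, &u);
      }
      return 0;
    }
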
> > > > 2) In the manual, it says "In most cases, several different vectors > > can > > share the same communication information (or, in other words, can > > share a > > given DA)" and "PETSc currently provides no container for multiple > > arrays > > sharing the same distributed array communication; note, however, > > that the > > dof parameter handles many cases of interest". I am a bit confused. > > Suppose > > I have two arrays having the same layout on the regular grid, can I > > store > > the first array data on one vector, and the second array data on the > > second > > vector (and have a DA with dof=1, instead of a DA with dof=2), and > > be able > > to scatter and update the first vector without scattering/updating the > > second vector ? > > > > Thanks > > > > Rgds, > > Amit > > > > owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: > > > >> Hi Amit, > >> > >> Why do you need two staggered grids? I do EM finite difference > >> frequency > >> domain modeling on a staggered grid using just one DA. Works > >> perfectly > > fine. > >> There are some grid points that are not used, but you just set them > >> to > > zero > >> and put a 1 on the diagonal of the coefficient matrix. > >> > >> > >> Randy > >> > >> > >> Amit.Itagi at seagate.com wrote: > >>> Hi Berend, > >>> > >>> A detailed explanation of the finite difference scheme is given > >>> here : > >>> > >>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method > >>> > >>> > >>> Thanks > >>> > >>> Rgds, > >>> Amit > >>> > >>> > >>> > >>> > > > >>> Berend van Wachem > > > >>> > > >>> se> > > To > >>> Sent by: petsc-users at mcs.anl.gov > > > >>> owner-petsc-users > > cc > >>> @mcs.anl.gov > > > >>> No Phone Info > > Subject > >>> Available Re: DA question > > > >>> > > > >>> > > > >>> 04/09/2008 02:59 > > > >>> PM > > > >>> > > > >>> > > > >>> Please respond to > > > >>> petsc-users at mcs.a > > > >>> nl.gov > > > >>> > > > >>> > > > >>> > >>> > >>> > >>> > >>> Dear Amit, > >>> > >>> Could you explain how the two grids are attached? > >>> I am using multiple DA's for multiple structured grids glued > >>> together. > >>> I've done the gluing with setting up various IS objects. From the > >>> multiple DA's, one global variable vector is formed. Is that what > >>> you > >>> are looking for? > >>> > >>> Best regards, > >>> > >>> Berend. > >>> > >>> > >>> Amit.Itagi at seagate.com wrote: > >>>> Hi, > >>>> > >>>> Is it possible to use DA to perform finite differences on two > > staggered > >>>> regular grids (as in the electromagnetic finite difference time > >>>> domain > >>>> method) ? Surrounding nodes from one grid are used to update the > >>>> value > > in > >>>> the dual grid. In addition, local manipulations need to be done > >>>> on the > >>>> nodal values. > >>>> > >>>> Thanks > >>>> > >>>> Rgds, > >>>> Amit > >>>> > >>> > >>> > >>> > >> > > > From bsmith at mcs.anl.gov Wed Apr 30 16:41:55 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 30 Apr 2008 16:41:55 -0500 Subject: DA question In-Reply-To: References: Message-ID: <205D1872-F1BC-4B1A-BF0E-4A6FE057AA79@mcs.anl.gov> On Apr 30, 2008, at 3:27 PM, Amit.Itagi at seagate.com wrote: > > owner-petsc-users at mcs.anl.gov wrote on 04/29/2008 01:28:17 PM: > >> >> If you are running a true explicit scheme then you have no need >> to ever have a "global representation" at each time step. In this >> case you can use DALocalToLocalBegin() then DALocalToLocalEnd() >> and pass the same vector in both locations. 
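
[As background to the AO part of the question above: DAGetAO() returns the mapping between the natural ordering (x fastest, then y, then z) and PETSc's per-process ordering. A sketch of looking up where one natural index lands is below; dof = 1 is assumed, and the helper name and global sizes mx, my are made up for illustration.]

    /* Sketch: map one natural-ordering index to its PETSc-ordering index. */
    #include "petscda.h"

    PetscInt NaturalToPetscIndex(DA da, PetscInt mx, PetscInt my,
                                 PetscInt i, PetscInt j, PetscInt k)
    {
      AO       ao;
      PetscInt idx = i + j*mx + k*mx*my;   /* natural: x fastest, then y, z */

      DAGetAO(da, &ao);                    /* owned by the DA; do not destroy */
      AOApplicationToPetsc(ao, 1, &idx);   /* idx is now the PETSc index      */
      return idx;
    }

[The returned index can only be read with VecGetValues() on the process that owns it, which is why the VecView() route in the reply that follows is the easier way to dump the whole field.]
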
This will update the >> ghost >> points but WILL NOT do any copy of the local data since it is already >> in the correct locations. > > Barry, > > I implemented an explicit scheme using your suggestion. The scheme > seems to > work. Now I want to output the data to a file (according to the > natural > ordering of a 3D array). I guess, I would need a global vector for > this. > Hence, I did > > DACreateGlobalVector > DALocalToGlocal > DAGetAO > > Thus, I have a global vector and the AO. > > Now, how do access the vector elements in the AO order ? Eventually, > I will > use PetscFPrintf for writing. > It would be insane to use PetscFPrintf() to print/save the vector entries; it doesn't scale in larger problems (even moderate sized problems). You can use VecView() on the global vector; it will automatically map the vector entries to the natural ordering so you don't need to worry about the AO. VecView() as various options for presenting the values, as ASCII if you want, or binary or even HDF5 format. Barry > Thanks > > Rgds, > Amit > > > >> >> Barry >> >> On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: >> >>> Hi, >>> >>> I spent some more time understanding DA's, and how DA's should serve >>> my >>> purpose. Since in the time domain calculation, I will have to >>> scatter from >>> the global vector to the local vector and vice-versa at every >>> iteration >>> step, I have some follow-up questions. >>> >>> 1) Does the scattering involve copying the part stored on the local >>> node as >>> well (i.e. part of the local vector other than the ghost values), or >>> is the >>> local part just accessed by reference ? In the first scenario, this >>> would >>> involve allocating twice the storage for the local part. Also, does >>> the >>> scattering of the local part give a big hit in terms of CPU time ? >>> >>> 2) In the manual, it says "In most cases, several different vectors >>> can >>> share the same communication information (or, in other words, can >>> share a >>> given DA)" and "PETSc currently provides no container for multiple >>> arrays >>> sharing the same distributed array communication; note, however, >>> that the >>> dof parameter handles many cases of interest". I am a bit confused. >>> Suppose >>> I have two arrays having the same layout on the regular grid, can I >>> store >>> the first array data on one vector, and the second array data on the >>> second >>> vector (and have a DA with dof=1, instead of a DA with dof=2), and >>> be able >>> to scatter and update the first vector without scattering/updating >>> the >>> second vector ? >>> >>> Thanks >>> >>> Rgds, >>> Amit >>> >>> owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: >>> >>>> Hi Amit, >>>> >>>> Why do you need two staggered grids? I do EM finite difference >>>> frequency >>>> domain modeling on a staggered grid using just one DA. Works >>>> perfectly >>> fine. >>>> There are some grid points that are not used, but you just set them >>>> to >>> zero >>>> and put a 1 on the diagonal of the coefficient matrix. 
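
[A sketch of the VecView() route described above: view the DA's global vector through a binary (or ASCII) viewer and the entries are written in natural ordering. The helper name and filename are arbitrary; note that the PETSc release current at the time of this thread passes the viewer itself to PetscViewerDestroy(), while later releases take its address.]

    /* Sketch: write a DA global vector with VecView(); CHKERRQ omitted. */
    #include "petscda.h"

    PetscErrorCode DumpField(DA da, Vec g)
    {
      PetscViewer viewer;

      /* if the data currently lives in a ghosted local vector vl:
         DALocalToGlobal(da, vl, INSERT_VALUES, g);                */

      PetscViewerBinaryOpen(PETSC_COMM_WORLD, "field.bin",
                            FILE_MODE_WRITE, &viewer);
      VecView(g, viewer);              /* stored in natural ordering */
      PetscViewerDestroy(viewer);      /* &viewer in newer releases  */
      return 0;
    }

[For a human-readable dump, PetscViewerASCIIOpen() can be used in place of the binary viewer, though it will not scale to large grids.]
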
>>>> >>>> >>>> Randy >>>> >>>> >>>> Amit.Itagi at seagate.com wrote: >>>>> Hi Berend, >>>>> >>>>> A detailed explanation of the finite difference scheme is given >>>>> here : >>>>> >>>>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method >>>>> >>>>> >>>>> Thanks >>>>> >>>>> Rgds, >>>>> Amit >>>>> >>>>> >>>>> >>>>> >>> >>>>> Berend van Wachem >>> >>>>> >> >>>>> se> >>> To >>>>> Sent by: petsc-users at mcs.anl.gov >>> >>>>> owner-petsc-users >>> cc >>>>> @mcs.anl.gov >>> >>>>> No Phone Info >>> Subject >>>>> Available Re: DA question >>> >>>>> >>> >>>>> >>> >>>>> 04/09/2008 02:59 >>> >>>>> PM >>> >>>>> >>> >>>>> >>> >>>>> Please respond to >>> >>>>> petsc-users at mcs.a >>> >>>>> nl.gov >>> >>>>> >>> >>>>> >>> >>>>> >>>>> >>>>> >>>>> >>>>> Dear Amit, >>>>> >>>>> Could you explain how the two grids are attached? >>>>> I am using multiple DA's for multiple structured grids glued >>>>> together. >>>>> I've done the gluing with setting up various IS objects. From the >>>>> multiple DA's, one global variable vector is formed. Is that what >>>>> you >>>>> are looking for? >>>>> >>>>> Best regards, >>>>> >>>>> Berend. >>>>> >>>>> >>>>> Amit.Itagi at seagate.com wrote: >>>>>> Hi, >>>>>> >>>>>> Is it possible to use DA to perform finite differences on two >>> staggered >>>>>> regular grids (as in the electromagnetic finite difference time >>>>>> domain >>>>>> method) ? Surrounding nodes from one grid are used to update the >>>>>> value >>> in >>>>>> the dual grid. In addition, local manipulations need to be done >>>>>> on the >>>>>> nodal values. >>>>>> >>>>>> Thanks >>>>>> >>>>>> Rgds, >>>>>> Amit >>>>>> >>>>> >>>>> >>>>> >>>> >>> >> >
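
[Finally, the suggestion quoted repeatedly in this thread (keep the unused points of a single staggered grid and give them identity rows, with a zero right-hand side) could look roughly like the sketch below. The helper name is illustrative, the matrix A is assumed to have been created with DAGetMatrix(), and the "unused point" test is a placeholder since which points are unused is entirely problem dependent.]

    /* Sketch: identity rows for grid points the staggered scheme never uses. */
    #include "petscda.h"

    PetscErrorCode MarkUnusedPoints(DA da, Mat A)
    {
      MatStencil  row;
      PetscScalar one = 1.0;
      PetscInt    i, j, k, xs, ys, zs, xm, ym, zm;

      DAGetCorners(da, &xs, &ys, &zs, &xm, &ym, &zm);
      for (k = zs; k < zs + zm; k++)
        for (j = ys; j < ys + ym; j++)
          for (i = xs; i < xs + xm; i++) {
            if ((i + j + k) % 2 == 0) continue;       /* placeholder test */
            row.i = i; row.j = j; row.k = k; row.c = 0;
            MatSetValuesStencil(A, 1, &row, 1, &row, &one, INSERT_VALUES);
          }
      MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
      MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
      return 0;
    }
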