From berend at chalmers.se Fri Feb 2 02:42:13 2007 From: berend at chalmers.se (Berend van Wachem) Date: Fri, 2 Feb 2007 09:42:13 +0100 Subject: Using PETSc in structured c-grid for CFD and multigrid In-Reply-To: <804ab5d40701311932o43d85cc2q9dbf7d6189a47cdd@mail.gmail.com> References: <804ab5d40701251902x31cd3d29ye97ce5c0b2924e4d@mail.gmail.com> <804ab5d40701311932o43d85cc2q9dbf7d6189a47cdd@mail.gmail.com> Message-ID: <200702020942.13518.berend@chalmers.se> Hi Ben, It will probably work, but it will be more expensive. If you use an implicit algorithm to solve the flow, it really pays off to have the boundary conditions implicit as well. Explicit boundary conditions mean you will need additional iterations, which is really unnecessary in your case. Why not put the dependency directly in the matrix? Berend. > Hi, > > somone suggested that I treat that face as a dirichlet boundary > condition. after 1 or a few iterations, the face value will be > updated and it will be repeated until covergerence. I wonder if that > is possible as well? > > It'll make the job much easier, although the iteration may take > longer... > > On 2/1/07, Barry Smith wrote: > > The glueing might be able to be handled by using periodic for that > > dimension of the DA you create. But this gets tricky if you have > > any nodes that have "an extra degree of freedom". > > > > Barry > > > > On Wed, 31 Jan 2007, Berend van Wachem wrote: > > > Hi Ben, > > > > > > The challenge in your problem is how you "glue" the C grid in > > > the back; there you will need to do some additional scattering. > > > I would set-up the IS for this, and then use that to scatter the > > > values into "ghostcells" which will be present on the block(s). > > > > > > Berend. > > > > > > > Thank you Berend! I'll go through DA again. I'm also looking > > > > at HYPRE. Its way of creating grids and linking them seems > > > > intuitive. Btw, is there a mailing list for HYPRE similar to > > > > PETSc to discuss problems? I find that their explanation are > > > > quite brief. > > > > > > > > I tried to install HYPRE 2.0 on windows using cygwin but it > > > > failed. I then install it as an external software thru PETSc. > > > > I think it's installing HYPRE 1.0 or something. But similarly, > > > > there's illegal operation. > > > > > > > > Installing HYPRE 2.0 on my school's linux worked, though > > > > there's seems to be some minor error. So what's the best way > > > > to employ multigird? Is it to install as an external software > > > > thru PETSc or just use HYPRE on its own? > > > > > > > > Btw, it will be great if you can send me parts of your code > > > > regarding DA. > > > > > > > > Thank you very much! > > > > > > > > On 1/26/07, Berend van Wachem wrote: > > > > > Hi, > > > > > > > > > > I am not an expert - but have used PETSc for both structured > > > > > and unstructured grids. > > > > > > > > > > When you use an unstructured code for a structured grid, > > > > > there is additional overhead (addressing, connectivity) > > > > > which is redundant; this information is not required for > > > > > solving on a structured grid. I would say this is maximum a > > > > > 10% efficiency loss for bigger problems - it does not affect > > > > > solving the matrix, only in gathering your coefficients. I > > > > > would not rewrite my CFD code for this. > > > > > > > > > > If you only deal with structured grids, using the PETSc DA > > > > > framework should work for you - you are not saving all > > > > > connectivity. 
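
(For illustration of the DA workflow mentioned above: a minimal sketch, not code from this thread; the grid size and variable names are made up, error checking is omitted, and the exact argument lists vary between PETSc versions - this follows the 2.3-era DA interface.)

  #include "petscda.h"

  int main(int argc,char **argv)
  {
    DA  da;
    Vec xglobal,xlocal;

    PetscInitialize(&argc,&argv,PETSC_NULL,PETSC_NULL);

    /* 2D structured grid, 128 x 64 nodes, 1 unknown per node, stencil width 1;
       PETSc chooses the parallel decomposition */
    DACreate2d(PETSC_COMM_WORLD,DA_NONPERIODIC,DA_STENCIL_STAR,
               128,64,PETSC_DECIDE,PETSC_DECIDE,1,1,PETSC_NULL,PETSC_NULL,&da);

    /* global vector: one entry per owned node, no ghost points */
    DACreateGlobalVector(da,&xglobal);
    /* local vector: also holds ghost nodes owned by neighbouring processes */
    DACreateLocalVector(da,&xlocal);

    /* ... fill xglobal with field values ... */

    /* update ghost nodes before using neighbour values in local computations */
    DAGlobalToLocalBegin(da,xglobal,INSERT_VALUES,xlocal);
    DAGlobalToLocalEnd(da,xglobal,INSERT_VALUES,xlocal);

    VecDestroy(xlocal);
    VecDestroy(xglobal);
    DADestroy(da);
    PetscFinalize();
    return 0;
  }

The extra "glue" scatter for the C-grid cut that is mentioned above is not something the DA provides automatically; it would be built separately with an IS and a VecScatter, as described.
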
The DA framework is not difficult at all, > > > > > according to my opinion. Look at a few examples that come > > > > > with PETSc. I use a block structured solver - using multiple > > > > > DA's within one problem. Let me know if you are interested > > > > > in this, and I can send you parts of code. > > > > > > > > > > Multigrid is certainly possible (I would reccomend through > > > > > HYPRE, discussed on the mailinglist, although I still have > > > > > problems with it), but the question is how efficient it will > > > > > be for your CFD problem. For an efficient multigrid in CFD, > > > > > it is important to consider the coefficient structure > > > > > arising from the momentum equations - the grouping of cells > > > > > should occur following the advection term. Only then will > > > > > you achieve linear scaling with the problem size. For > > > > > instance, consider a rotating flow in a square box. Most > > > > > multigrid algorithms will group cells in "squares" which > > > > > will not lead to a significant improvement, as the flow > > > > > (advection, pressure grad) does not move in these squares. > > > > > In fact, to have an efficient multigrid algorithm, the cels > > > > > should be grouped along the circular flow. As this cannot be > > > > > seen directly from the pressure coefficients, I doubt any > > > > > "automatic" multigrid algorithm (in Hypre or Petsc) would be > > > > > able to capture this, but don't quote me on it - I am not > > > > > 100% sure. So concluding, if you want to do efficient > > > > > multigridding for CFD, you will need to point out which > > > > > cells are grouped into which structure, based upon the > > > > > upwind advection coefficients. > > > > > > > > > > Good luck, > > > > > > > > > > Berend. > > > > > > > > > > > Hi, > > > > > > > > > > > > I was discussing with another user in another forum > > > > > > (cfd-online.com) > > > > > > > > > > about > > > > > > > > > > > using PETSc in my cfd code. I am now using KSP to solve my > > > > > > momentum and poisson eqn by inserting values into the > > > > > > matrix. I was told that using PETSc > > > > > > this way is only for unstructured grids. It is very > > > > > > inefficient and much slower if I'm using it for my > > > > > > structured grid because I am not > > > > > > > > > > exploiting > > > > > > > > > > > the regular structure of my grid. > > > > > > > > > > > > Is that true? I'm solving flow around airfoil using > > > > > > c-grid. > > > > > > > > > > > > So how can I improve? Is it by using DA? I took a glance > > > > > > and it seems quite > > > > > > complicated. > > > > > > > > > > > > Also, is multigrid available in PETSc? Chapter 7 discusses > > > > > > about it but > > > > > > > > > > it > > > > > > > > > > > seems very brief. Is there a more elaborate tutorial > > > > > > besides that c examples? > > > > > > > > > > > > Hope someone can give me some ideas. > > > > > > > > > > > > Thank you. From jinzishuai at yahoo.com Fri Feb 2 15:22:21 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 2 Feb 2007 13:22:21 -0800 (PST) Subject: PETSc runs slower on a shared memory machine than on a cluster Message-ID: <419301.13342.qm@web36213.mail.mud.yahoo.com> Hi there, I am fairly new to PETSc but have 5 years of MPI programming already. I recently took on a project of analyzing a finite element code written in C with PETSc. 
I found out that on a shared-memory machine (60GB RAM, 16 CPUS), the code runs around 4 times slower than on a distributed memory cluster (4GB Ram, 4CPU/node), although they yield identical results. There are 1.6Million finite elements in the problem so it is a fairly large calculation. The total memory used is 3GBx16=48GB. Both the two systems run Linux as OS and the same code is compiled against the same version of MPICH-2 and PETSc. The shared-memory machine is actually a little faster than the cluster machines in terms of single process runs. I am surprised at this result since we usually tend to think that shared-memory would be much faster since the in-memory operation is much faster that the network communication. However, I read the PETSc FAQ and found that "the speed of sparse matrix computations is almost totally determined by the speed of the memory, not the speed of the CPU". This makes me wonder whether the poor performance of my code on a shared-memory machine is due to the competition of different process on the same memory bus. Since the code is still MPI based, a lot of data are moving around inside the memory. Is this a reasonable explanation of what I observed? Thank you very much. Shi ____________________________________________________________________________________ Do you Yahoo!? Everyone is raving about the all-new Yahoo! Mail beta. http://new.mail.yahoo.com From bsmith at mcs.anl.gov Fri Feb 2 15:38:39 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 2 Feb 2007 15:38:39 -0600 (CST) Subject: Non-uniform 2D mesh questions In-Reply-To: <20070130230718.29179.qmail@s402.sureserver.com> References: <20070130230718.29179.qmail@s402.sureserver.com> Message-ID: Yaron, Anything is possible :-) and maybe not terribly difficult to get started. You could use DAGetMatrx() to give you the properly pre-allocated "huge" Mat. Have each process loop over the "rectangular portion[s] of the domain" that it mostly owns (that is if a rectangular portion lies across two processes just assign it to one of them for this loop.) Then loop over the locations inside the rectangular portion calling MatSetValuesStencil() for that row of the huge matrix to put the entries from the smaller matrix INTO the huge matrix using the natural grid i,j coordindates (so not have to map the coordinates from the grid location to the location in the matrix). This may require some thought to get right but should require little coding (if you are writting hundreds and hundreds of lines of code then likely something is wrong). Good luck, Barry On Tue, 30 Jan 2007, yaron at oak-research.com wrote: > Barry- > So far I only thought of having a single large sparse matrix. > > Yaron > > > -------Original Message------- > From: Barry Smith > Subject: Re: Non-uniform 2D mesh questions > Sent: 30 Jan '07 10:58 > > > Yaron, > > Do you want to end up generating a single large sparse matrix? Like a > MPIAIJ > matrix? Or do you want to somehow not store the entire huge matrix but > still > be able to solve with the composed matrix? Or both? > > Barry > > > On Mon, 29 Jan 2007, [LINK: > http://webmail.oak-research.com/compose.php?to=yaron at oak-research.com] > yaron at oak-research.com wrote: > > > Barry- > > Yes, each block is a rectangular portion of the domain. 
> > Not so small though (more like 100 x 100 nodes)
> >
> > Yaron
> >
> > -------Original Message-------
> > From: Barry Smith
> > Subject: Re: Non-uniform 2D mesh questions
> > Sent: 29 Jan '07 19:40
> >
> > Yaron,
> >
> > Is each one of these "blocks" a small rectangular part of the
> > domain (like a 4 by 5 set of nodes)? I don't understand what you
> > want to do.
> >
> > Barry
> >
> > On Mon, 29 Jan 2007, yaron at oak-research.com wrote:
> >
> > > Hi all
> > > I have a laplace-type problem that's physically built from repeating
> > > instances of the same block.
> > > I'm creating matrices for the individual blocks, and I'd like to reuse
> > > the individual block matrices in order to compose the complete problem.
> > > (i.e. if there are 10K instances of 20 blocks, I'd like to build 20 matrices,
> > > then use them to compose the large complete matrix)
> > > Is a 2D DA the right object to do that? And if so, where can I find a
> > > small example of building the DA object in parallel, then using the
> > > different (for every instance) mappings of local nodes to global nodes
> > > in order to build the complete matrix?
> > >
> > > Thanks
> > > Yaron

From dalcinl at gmail.com Fri Feb 2 15:47:56 2007
From: dalcinl at gmail.com (Lisandro Dalcin)
Date: Fri, 2 Feb 2007 18:47:56 -0300
Subject: PETSc runs slower on a shared memory machine than on a cluster
In-Reply-To: <419301.13342.qm@web36213.mail.mud.yahoo.com>
References: <419301.13342.qm@web36213.mail.mud.yahoo.com>
Message-ID:

On 2/2/07, Shi Jin wrote:
> I found out that on a shared-memory machine (60GB RAM,
> 16 CPUs), the code runs around 4 times slower than
> on a distributed memory cluster (4GB RAM, 4 CPUs/node),
> although they yield identical results.
> However, I read the PETSc FAQ and found that "the
> speed of sparse matrix computations is almost totally
> determined by the speed of the memory, not the speed
> of the CPU".
> This makes me wonder whether the poor performance of
> my code on a shared-memory machine is due to the
> competition of different processes on the same memory
> bus. Since the code is still MPI based, a lot of data
> are moving around inside the memory. Is this a
> reasonable explanation of what I observed?

There is a point that is not clear to me. When you run on your shared-memory machine...

- Are you running it as a 'sequential' program with a global, shared memory space?

- Or are you running it through MPI, as a distributed-memory application using MPI message passing (where shared memory is the underlying communication 'channel')?
-- Lisandro Dalc?n --------------- Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) PTLC - G?emes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594 From balay at mcs.anl.gov Fri Feb 2 15:55:02 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Fri, 2 Feb 2007 15:55:02 -0600 (CST) Subject: PETSc runs slower on a shared memory machine than on a cluster In-Reply-To: <419301.13342.qm@web36213.mail.mud.yahoo.com> References: <419301.13342.qm@web36213.mail.mud.yahoo.com> Message-ID: There are 2 aspects to performance. - MPI performance [while message passing] - sequential performance for the numerical stuff. So it could be that the SMP box has better MPI performance. This can be verified with -log_summary from both the runs [and looking at VecScatter times] However with the sequential numerical codes - it primarily depends upon the bandwidth between the CPU and the memory. On the SMP box - depending upon how the memory subsystem is designed - the effective memory bandwidth per cpu could be a small fraction of the peak memory bandwidth [when all cpus are used] So you'll have to look at the memory subsystem design of each of these machines and compare the 'memory bandwidth per cpu]. The performance from log_summary - for ex: in MatMult will reflect this. [ including the above communication overhead] Satish On Fri, 2 Feb 2007, Shi Jin wrote: > Hi there, > > I am fairly new to PETSc but have 5 years of MPI > programming already. I recently took on a project of > analyzing a finite element code written in C with > PETSc. > I found out that on a shared-memory machine (60GB RAM, > 16 CPUS), the code runs around 4 times slower than > on a distributed memory cluster (4GB Ram, 4CPU/node), > although they yield identical results. > There are 1.6Million finite elements in the problem so > it is a fairly large calculation. The total memory > used is 3GBx16=48GB. > > Both the two systems run Linux as OS and the same code > is compiled against the same version of MPICH-2 and > PETSc. > > The shared-memory machine is actually a little faster > than the cluster machines in terms of single process > runs. > > I am surprised at this result since we usually tend to > think that shared-memory would be much faster since > the in-memory operation is much faster that the > network communication. > > However, I read the PETSc FAQ and found that "the > speed of sparse matrix computations is almost totally > determined by the speed of the memory, not the speed > of the CPU". > This makes me wonder whether the poor performance of > my code on a shared-memory machine is due to the > competition of different process on the same memory > bus. Since the code is still MPI based, a lot of data > are moving around inside the memory. Is this a > reasonable explanation of what I observed? > > Thank you very much. > > Shi > > > > ____________________________________________________________________________________ > Do you Yahoo!? > Everyone is raving about the all-new Yahoo! Mail beta. 
> http://new.mail.yahoo.com > > From balay at mcs.anl.gov Fri Feb 2 16:01:49 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Fri, 2 Feb 2007 16:01:49 -0600 (CST) Subject: PETSc runs slower on a shared memory machine than on a cluster In-Reply-To: References: <419301.13342.qm@web36213.mail.mud.yahoo.com> Message-ID: On Fri, 2 Feb 2007, Satish Balay wrote: > However with the sequential numerical codes - it primarily depends > upon the bandwidth between the CPU and the memory. On the SMP box - > depending upon how the memory subsystem is designed - the effective > memory bandwidth per cpu could be a small fraction of the peak memory > bandwidth [when all cpus are used] > > The shared-memory machine is actually a little faster > > than the cluster machines in terms of single process > > runs. To understand this better - think of comparing the performance in the following 2 cases: - run the sequential code when no other job is on the machine. - run the sequential code when there is another [memory intensive] job using the other 15 nodes] In a distributed cluster the performance numbers for both cases will be same. For a SMP machine - the performance of the first run will be much better than the second one [because of the sharing of memory bandwidth with competing processors] Satish From jinzishuai at yahoo.com Fri Feb 2 17:02:38 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 2 Feb 2007 15:02:38 -0800 (PST) Subject: PETSc runs slower on a shared memory machine than on a cluster In-Reply-To: Message-ID: <20070202230238.92601.qmail@web36206.mail.mud.yahoo.com> > There is a point which is not clear for me. > > When you run in your shared-memory machine... > > - Are you running your as a 'sequential' program > with a global,shared > memory space? > > - Or are you running it through MPI, as a > distributed memory > application using MPI message passing (where shared > mem is the > underlying communication 'channel') ? Thank you for replying. I run the code on a shared memory machine through MPI, just like what I do on a cluster. I simply did: petscmpirun -np 18 ./code I am not 100% sure whether MPICH-2 will automatically use shared memory as the underlying commnunication channel instead of the network but I know most MPI implementations are smart enough to do so (like LAM-MPI I used before). Could anyone confirm this? Thank you. Shi ____________________________________________________________________________________ Sucker-punch spam with award-winning protection. Try the free Yahoo! Mail Beta. http://advision.webevents.yahoo.com/mailbeta/features_spam.html From dalcinl at gmail.com Fri Feb 2 17:16:57 2007 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Fri, 2 Feb 2007 20:16:57 -0300 Subject: PETSc runs slower on a shared memory machine than on a cluster In-Reply-To: <20070202230238.92601.qmail@web36206.mail.mud.yahoo.com> References: <20070202230238.92601.qmail@web36206.mail.mud.yahoo.com> Message-ID: On 2/2/07, Shi Jin wrote: > Thank you for replying. > I run the code on a shared memory machine through MPI, > just like what I do on a cluster. I simply did: > petscmpirun -np 18 ./code > > I am not 100% sure whether MPICH-2 will automatically > use shared memory as the underlying commnunication > channel instead of the network but I know most MPI > implementations are smart enough to do so (like > LAM-MPI I used before). Could anyone confirm this? Please read the following... 
http://www-unix.mcs.anl.gov/mpi/mpich/downloads/mpich2-doc-README.txt I think for shared-memory you should try to configure MPICH2 with the following: --with-device=ch3:shm If not, perhaps configure will default to --with-device=ch3:sock and MPICH2 will use TCP sockets. I hope this help you. Regards, -- Lisandro Dalc?n --------------- Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) PTLC - G?emes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594 From knepley at gmail.com Fri Feb 2 18:20:47 2007 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 2 Feb 2007 18:20:47 -0600 Subject: PETSc runs slower on a shared memory machine than on a cluster In-Reply-To: <20070202230238.92601.qmail@web36206.mail.mud.yahoo.com> References: <20070202230238.92601.qmail@web36206.mail.mud.yahoo.com> Message-ID: On 2/2/07, Shi Jin wrote: > > > There is a point which is not clear for me. > > > > When you run in your shared-memory machine... > > > > - Are you running your as a 'sequential' program > > with a global,shared > > memory space? > > > > - Or are you running it through MPI, as a > > distributed memory > > application using MPI message passing (where shared > > mem is the > > underlying communication 'channel') ? > > Thank you for replying. > I run the code on a shared memory machine through MPI, > just like what I do on a cluster. I simply did: > petscmpirun -np 18 ./code > > I am not 100% sure whether MPICH-2 will automatically > use shared memory as the underlying commnunication > channel instead of the network but I know most MPI > implementations are smart enough to do so (like > LAM-MPI I used before). Could anyone confirm this? > Thank you. This is missing the point I think. It is just as Satish pointed out. Sparse matrix multiply is completely dominated by memory bandwidth and the shared memory machine has contention between the processes. I guarantee you that the performance problem is in the effective memory bandwidth per process. Matt Shi > > > > > > ____________________________________________________________________________________ > Sucker-punch spam with award-winning protection. > Try the free Yahoo! Mail Beta. > http://advision.webevents.yahoo.com/mailbeta/features_spam.html > > -- One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. -- Drummond Rennie -------------- next part -------------- An HTML attachment was scrubbed... URL: From jinzishuai at yahoo.com Sat Feb 3 15:46:29 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Sat, 3 Feb 2007 13:46:29 -0800 (PST) Subject: PETSc runs slower on a shared memory machine than on a cluster In-Reply-To: Message-ID: <917118.25233.qm@web36202.mail.mud.yahoo.com> Thank you. I did the same runs again with -log_summary. Here is the part that I think is most important. 
On cluster: --- Event Stage 5: Projection Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ [x]rhsLu 99 1.0 2.3875e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 9.9e+01 7 0 0 0 0 14 0 0 0 0 0 VecMDot 133334 1.0 4.1386e+02 1.6 3.43e+08 1.6 0.0e+00 0.0e+00 1.3e+05 10 18 0 0 45 21 27 0 0 49 883 VecNorm 137829 1.0 6.9839e+01 1.5 1.27e+08 1.5 0.0e+00 0.0e+00 1.4e+05 2 1 0 0 46 4 2 0 0 51 350 VecScale 137928 1.0 5.5639e+00 1.1 5.79e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 2197 VecCopy 4495 1.0 8.4510e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 142522 1.0 1.7712e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0 VecAXPY 8990 1.0 9.9013e-01 1.1 4.34e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1610 VecMAXPY 137829 1.0 2.1687e+02 1.1 4.92e+08 1.1 0.0e+00 0.0e+00 0.0e+00 6 20 0 0 0 12 29 0 0 0 1793 VecScatterBegin 137829 1.0 2.1816e+01 1.9 0.00e+00 0.0 8.3e+05 3.4e+04 0.0e+00 0 0 91 74 0 1 0100100 0 0 VecScatterEnd 137730 1.0 3.0302e+01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 VecNormalize 137829 1.0 7.6565e+01 1.4 1.68e+08 1.4 0.0e+00 0.0e+00 1.4e+05 2 2 0 0 46 4 3 0 0 51 479 MatMult 137730 1.0 3.5652e+02 1.3 2.58e+08 1.2 8.3e+05 3.4e+04 0.0e+00 9 15 91 74 0 19 21100100 0 815 MatSolve 137829 1.0 5.0916e+02 1.2 1.56e+08 1.2 0.0e+00 0.0e+00 0.0e+00 13 14 0 0 0 28 20 0 0 0 531 MatGetRow 44110737 1.0 1.1846e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 7 0 0 0 0 0 KSPGMRESOrthog 133334 1.0 6.0430e+02 1.3 3.87e+08 1.3 0.0e+00 0.0e+00 1.3e+05 15 37 0 0 45 32 54 0 0 49 1209 KSPSolve 99 1.0 1.4336e+03 1.0 2.37e+08 1.0 8.3e+05 3.4e+04 2.7e+05 40 68 91 74 91 86100100100100 944 PCSetUpOnBlocks 99 1.0 3.2687e-04 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 PCApply 137829 1.0 5.3316e+02 1.2 1.50e+08 1.2 0.0e+00 0.0e+00 0.0e+00 14 14 0 0 0 30 20 0 0 0 507 --------------------------------------------------- On the shared memory machine: --- Event Stage 5: Projection Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ [x]rhsLu 99 1.0 2.0673e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 9.9e+01 5 0 0 0 0 9 0 0 0 0 0 VecMDot 133334 1.0 7.0932e+02 2.1 2.70e+08 2.1 0.0e+00 0.0e+00 1.3e+05 11 18 0 0 45 22 27 0 0 49 515 VecNorm 137829 1.0 1.2860e+02 7.0 3.32e+08 7.0 0.0e+00 0.0e+00 1.4e+05 2 1 0 0 46 3 2 0 0 51 190 VecScale 137928 1.0 5.0018e+00 1.0 6.36e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 2444 VecCopy 4495 1.0 1.4161e+00 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 142522 1.0 1.9602e+01 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0 VecAXPY 8990 1.0 1.5128e+00 1.4 3.67e+08 1.4 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1054 VecMAXPY 137829 1.0 3.5204e+02 1.4 3.82e+08 1.4 0.0e+00 0.0e+00 0.0e+00 7 20 0 0 0 13 29 0 0 0 1105 VecScatterBegin 137829 1.0 1.4310e+01 2.2 0.00e+00 0.0 8.3e+05 3.4e+04 0.0e+00 0 0 91 74 0 0 0100100 0 0 VecScatterEnd 137730 1.0 1.5035e+02 6.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 3 0 0 0 0 0 VecNormalize 137829 1.0 1.3453e+02 5.6 3.80e+08 5.6 0.0e+00 0.0e+00 1.4e+05 2 2 0 0 46 3 3 0 0 51 272 
MatMult 137730 1.0 5.4179e+02 1.5 1.99e+08 1.4 8.3e+05 3.4e+04 0.0e+00 11 15 91 74 0 21 21100100 0 536 MatSolve 137829 1.0 7.9682e+02 1.4 1.18e+08 1.4 0.0e+00 0.0e+00 0.0e+00 16 14 0 0 0 30 20 0 0 0 339 MatGetRow 44110737 1.0 1.0296e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 5 0 0 0 0 0 KSPGMRESOrthog 133334 1.0 9.4927e+02 1.4 2.75e+08 1.4 0.0e+00 0.0e+00 1.3e+05 18 37 0 0 45 34 54 0 0 49 770 KSPSolve 99 1.0 2.0562e+03 1.0 1.65e+08 1.0 8.3e+05 3.4e+04 2.7e+05 47 68 91 74 91 91100100100100 658 PCSetUpOnBlocks 99 1.0 3.3998e-04 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 PCApply 137829 1.0 8.2326e+02 1.4 1.14e+08 1.4 0.0e+00 0.0e+00 0.0e+00 16 14 0 0 0 31 20 0 0 0 328 I do see that the cluster run is faster than the shared-memory case. However, I am not sure how I can tell the reason for this behavior is due to the memory subsystem. I don't know what evidence in the log to look for. Thanks again. Shi --- Satish Balay wrote: > There are 2 aspects to performance. > > - MPI performance [while message passing] > - sequential performance for the numerical stuff. > > So it could be that the SMP box has better MPI > performance. This can > be verified with -log_summary from both the runs > [and looking at > VecScatter times] > > However with the sequential numerical codes - it > primarily depends > upon the bandwidth between the CPU and the memory. > On the SMP box - > depending upon how the memory subsystem is designed > - the effective > memory bandwidth per cpu could be a small fraction > of the peak memory > bandwidth [when all cpus are used] > > So you'll have to look at the memory subsystem > design of each of these > machines and compare the 'memory bandwidth per cpu]. > The performance > from log_summary - for ex: in MatMult will reflect > this. [ including > the above communication overhead] > > Satish > > On Fri, 2 Feb 2007, Shi Jin wrote: > > > Hi there, > > > > I am fairly new to PETSc but have 5 years of MPI > > programming already. I recently took on a project > of > > analyzing a finite element code written in C with > > PETSc. > > I found out that on a shared-memory machine (60GB > RAM, > > 16 CPUS), the code runs around 4 times slower > than > > on a distributed memory cluster (4GB Ram, > 4CPU/node), > > although they yield identical results. > > There are 1.6Million finite elements in the > problem so > > it is a fairly large calculation. The total memory > > used is 3GBx16=48GB. > > > > Both the two systems run Linux as OS and the same > code > > is compiled against the same version of MPICH-2 > and > > PETSc. > > > > The shared-memory machine is actually a little > faster > > than the cluster machines in terms of single > process > > runs. > > > > I am surprised at this result since we usually > tend to > > think that shared-memory would be much faster > since > > the in-memory operation is much faster that the > > network communication. > > > > However, I read the PETSc FAQ and found that "the > > speed of sparse matrix computations is almost > totally > > determined by the speed of the memory, not the > speed > > of the CPU". > > This makes me wonder whether the poor performance > of > > my code on a shared-memory machine is due to the > > competition of different process on the same > memory > > bus. Since the code is still MPI based, a lot of > data > > are moving around inside the memory. Is this a > > reasonable explanation of what I observed? > > > > Thank you very much. 
> >
> > Shi

From jinzishuai at yahoo.com Sat Feb 3 15:50:01 2007
From: jinzishuai at yahoo.com (Shi Jin)
Date: Sat, 3 Feb 2007 13:50:01 -0800 (PST)
Subject: PETSc runs slower on a shared memory machine than on a cluster
In-Reply-To:
Message-ID: <323724.17234.qm@web36208.mail.mud.yahoo.com>

Thank you. I rebuilt MPICH-2 with --with-device=ch3:shm and --with-pm=gforker. I did see a slight improvement in speed. However, compared with the cluster runs, the shared-memory performance is still not nearly as good. So I think the problem is indeed in the memory subsystem, as Satish said.

Shi

--- Lisandro Dalcin wrote:
> Please read the following...
>
> http://www-unix.mcs.anl.gov/mpi/mpich/downloads/mpich2-doc-README.txt
>
> I think for shared memory you should try to configure MPICH2 with the following:
>
> --with-device=ch3:shm
>
> If not, configure will probably default to --with-device=ch3:sock and MPICH2 will use TCP sockets.
>
> I hope this helps you.
>
> Regards,
>
> --
> Lisandro Dalcín
> ---------------
> Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
> Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
> Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
> PTLC - Güemes 3450, (3000) Santa Fe, Argentina
> Tel/Fax: +54-(0)342-451.1594

From bsmith at mcs.anl.gov Sat Feb 3 18:57:29 2007
From: bsmith at mcs.anl.gov (Barry Smith)
Date: Sat, 3 Feb 2007 18:57:29 -0600 (CST)
Subject: PETSc runs slower on a shared memory machine than on a cluster
In-Reply-To: <917118.25233.qm@web36202.mail.mud.yahoo.com>
References: <917118.25233.qm@web36202.mail.mud.yahoo.com>
Message-ID:

  Total Flop rate     Cluster    Shared memory
  VecMAXPY               1793             1105
  MatSolve                815              339

The vector operations in MAXPY and the triangular solves in MatSolve are memory bandwidth limited (triangular solves extremely so). When all the processors are demanding their needed memory bandwidth in the triangular solves, the performance suffers: 339 vs 815 compared with the distributed memory case, where each processor has its own memory.

   Barry

On Sat, 3 Feb 2007, Shi Jin wrote:

> Thank you.
> I did the same runs again with -log_summary. Here is
> the part that I think is most important.
> On cluster: > --- Event Stage 5: Projection > Event Count Time (sec) > Flops/sec --- Global --- --- > Stage --- Total > Max Ratio Max Ratio Max > Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M > %L %R Mflop/s > ------------------------------------------------------------------------------------------------------------------------ > [x]rhsLu 99 1.0 2.3875e+02 1.0 0.00e+00 > 0.0 0.0e+00 0.0e+00 9.9e+01 7 0 0 0 0 14 0 0 > 0 0 0 > VecMDot 133334 1.0 4.1386e+02 1.6 3.43e+08 > 1.6 0.0e+00 0.0e+00 1.3e+05 10 18 0 0 45 21 27 0 > 0 49 883 > VecNorm 137829 1.0 6.9839e+01 1.5 1.27e+08 > 1.5 0.0e+00 0.0e+00 1.4e+05 2 1 0 0 46 4 2 0 > 0 51 350 > VecScale 137928 1.0 5.5639e+00 1.1 5.79e+08 > 1.1 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 > 0 0 2197 > VecCopy 4495 1.0 8.4510e-01 1.1 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > 0 0 0 > VecSet 142522 1.0 1.7712e+01 1.5 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 > 0 0 0 > VecAXPY 8990 1.0 9.9013e-01 1.1 4.34e+08 > 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > 0 0 1610 > VecMAXPY 137829 1.0 2.1687e+02 1.1 4.92e+08 > 1.1 0.0e+00 0.0e+00 0.0e+00 6 20 0 0 0 12 29 0 > 0 0 1793 > VecScatterBegin 137829 1.0 2.1816e+01 1.9 0.00e+00 > 0.0 8.3e+05 3.4e+04 0.0e+00 0 0 91 74 0 1 > 0100100 0 0 > VecScatterEnd 137730 1.0 3.0302e+01 1.6 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 > 0 0 0 > VecNormalize 137829 1.0 7.6565e+01 1.4 1.68e+08 > 1.4 0.0e+00 0.0e+00 1.4e+05 2 2 0 0 46 4 3 0 > 0 51 479 > MatMult 137730 1.0 3.5652e+02 1.3 2.58e+08 > 1.2 8.3e+05 3.4e+04 0.0e+00 9 15 91 74 0 19 > 21100100 0 815 > MatSolve 137829 1.0 5.0916e+02 1.2 1.56e+08 > 1.2 0.0e+00 0.0e+00 0.0e+00 13 14 0 0 0 28 20 0 > 0 0 531 > MatGetRow 44110737 1.0 1.1846e+02 1.0 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 7 0 0 > 0 0 0 > KSPGMRESOrthog 133334 1.0 6.0430e+02 1.3 3.87e+08 > 1.3 0.0e+00 0.0e+00 1.3e+05 15 37 0 0 45 32 54 0 > 0 49 1209 > KSPSolve 99 1.0 1.4336e+03 1.0 2.37e+08 > 1.0 8.3e+05 3.4e+04 2.7e+05 40 68 91 74 91 > 86100100100100 944 > PCSetUpOnBlocks 99 1.0 3.2687e-04 1.2 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > 0 0 0 > PCApply 137829 1.0 5.3316e+02 1.2 1.50e+08 > 1.2 0.0e+00 0.0e+00 0.0e+00 14 14 0 0 0 30 20 0 > 0 0 507 > --------------------------------------------------- > On the shared memory machine: > --- Event Stage 5: Projection > Event Count Time (sec) > Flops/sec --- Global --- --- > Stage --- Total > Max Ratio Max Ratio Max > Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M > %L %R Mflop/s > ------------------------------------------------------------------------------------------------------------------------ > [x]rhsLu 99 1.0 2.0673e+02 1.0 0.00e+00 > 0.0 0.0e+00 0.0e+00 9.9e+01 5 0 0 0 0 9 0 0 > 0 0 0 > VecMDot 133334 1.0 7.0932e+02 2.1 2.70e+08 > 2.1 0.0e+00 0.0e+00 1.3e+05 11 18 0 0 45 22 27 0 > 0 49 515 > VecNorm 137829 1.0 1.2860e+02 7.0 3.32e+08 > 7.0 0.0e+00 0.0e+00 1.4e+05 2 1 0 0 46 3 2 0 > 0 51 190 > VecScale 137928 1.0 5.0018e+00 1.0 6.36e+08 > 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 > 0 0 2444 > VecCopy 4495 1.0 1.4161e+00 1.8 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > 0 0 0 > VecSet 142522 1.0 1.9602e+01 2.1 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 > 0 0 0 > VecAXPY 8990 1.0 1.5128e+00 1.4 3.67e+08 > 1.4 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > 0 0 1054 > VecMAXPY 137829 1.0 3.5204e+02 1.4 3.82e+08 > 1.4 0.0e+00 0.0e+00 0.0e+00 7 20 0 0 0 13 29 0 > 0 0 1105 > VecScatterBegin 137829 1.0 1.4310e+01 2.2 0.00e+00 > 0.0 8.3e+05 3.4e+04 0.0e+00 0 0 91 74 0 0 > 0100100 0 0 > 
VecScatterEnd 137730 1.0 1.5035e+02 6.5 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 3 0 0 > 0 0 0 > VecNormalize 137829 1.0 1.3453e+02 5.6 3.80e+08 > 5.6 0.0e+00 0.0e+00 1.4e+05 2 2 0 0 46 3 3 0 > 0 51 272 > MatMult 137730 1.0 5.4179e+02 1.5 1.99e+08 > 1.4 8.3e+05 3.4e+04 0.0e+00 11 15 91 74 0 21 > 21100100 0 536 > MatSolve 137829 1.0 7.9682e+02 1.4 1.18e+08 > 1.4 0.0e+00 0.0e+00 0.0e+00 16 14 0 0 0 30 20 0 > 0 0 339 > MatGetRow 44110737 1.0 1.0296e+02 1.0 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 5 0 0 > 0 0 0 > KSPGMRESOrthog 133334 1.0 9.4927e+02 1.4 2.75e+08 > 1.4 0.0e+00 0.0e+00 1.3e+05 18 37 0 0 45 34 54 0 > 0 49 770 > KSPSolve 99 1.0 2.0562e+03 1.0 1.65e+08 > 1.0 8.3e+05 3.4e+04 2.7e+05 47 68 91 74 91 > 91100100100100 658 > PCSetUpOnBlocks 99 1.0 3.3998e-04 1.5 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > 0 0 0 > PCApply 137829 1.0 8.2326e+02 1.4 1.14e+08 > 1.4 0.0e+00 0.0e+00 0.0e+00 16 14 0 0 0 31 20 0 > 0 0 328 > > I do see that the cluster run is faster than the > shared-memory case. However, I am not sure how I can > tell the reason for this behavior is due to the memory > subsystem. I don't know what evidence in the log to > look for. > Thanks again. > > Shi > --- Satish Balay wrote: > > > There are 2 aspects to performance. > > > > - MPI performance [while message passing] > > - sequential performance for the numerical stuff. > > > > So it could be that the SMP box has better MPI > > performance. This can > > be verified with -log_summary from both the runs > > [and looking at > > VecScatter times] > > > > However with the sequential numerical codes - it > > primarily depends > > upon the bandwidth between the CPU and the memory. > > On the SMP box - > > depending upon how the memory subsystem is designed > > - the effective > > memory bandwidth per cpu could be a small fraction > > of the peak memory > > bandwidth [when all cpus are used] > > > > So you'll have to look at the memory subsystem > > design of each of these > > machines and compare the 'memory bandwidth per cpu]. > > The performance > > from log_summary - for ex: in MatMult will reflect > > this. [ including > > the above communication overhead] > > > > Satish > > > > On Fri, 2 Feb 2007, Shi Jin wrote: > > > > > Hi there, > > > > > > I am fairly new to PETSc but have 5 years of MPI > > > programming already. I recently took on a project > > of > > > analyzing a finite element code written in C with > > > PETSc. > > > I found out that on a shared-memory machine (60GB > > RAM, > > > 16 CPUS), the code runs around 4 times slower > > than > > > on a distributed memory cluster (4GB Ram, > > 4CPU/node), > > > although they yield identical results. > > > There are 1.6Million finite elements in the > > problem so > > > it is a fairly large calculation. The total memory > > > used is 3GBx16=48GB. > > > > > > Both the two systems run Linux as OS and the same > > code > > > is compiled against the same version of MPICH-2 > > and > > > PETSc. > > > > > > The shared-memory machine is actually a little > > faster > > > than the cluster machines in terms of single > > process > > > runs. > > > > > > I am surprised at this result since we usually > > tend to > > > think that shared-memory would be much faster > > since > > > the in-memory operation is much faster that the > > > network communication. > > > > > > However, I read the PETSc FAQ and found that "the > > > speed of sparse matrix computations is almost > > totally > > > determined by the speed of the memory, not the > > speed > > > of the CPU". 
> > > This makes me wonder whether the poor performance > > of > > > my code on a shared-memory machine is due to the > > > competition of different process on the same > > memory > > > bus. Since the code is still MPI based, a lot of > > data > > > are moving around inside the memory. Is this a > > > reasonable explanation of what I observed? > > > > > > Thank you very much. > > > > > > Shi > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Do you Yahoo!? > > > Everyone is raving about the all-new Yahoo! Mail > > beta. > > > http://new.mail.yahoo.com > > > > > > > > > > > > > > > ____________________________________________________________________________________ > Need a quick answer? Get one in minutes from people who know. > Ask your question on www.Answers.yahoo.com > > From balay at mcs.anl.gov Sat Feb 3 19:00:04 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Sat, 3 Feb 2007 19:00:04 -0600 (CST) Subject: PETSc runs slower on a shared memory machine than on a cluster In-Reply-To: <917118.25233.qm@web36202.mail.mud.yahoo.com> References: <917118.25233.qm@web36202.mail.mud.yahoo.com> Message-ID: On Sat, 3 Feb 2007, Shi Jin wrote: > I do see that the cluster run is faster than the shared-memory > case. However, I am not sure how I can tell the reason for this > behavior is due to the memory subsystem. I don't know what evidence > in the log to look for. There were too many linewraps in the e-mailed text. Its best to send such text as attachments so that the format is preserved [and readable] Event Count Time (sec) Flops/sec --- Global --- ---Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ VecScatterBegin 137829 1.0 2.1816e+01 1.9 0.00e+00 0.0 8.3e+05 3.4e+04 0.0e+00 0 0 91 74 0 1 0100100 0 0 VecScatterEnd 137730 1.0 3.0302e+01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatMult 137730 1.0 3.5652e+02 1.3 2.58e+08 1.2 8.3e+05 3.4e+04 0.0e+00 9 15 91 74 0 1921100100 0 815 VecScatterBegin 137829 1.0 1.4310e+01 2.2 0.00e+00 0.0 8.3e+05 3.4e+04 0.0e+00 0 0 91 74 0 0 0100100 0 0 VecScatterEnd 137730 1.0 1.5035e+02 6.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 3 0 0 0 0 0 MatMult 137730 1.0 5.4179e+02 1.5 1.99e+08 1.4 8.3e+05 3.4e+04 0.0e+00 11 15 91 74 0 2121100100 0 536 Just looking at the time [in seconds] for VecScatterBegin() ,VecScatterEnd() ,MatMult() [which is the 4th column in the table] we have: [time in seconds] Cluster SMP VecScatterBegin 21 14 VecScatterEnd 30 150 MatMult 356 541 ----------------------------------- And MatMult is basically some local computation + Communication [which is scatter time], then if you consider just the local coputation time - and not the communication time, its its '356 -(21+30)' on the cluster and '541-(14+150)' on the SMP box. ----------------------------------- Communication cost 51 164 MatMult - (comm) 305 377 Considering this info - we can conclude the following: ** the communication cost on the the SMP box [164 seconds] is lot higher than communication cost on the cluster [51 seconds]. Part of the issue here is the load balance between all procs. 
[This is shown by the 5th column in the table] [load balance ratio] Cluster SMP VecScatterBegin 1.9 2.2 VecScatterEnd 1.6 6.5 MatMult 1.3 1.5 Somehow things are more balanced on the cluster than on the SMP, causing some procs to run slower than others - resulting in higher communication cost on the SMP box. ** The numerical part of MatMult is faster on the cluster [305 seconds] compared to the SMP box [377 seconds]. This is very likely due to the memory bandwidth issues. So both computation and communicaton times are better on the cluster [for MatMult - which is an essential kernel in sparse matrix solve]. Satish From dalcinl at gmail.com Sat Feb 3 19:37:42 2007 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Sat, 3 Feb 2007 22:37:42 -0300 Subject: PETSc runs slower on a shared memory machine than on a cluster In-Reply-To: <323724.17234.qm@web36208.mail.mud.yahoo.com> References: <323724.17234.qm@web36208.mail.mud.yahoo.com> Message-ID: On 2/3/07, Shi Jin wrote: > Thank you > I rebuilt MPICH-2 with --with-device=ch3:shm and > --with-pm=gforker > I did see a slight improvement in speed. However, > compared with the cluster runs, the shared-memory > performance is still not as good at all. > So I think the problem is indeed in the memory > subsystem as Satith said. Shi, can you provide me some more info about all this? - What kind of problem are you solving? - Are you using MATMPIAIJ or MATMPIBAIJ? - What do you use to partition your problem (ParMetis)? - How many processes do you have in your run (-np option) ? - When you run in your cluster, you launc 1 process in each CPU of your node? I mean, do you have 4 processes runing in each node? - What kind of network do you have in your cluster? GiE? or something better? I ask all this regarding previous comments of Barry and Shatish. If you have 4 processes running on each node, them surely communicate each other using the loopback interface, and this will have a bandwidth similar to your memory bandwidth, so in your case not all communication will go through the wires... Sorry for my English, Regards, -- Lisandro Dalc?n --------------- Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) PTLC - G?emes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594 From jinzishuai at yahoo.com Mon Feb 5 17:23:24 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Mon, 5 Feb 2007 15:23:24 -0800 (PST) Subject: PETSc runs slower on a shared memory machine than on a cluster In-Reply-To: <419301.13342.qm@web36213.mail.mud.yahoo.com> Message-ID: <321985.44474.qm@web36201.mail.mud.yahoo.com> Hi there, I have made some new progress on the issue of SMP performance. Since my shared memory machine is a 8 dual-core Opteron machine. I think the two cores on a single CPU chip shares the memory bandwidth. Therefore, if I can avoid using the same core on the chip, I can get some performance improvement. Indeed, I am able to do this by the linux command taskset. Here is what I did: petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 ../spAF This way, I specifically ask the processes to be run on the first core on the CPUs. By doing this, my performance is doubled compared with the simple petscmpirun -n 8 ../spAF So this test shows that we do suffer from the competition of resources of multiple processes, especially when we use 16 processes. 
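
(The same pinning can also be done from inside the program instead of with taskset on the command line. Below is a minimal, Linux-specific sketch; the assumption that logical CPUs 0,2,4,... sit on different sockets is exactly the assumption behind the taskset command above, and it needs to be checked against the actual machine topology, e.g. in /proc/cpuinfo.)

  #define _GNU_SOURCE
  #include <sched.h>
  #include "petsc.h"

  int main(int argc,char **argv)
  {
    int       rank;
    cpu_set_t mask;

    PetscInitialize(&argc,&argv,PETSC_NULL,PETSC_NULL);
    MPI_Comm_rank(PETSC_COMM_WORLD,&rank);

    /* bind this MPI process to logical CPU 2*rank, i.e. one core per socket
       for ranks 0..7 on an 8-socket dual-core box */
    CPU_ZERO(&mask);
    CPU_SET(2*rank,&mask);
    if (sched_setaffinity(0,sizeof(mask),&mask)) {
      PetscPrintf(PETSC_COMM_SELF,"[%d] could not set CPU affinity\n",rank);
    }

    /* ... rest of the application ... */

    PetscFinalize();
    return 0;
  }
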
However, I should point out that even with the help taskset, the shared-memory performance is still 30% less than that on the cluster. I am not sure whether this problem exists specifically for the AMD machines or it applys to any shared-memory architecture. Thanks. Shi --- Shi Jin wrote: > Hi there, > > I am fairly new to PETSc but have 5 years of MPI > programming already. I recently took on a project of > analyzing a finite element code written in C with > PETSc. > I found out that on a shared-memory machine (60GB > RAM, > 16 CPUS), the code runs around 4 times slower > than > on a distributed memory cluster (4GB Ram, > 4CPU/node), > although they yield identical results. > There are 1.6Million finite elements in the problem > so > it is a fairly large calculation. The total memory > used is 3GBx16=48GB. > > Both the two systems run Linux as OS and the same > code > is compiled against the same version of MPICH-2 and > PETSc. > > The shared-memory machine is actually a little > faster > than the cluster machines in terms of single process > runs. > > I am surprised at this result since we usually tend > to > think that shared-memory would be much faster since > the in-memory operation is much faster that the > network communication. > > However, I read the PETSc FAQ and found that "the > speed of sparse matrix computations is almost > totally > determined by the speed of the memory, not the speed > of the CPU". > This makes me wonder whether the poor performance of > my code on a shared-memory machine is due to the > competition of different process on the same memory > bus. Since the code is still MPI based, a lot of > data > are moving around inside the memory. Is this a > reasonable explanation of what I observed? > > Thank you very much. > > Shi > > > > ____________________________________________________________________________________ > Do you Yahoo!? > Everyone is raving about the all-new Yahoo! Mail > beta. > http://new.mail.yahoo.com > > ____________________________________________________________________________________ Expecting? Get great news right away with email Auto-Check. Try the Yahoo! Mail Beta. http://advision.webevents.yahoo.com/mailbeta/newmail_tools.html From balay at mcs.anl.gov Mon Feb 5 18:33:15 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 5 Feb 2007 18:33:15 -0600 (CST) Subject: PETSc runs slower on a shared memory machine than on a cluster In-Reply-To: <321985.44474.qm@web36201.mail.mud.yahoo.com> References: <321985.44474.qm@web36201.mail.mud.yahoo.com> Message-ID: A couple of comments: - with the dual core opteron - the memorybandwith per core is now reduced by half - so the performance suffers. However memory bandwidth across CPUs is scalable. [6.4 Gb/s per each node or 3.2Gb/s per core] - Current generation Intel Core 2 duo appears to claim having sufficient bandwidth [15.3Gb/s per node = 7.6Gb/s per core?] so from this bandwidth number - this chip might do better than the AMD chip. However I'm not sure if there is a SMP with this chip - which has scalable memory system [across say 8 nodes - as you currently have..] - Older intel SMP boxes has a single memory bank shared across all the CPUs [so effective bandwidth per CPU was pretty small. Optrons' scalable architecture looked much better than the older intel SMPs] - From previous log_summary - part of the inefficiency of the SMP box [when compared to the cluster] was in the MPI performance. Do you still see this effect in the '-np 8' runs? 
If so this could be the [part of the] reason for this 30% reduction in performance. Satish On Mon, 5 Feb 2007, Shi Jin wrote: > Hi there, > > I have made some new progress on the issue of SMP > performance. Since my shared memory machine is a 8 > dual-core Opteron machine. I think the two cores on a > single CPU chip shares the memory bandwidth. > Therefore, if I can avoid using the same core on the > chip, I can get some performance improvement. Indeed, > I am able to do this by the linux command taskset. > Here is what I did: > petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 ../spAF > This way, I specifically ask the processes to be run > on the first core on the CPUs. > By doing this, my performance is doubled compared with > the simple petscmpirun -n 8 ../spAF > > So this test shows that we do suffer from the > competition of resources of multiple processes, > especially when we use 16 processes. > > However, I should point out that even with the help > taskset, the shared-memory performance is still 30% > less than that on the cluster. > > I am not sure whether this problem exists specifically > for the AMD machines or it applys to any shared-memory > architecture. > > Thanks. > Shi > > --- Shi Jin wrote: > > > Hi there, > > > > I am fairly new to PETSc but have 5 years of MPI > > programming already. I recently took on a project of > > analyzing a finite element code written in C with > > PETSc. > > I found out that on a shared-memory machine (60GB > > RAM, > > 16 CPUS), the code runs around 4 times slower > > than > > on a distributed memory cluster (4GB Ram, > > 4CPU/node), > > although they yield identical results. > > There are 1.6Million finite elements in the problem > > so > > it is a fairly large calculation. The total memory > > used is 3GBx16=48GB. > > > > Both the two systems run Linux as OS and the same > > code > > is compiled against the same version of MPICH-2 and > > PETSc. > > > > The shared-memory machine is actually a little > > faster > > than the cluster machines in terms of single process > > runs. > > > > I am surprised at this result since we usually tend > > to > > think that shared-memory would be much faster since > > the in-memory operation is much faster that the > > network communication. > > > > However, I read the PETSc FAQ and found that "the > > speed of sparse matrix computations is almost > > totally > > determined by the speed of the memory, not the speed > > of the CPU". > > This makes me wonder whether the poor performance of > > my code on a shared-memory machine is due to the > > competition of different process on the same memory > > bus. Since the code is still MPI based, a lot of > > data > > are moving around inside the memory. Is this a > > reasonable explanation of what I observed? > > > > Thank you very much. > > > > Shi > > > > > > > > > ____________________________________________________________________________________ > > Do you Yahoo!? > > Everyone is raving about the all-new Yahoo! Mail > > beta. > > http://new.mail.yahoo.com > > > > > > > > > ____________________________________________________________________________________ > Expecting? Get great news right away with email Auto-Check. > Try the Yahoo! Mail Beta. 
> http://advision.webevents.yahoo.com/mailbeta/newmail_tools.html > > From balay at mcs.anl.gov Mon Feb 5 19:05:46 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 5 Feb 2007 19:05:46 -0600 (CST) Subject: PETSc runs slower on a shared memory machine than on a cluster In-Reply-To: References: <321985.44474.qm@web36201.mail.mud.yahoo.com> Message-ID: One more comment in regards to single core vs dual core opteron: There are two ways to evaluate the performance. Performance per core - or performance for the price [of the machine]. Ideally we'd like the performance per core be scalable [for publishing pretty graphs]. However the dual core machine does not cost twice the cost of single core machine. [Its probably costs 10-30% more]. So realistically - if one can get the same factor of improvement in performance with 16nodes vs 8nodes, one can consider the dual core machine as providing reasonable performance. Satish On Mon, 5 Feb 2007, Satish Balay wrote: > A couple of comments: > > - with the dual core opteron - the memorybandwith per core is now > reduced by half - so the performance suffers. However memory > bandwidth across CPUs is scalable. [6.4 Gb/s per each node or 3.2Gb/s > per core] > > - Current generation Intel Core 2 duo appears to claim having > sufficient bandwidth [15.3Gb/s per node = 7.6Gb/s per core?] so from > this bandwidth number - this chip might do better than the AMD > chip. However I'm not sure if there is a SMP with this chip - which > has scalable memory system [across say 8 nodes - as you currently > have..] > > - Older intel SMP boxes has a single memory bank shared across all the > CPUs [so effective bandwidth per CPU was pretty small. Optrons' > scalable architecture looked much better than the older intel SMPs] > > - From previous log_summary - part of the inefficiency of the SMP box > [when compared to the cluster] was in the MPI performance. Do you > still see this effect in the '-np 8' runs? If so this could be the > [part of the] reason for this 30% reduction in performance. > > Satish > > On Mon, 5 Feb 2007, Shi Jin wrote: > > > Hi there, > > > > I have made some new progress on the issue of SMP > > performance. Since my shared memory machine is a 8 > > dual-core Opteron machine. I think the two cores on a > > single CPU chip shares the memory bandwidth. > > Therefore, if I can avoid using the same core on the > > chip, I can get some performance improvement. Indeed, > > I am able to do this by the linux command taskset. > > Here is what I did: > > petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 ../spAF > > This way, I specifically ask the processes to be run > > on the first core on the CPUs. > > By doing this, my performance is doubled compared with > > the simple petscmpirun -n 8 ../spAF > > > > So this test shows that we do suffer from the > > competition of resources of multiple processes, > > especially when we use 16 processes. > > > > However, I should point out that even with the help > > taskset, the shared-memory performance is still 30% > > less than that on the cluster. > > > > I am not sure whether this problem exists specifically > > for the AMD machines or it applys to any shared-memory > > architecture. > > > > Thanks. > > Shi > > > > --- Shi Jin wrote: > > > > > Hi there, > > > > > > I am fairly new to PETSc but have 5 years of MPI > > > programming already. I recently took on a project of > > > analyzing a finite element code written in C with > > > PETSc. 
> > > I found out that on a shared-memory machine (60GB > > > RAM, > > > 16 CPUS), the code runs around 4 times slower > > > than > > > on a distributed memory cluster (4GB Ram, > > > 4CPU/node), > > > although they yield identical results. > > > There are 1.6Million finite elements in the problem > > > so > > > it is a fairly large calculation. The total memory > > > used is 3GBx16=48GB. > > > > > > Both the two systems run Linux as OS and the same > > > code > > > is compiled against the same version of MPICH-2 and > > > PETSc. > > > > > > The shared-memory machine is actually a little > > > faster > > > than the cluster machines in terms of single process > > > runs. > > > > > > I am surprised at this result since we usually tend > > > to > > > think that shared-memory would be much faster since > > > the in-memory operation is much faster that the > > > network communication. > > > > > > However, I read the PETSc FAQ and found that "the > > > speed of sparse matrix computations is almost > > > totally > > > determined by the speed of the memory, not the speed > > > of the CPU". > > > This makes me wonder whether the poor performance of > > > my code on a shared-memory machine is due to the > > > competition of different process on the same memory > > > bus. Since the code is still MPI based, a lot of > > > data > > > are moving around inside the memory. Is this a > > > reasonable explanation of what I observed? > > > > > > Thank you very much. > > > > > > Shi > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Do you Yahoo!? > > > Everyone is raving about the all-new Yahoo! Mail > > > beta. > > > http://new.mail.yahoo.com > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > Expecting? Get great news right away with email Auto-Check. > > Try the Yahoo! Mail Beta. > > http://advision.webevents.yahoo.com/mailbeta/newmail_tools.html > > > > > > From yaron at oak-research.com Wed Feb 7 00:33:25 2007 From: yaron at oak-research.com (yaron at oak-research.com) Date: Tue, 06 Feb 2007 22:33:25 -0800 Subject: Non-uniform 2D mesh questions Message-ID: <20070207063325.20285.qmail@s402.sureserver.com> Barry- Maybe i'd better provide more details on what I'm trying to do *) I'm modeling current flowing through several different "block types", each of which describes a section of a semiconductor device. Each block type has a different geometry, which is triangulated to create an AIJ matrix (Each row/column in the matrix represents a coordinate, and the matrix values represent electrical admittance). There are about 100 different types of these blocks, and since they have quite convoluted geometries , their triangulation takes quite a while. *) My complete problem is composed of many (up to 10K) tiles, each of which is one of the 100 blocks . I want to reuse the triangulation which was done for each of the block, do I'd like to have a way of taking the matrix objects of the individual blcks, and combine them into a large matrix, taking into account their relative locations. *) This means that for each block instance, I would need to to translate every internal coordinate/node, and map it to a global coordinate/node. *) Once I have a mapping of local to global indices, I'd like to take the matrix values of the instances, and combine them to form a large matrix which describes the complete problem. So my question is : *) What data structures (DA/IS/AO/???) 
should I use to achieve the above? Best Regards Yaron -------Original Message------- From: Barry Smith Subject: Re: Non-uniform 2D mesh questions Sent: 02 Feb '07 13:38 Yaron, Anything is possible :-) and maybe not terribly difficult to get started. You could use DAGetMatrix() to give you the properly pre-allocated "huge" Mat. Have each process loop over the "rectangular portion[s] of the domain" that it mostly owns (that is, if a rectangular portion lies across two processes, just assign it to one of them for this loop). Then loop over the locations inside the rectangular portion calling MatSetValuesStencil() for that row of the huge matrix to put the entries from the smaller matrix INTO the huge matrix using the natural grid i,j coordinates (so you do not have to map the coordinates from the grid location to the location in the matrix). This may require some thought to get right but should require little coding (if you are writing hundreds and hundreds of lines of code then likely something is wrong). Good luck, Barry On Tue, 30 Jan 2007, yaron at oak-research.com wrote: > Barry- > So far I only thought of having a single large sparse matrix. > > Yaron > > > -------Original Message------- > From: Barry Smith > Subject: Re: Non-uniform 2D mesh questions > Sent: 30 Jan '07 10:58 > > > Yaron, > > Do you want to end up generating a single large sparse matrix? Like a > MPIAIJ matrix? Or do you want to somehow not store the entire huge matrix but > still be able to solve with the composed matrix? Or both? > > Barry > > > On Mon, 29 Jan 2007, yaron at oak-research.com wrote: > > > Barry- > > Yes, each block is a rectangular portion of the domain. Not so small > > though (more like 100 x 100 nodes) > > > > Yaron > > > > > > -------Original Message------- > > From: Barry Smith > > Subject: Re: Non-uniform 2D mesh questions > > Sent: 29 Jan '07 19:40 > > > > > > Yaron, > > > > Is each one of these "blocks" a small rectangular part of the > > domain (like a 4 by 5 set of nodes)? I don't understand what you > > want to do. > > > > Barry > > > > > > On Mon, 29 Jan 2007, yaron at oak-research.com wrote: > > > > > Hi all > > > I have a laplace-type problem that's physically built from repeating > > > instances of the same block. > > > I'm creating matrices for the individual blocks, and I'd like to reuse > > > the individual block matrices in order to compose the complete > > problem. > > > (i.e. if there are 10K instances of 20 blocks, I'd like to build 20 > > matrices, > > > then use them to compose the large complete matrix) > > > Is a 2D DA the right object to do that?
And if so, where can I find > a > > > small example of building the DA object in parallel, then using the > > > different (for every instance) mappings of local nodes to global > nodes > > in > > > order to build the complete matrix? > > > > > > > > > Thanks > > > Yaron > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
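The approach Barry sketches above (one DA for the whole tiled domain, DAGetMatrix() for the properly preallocated "huge" Mat, and MatSetValuesStencil() with natural grid i,j indices) might look roughly like the following. This is only a sketch against the 2007-era PETSc 2.3.x DA interface, not code from this thread: the 512x512 global grid size and the block_coeff() lookup (standing in for the admittance entries of a pre-triangulated block) are made-up placeholders.

/* Sketch only: compose per-block coefficients into one "huge" DA matrix
   with MatSetValuesStencil(), using natural grid (i,j) indices.
   PETSc 2.3.x-era interface; GM, GN and block_coeff() are hypothetical. */
#include "petscmat.h"
#include "petscda.h"

#define GM 512
#define GN 512

/* Hypothetical lookup: coupling between node (i,j) and neighbour d
   (0 = centre, 1..4 = west/east/south/north) for the block owning (i,j). */
static PetscScalar block_coeff(PetscInt i,PetscInt j,PetscInt d)
{
  return (d == 0) ? 4.0 : -1.0;  /* placeholder values only */
}

int main(int argc,char **argv)
{
  DA             da;
  Mat            A;
  PetscInt       xs,ys,xm,ym,i,j;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,PETSC_NULL,PETSC_NULL);CHKERRQ(ierr);
  /* one DA spanning the whole tiled domain */
  ierr = DACreate2d(PETSC_COMM_WORLD,DA_NONPERIODIC,DA_STENCIL_STAR,
                    GM,GN,PETSC_DECIDE,PETSC_DECIDE,1,1,
                    PETSC_NULL,PETSC_NULL,&da);CHKERRQ(ierr);
  ierr = DAGetMatrix(da,MATMPIAIJ,&A);CHKERRQ(ierr);  /* preallocated "huge" Mat */

  ierr = DAGetCorners(da,&xs,&ys,PETSC_NULL,&xm,&ym,PETSC_NULL);CHKERRQ(ierr);
  for (j=ys; j<ys+ym; j++) {
    for (i=xs; i<xs+xm; i++) {
      MatStencil  row,col[5];
      PetscScalar v[5];
      PetscInt    nc = 0;
      row.i = i; row.j = j;
      /* centre entry plus whichever neighbours exist, in natural (i,j) indices */
      col[nc].i = i;   col[nc].j = j;   v[nc] = block_coeff(i,j,0); nc++;
      if (i > 0)    { col[nc].i = i-1; col[nc].j = j;   v[nc] = block_coeff(i,j,1); nc++; }
      if (i < GM-1) { col[nc].i = i+1; col[nc].j = j;   v[nc] = block_coeff(i,j,2); nc++; }
      if (j > 0)    { col[nc].i = i;   col[nc].j = j-1; v[nc] = block_coeff(i,j,3); nc++; }
      if (j < GN-1) { col[nc].i = i;   col[nc].j = j+1; v[nc] = block_coeff(i,j,4); nc++; }
      /* ADD_VALUES so contributions from blocks sharing a boundary node accumulate */
      ierr = MatSetValuesStencil(A,1,&row,nc,col,v,ADD_VALUES);CHKERRQ(ierr);
    }
  }
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  ierr = MatDestroy(A);CHKERRQ(ierr);
  ierr = DADestroy(da);CHKERRQ(ierr);
  ierr = PetscFinalize();CHKERRQ(ierr);
  return 0;
}

Whether a single DA fits depends on the blocks really mapping onto a logically rectangular node numbering; if the block interiors stay unstructured, the IS/AO route asked about above is probably closer to what is needed.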
From jinzishuai at yahoo.com Wed Feb 7 10:27:48 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Wed, 7 Feb 2007 08:27:48 -0800 (PST) Subject: PETSc runs slower on a shared memory machine than on a cluster In-Reply-To: Message-ID: <218214.27889.qm@web36214.mail.mud.yahoo.com> Thank you very much, Satish. You are right. From the log_summary, the communication takes slightly more time on the shared-memory machine than on the cluster even after using taskset. This is still hard to understand since I think in-memory operations have to be orders of magnitude faster than network operations (gigabit ethernet). By the way, I took a look at the specs of my shared-memory machine (Sun Fire Server 4600). It seems that each CPU socket has its own DIMMs of RAM. I wonder if there is a speed issue if one has to copy data from the RAM of one CPU to another. Thanks. Shi --- Satish Balay wrote: > A couple of comments: > > - with the dual core opteron - the memorybandwith > per core is now > reduced by half - so the performance suffers. > However memory > bandwidth across CPUs is scalable. [6.4 Gb/s per > each node or 3.2Gb/s > per core] > > - Current generation Intel Core 2 duo appears to > claim having > sufficient bandwidth [15.3Gb/s per node = 7.6Gb/s > per core?]
so from > this bandwidth number - this chip might do better > than the AMD > chip. However I'm not sure if there is a SMP with > this chip - which > has scalable memory system [across say 8 nodes - as > you currently > have..] > > - Older intel SMP boxes has a single memory bank > shared across all the > CPUs [so effective bandwidth per CPU was pretty > small. Optrons' > scalable architecture looked much better than the > older intel SMPs] > > - From previous log_summary - part of the > inefficiency of the SMP box > [when compared to the cluster] was in the MPI > performance. Do you > still see this effect in the '-np 8' runs? If so > this could be the > [part of the] reason for this 30% reduction in > performance. > > Satish > > On Mon, 5 Feb 2007, Shi Jin wrote: > > > Hi there, > > > > I have made some new progress on the issue of SMP > > performance. Since my shared memory machine is a 8 > > dual-core Opteron machine. I think the two cores > on a > > single CPU chip shares the memory bandwidth. > > Therefore, if I can avoid using the same core on > the > > chip, I can get some performance improvement. > Indeed, > > I am able to do this by the linux command taskset. > > > Here is what I did: > > petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 > ../spAF > > This way, I specifically ask the processes to be > run > > on the first core on the CPUs. > > By doing this, my performance is doubled compared > with > > the simple petscmpirun -n 8 ../spAF > > > > So this test shows that we do suffer from the > > competition of resources of multiple processes, > > especially when we use 16 processes. > > > > However, I should point out that even with the > help > > taskset, the shared-memory performance is still > 30% > > less than that on the cluster. > > > > I am not sure whether this problem exists > specifically > > for the AMD machines or it applys to any > shared-memory > > architecture. > > > > Thanks. > > Shi > > > > --- Shi Jin wrote: > > > > > Hi there, > > > > > > I am fairly new to PETSc but have 5 years of MPI > > > programming already. I recently took on a > project of > > > analyzing a finite element code written in C > with > > > PETSc. > > > I found out that on a shared-memory machine > (60GB > > > RAM, > > > 16 CPUS), the code runs around 4 times slower > > > than > > > on a distributed memory cluster (4GB Ram, > > > 4CPU/node), > > > although they yield identical results. > > > There are 1.6Million finite elements in the > problem > > > so > > > it is a fairly large calculation. The total > memory > > > used is 3GBx16=48GB. > > > > > > Both the two systems run Linux as OS and the > same > > > code > > > is compiled against the same version of MPICH-2 > and > > > PETSc. > > > > > > The shared-memory machine is actually a little > > > faster > > > than the cluster machines in terms of single > process > > > runs. > > > > > > I am surprised at this result since we usually > tend > > > to > > > think that shared-memory would be much faster > since > > > the in-memory operation is much faster that the > > > network communication. > > > > > > However, I read the PETSc FAQ and found that > "the > > > speed of sparse matrix computations is almost > > > totally > > > determined by the speed of the memory, not the > speed > > > of the CPU". > > > This makes me wonder whether the poor > performance of > > > my code on a shared-memory machine is due to the > > > competition of different process on the same > memory > > > bus. 
Since the code is still MPI based, a lot of > > > data > > > are moving around inside the memory. Is this a > > > reasonable explanation of what I observed? > > > > > > Thank you very much. > > > > > > Shi > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Do you Yahoo!? > > > Everyone is raving about the all-new Yahoo! Mail > > > beta. > > > http://new.mail.yahoo.com > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > Expecting? Get great news right away with email > Auto-Check. > > Try the Yahoo! Mail Beta. > > > http://advision.webevents.yahoo.com/mailbeta/newmail_tools.html > > > > > > > ____________________________________________________________________________________ Don't get soaked. Take a quick peak at the forecast with the Yahoo! Search weather shortcut. http://tools.search.yahoo.com/shortcuts/#loc_weather From balay at mcs.anl.gov Wed Feb 7 10:58:06 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 7 Feb 2007 10:58:06 -0600 (CST) Subject: PETSc runs slower on a shared memory machine than on a cluster In-Reply-To: <218214.27889.qm@web36214.mail.mud.yahoo.com> References: <218214.27889.qm@web36214.mail.mud.yahoo.com> Message-ID: Can you run the app with the following options [use one per run] - and see if it makes any difference in performance [in VecScatters] -vecscatter_rr -vecscatter_ssend -vecscatter_sendfirst Also - you might want to try using the latest mpich to see if there are any improvements. Regarding the hardware issues - yeah - AMD has a NUMA architecture [i.e access from memory from a different cpu is slower than the memory on the local CPU]. There could also be some OS issues wrt memory layout for MPI messages - or some other contention [perhaps IO interrupts from the OS?] that could be causing the slowdown. All of this is just a guess.. Satish On Wed, 7 Feb 2007, Shi Jin wrote: > Thank you very much, Satish. > You are right. From the log_summary, the communication > takes slightly more time on the shared memory than the > cluster even after using the taskset. > This is still hard to understand since I think > in-memory operations have to been orders of magnitude > faster than network opertations(gigabit ethernet). > > By the way, I took a look my the specs of my > shared-memory machine( Sun Fire Server 4600). > It seems that each CPU socket has its own DIMMS of > RAM. > I wonder if there is a speed issue if one has to copy > data from the RAM of one CPU to another. > > Thanks. > > Shi > --- Satish Balay wrote: > > > A couple of comments: > > > > - with the dual core opteron - the memorybandwith > > per core is now > > reduced by half - so the performance suffers. > > However memory > > bandwidth across CPUs is scalable. [6.4 Gb/s per > > each node or 3.2Gb/s > > per core] > > > > - Current generation Intel Core 2 duo appears to > > claim having > > sufficient bandwidth [15.3Gb/s per node = 7.6Gb/s > > per core?] so from > > this bandwidth number - this chip might do better > > than the AMD > > chip. However I'm not sure if there is a SMP with > > this chip - which > > has scalable memory system [across say 8 nodes - as > > you currently > > have..] > > > > - Older intel SMP boxes has a single memory bank > > shared across all the > > CPUs [so effective bandwidth per CPU was pretty > > small. 
Optrons' > > scalable architecture looked much better than the > > older intel SMPs] > > > > - From previous log_summary - part of the > > inefficiency of the SMP box > > [when compared to the cluster] was in the MPI > > performance. Do you > > still see this effect in the '-np 8' runs? If so > > this could be the > > [part of the] reason for this 30% reduction in > > performance. > > > > Satish > > > > On Mon, 5 Feb 2007, Shi Jin wrote: > > > > > Hi there, > > > > > > I have made some new progress on the issue of SMP > > > performance. Since my shared memory machine is a 8 > > > dual-core Opteron machine. I think the two cores > > on a > > > single CPU chip shares the memory bandwidth. > > > Therefore, if I can avoid using the same core on > > the > > > chip, I can get some performance improvement. > > Indeed, > > > I am able to do this by the linux command taskset. > > > > > Here is what I did: > > > petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 > > ../spAF > > > This way, I specifically ask the processes to be > > run > > > on the first core on the CPUs. > > > By doing this, my performance is doubled compared > > with > > > the simple petscmpirun -n 8 ../spAF > > > > > > So this test shows that we do suffer from the > > > competition of resources of multiple processes, > > > especially when we use 16 processes. > > > > > > However, I should point out that even with the > > help > > > taskset, the shared-memory performance is still > > 30% > > > less than that on the cluster. > > > > > > I am not sure whether this problem exists > > specifically > > > for the AMD machines or it applys to any > > shared-memory > > > architecture. > > > > > > Thanks. > > > Shi > > > > > > --- Shi Jin wrote: > > > > > > > Hi there, > > > > > > > > I am fairly new to PETSc but have 5 years of MPI > > > > programming already. I recently took on a > > project of > > > > analyzing a finite element code written in C > > with > > > > PETSc. > > > > I found out that on a shared-memory machine > > (60GB > > > > RAM, > > > > 16 CPUS), the code runs around 4 times slower > > > > than > > > > on a distributed memory cluster (4GB Ram, > > > > 4CPU/node), > > > > although they yield identical results. > > > > There are 1.6Million finite elements in the > > problem > > > > so > > > > it is a fairly large calculation. The total > > memory > > > > used is 3GBx16=48GB. > > > > > > > > Both the two systems run Linux as OS and the > > same > > > > code > > > > is compiled against the same version of MPICH-2 > > and > > > > PETSc. > > > > > > > > The shared-memory machine is actually a little > > > > faster > > > > than the cluster machines in terms of single > > process > > > > runs. > > > > > > > > I am surprised at this result since we usually > > tend > > > > to > > > > think that shared-memory would be much faster > > since > > > > the in-memory operation is much faster that the > > > > network communication. > > > > > > > > However, I read the PETSc FAQ and found that > > "the > > > > speed of sparse matrix computations is almost > > > > totally > > > > determined by the speed of the memory, not the > > speed > > > > of the CPU". > > > > This makes me wonder whether the poor > > performance of > > > > my code on a shared-memory machine is due to the > > > > competition of different process on the same > > memory > > > > bus. Since the code is still MPI based, a lot of > > > > data > > > > are moving around inside the memory. Is this a > > > > reasonable explanation of what I observed? 
> > > > > > > > Thank you very much. > > > > > > > > Shi > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > Do you Yahoo!? > > > > Everyone is raving about the all-new Yahoo! Mail > > > > beta. > > > > http://new.mail.yahoo.com > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Expecting? Get great news right away with email > > Auto-Check. > > > Try the Yahoo! Mail Beta. > > > > > > http://advision.webevents.yahoo.com/mailbeta/newmail_tools.html > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > Don't get soaked. Take a quick peak at the forecast > with the Yahoo! Search weather shortcut. > http://tools.search.yahoo.com/shortcuts/#loc_weather > > From zonexo at gmail.com Thu Feb 8 09:47:24 2007 From: zonexo at gmail.com (Ben Tay) Date: Thu, 8 Feb 2007 23:47:24 +0800 Subject: understanding the output from -info Message-ID: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> Hi, i'm trying to solve my cfd code using PETSc in parallel. Besides the linear eqns for PETSc, other parts of the code has also been parallelized using MPI. however i find that the parallel version of the code running on 4 processors is even slower than the sequential version. in order to find out why, i've used the -info option to print out the details. there are 2 linear equations being solved - momentum and poisson. the momentum one is twice the size of the poisson. it is shown below: [0] User provided function(): (Fortran):PETSc successfully started: procs 4 [1] User provided function(): (Fortran):PETSc successfully started: procs 4 [3] User provided function(): (Fortran):PETSc successfully started: procs 4 [2] User provided function(): (Fortran):PETSc successfully started: procs 4 [0] PetscGetHostName(): Rejecting domainname, likely is NIS atlas2-c12.(none) [0] User provided function(): Running on machine: atlas2-c12 [1] PetscGetHostName(): Rejecting domainname, likely is NIS atlas2-c12.(none) [1] User provided function(): Running on machine: atlas2-c12 [3] PetscGetHostName(): Rejecting domainname, likely is NIS atlas2-c08.(none) [3] User provided function(): Running on machine: atlas2-c08 [2] PetscGetHostName(): Rejecting domainname, likely is NIS atlas2-c08.(none) [2] User provided function(): Running on machine: atlas2-c08 [0] PetscCommDuplicate(): Duplicating a communicator 91 141 max tags = 1073741823 [1] PetscCommDuplicate(): Duplicating a communicator 91 141 max tags = 1073741823 [2] PetscCommDuplicate(): Duplicating a communicator 91 141 max tags = 1073741823 [3] PetscCommDuplicate(): Duplicating a communicator 91 141 max tags = 1073741823 [0] PetscCommDuplicate(): Duplicating a communicator 92 143 max tags = 1073741823 [2] PetscCommDuplicate(): Duplicating a communicator 92 143 max tags = 1073741823 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [0] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [1] PetscCommDuplicate(): Duplicating a communicator 92 143 max tags = 1073741823 [3] PetscCommDuplicate(): Duplicating a communicator 92 143 max tags = 1073741823 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 
143 [1] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [3] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 0 3200 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 3200 6400 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 6400 9600 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 9600 12800 [3] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [0] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [1] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [2] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [2] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [0] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [3] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [1] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [1] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [1] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [1] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [3] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [1] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [3] PetscCommDuplicate(): Using internal PETSc communicator 91 141 3200 6400 [3] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [3] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [3] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [3] PetscCommDuplicate(): Using internal PETSc communicator 91 141 9600 12800 [0] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [2] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [0] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [2] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [0] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [2] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [0] PetscCommDuplicate(): Using internal PETSc communicator 91 141 [2] PetscCommDuplicate(): Using internal PETSc communicator 91 141 0 3200 6400 9600 [1] MatStashScatterBegin_Private(): No of messages: 0 [1] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs. [3] MatStashScatterBegin_Private(): No of messages: 0 [3] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs. [3] MatAssemblyEnd_SeqAIJ(): Matrix size: 3200 X 3200; storage space: 4064 unneeded,53536 used [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 3200 X 3200; storage space: 4064 unneeded,53536 used [3] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [3] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 18 [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 18 [3] Mat_CheckInode(): Found 1600 nodes of 3200. Limit used: 5. Using Inode routines [1] Mat_CheckInode(): Found 1600 nodes of 3200. Limit used: 5. Using Inode routines [0] MatStashScatterBegin_Private(): No of messages: 0 [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs. 
[2] MatStashScatterBegin_Private(): No of messages: 0 [2] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs. [2] MatAssemblyEnd_SeqAIJ(): Matrix size: 3200 X 3200; storage space: 4064 unneeded,53536 used [2] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [2] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 18 [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 3200 X 3200; storage space: 3120 unneeded,54480 used [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 18 [2] Mat_CheckInode(): Found 1600 nodes of 3200. Limit used: 5. Using Inode routines [0] Mat_CheckInode(): Found 1600 nodes of 3200. Limit used: 5. Using Inode routines [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] MatSetUpMultiply_MPIAIJ(): Using block index set to define scatter [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter [1] MatSetOption_Inode(): Not using Inode routines due to MatSetOption(MAT_DO_NOT_USE_INODES [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 3200 X 640; storage space: 53776 unneeded,3824 used [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 6 [1] Mat_CheckCompressedRow(): Found the ratio (num_zerorows 2560)/(num_localrows 3200) > 0.6. Use CompressedRow routines. [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter [0] VecScatterCreate(): General case: MPI to Seq [0] MatSetOption_Inode(): Not using Inode routines due to MatSetOption(MAT_DO_NOT_USE_INODES [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 3200 X 320; storage space: 55688 unneeded,1912 used [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 6 [0] Mat_CheckCompressedRow(): Found the ratio (num_zerorows 2880)/(num_localrows 3200) > 0.6. Use CompressedRow routines. [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 6 [0] Mat_CheckCompressedRow(): Found the ratio (num_zerorows 2880)/(num_localrows 3200) > 0.6. Use CompressedRow routines. [3] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter [3] MatSetOption_Inode(): Not using Inode routines due to MatSetOption(MAT_DO_NOT_USE_INODES [3] MatAssemblyEnd_SeqAIJ(): Matrix size: 3200 X 320; storage space: 55688 unneeded,1912 used [3] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [3] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 6 [3] Mat_CheckCompressedRow(): Found the ratio (num_zerorows 2880)/(num_localrows 3200) > 0.6. Use CompressedRow routines. 
[2] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter [2] MatSetOption_Inode(): Not using Inode routines due to MatSetOption(MAT_DO_NOT_USE_INODES [2] MatAssemblyEnd_SeqAIJ(): Matrix size: 3200 X 640; storage space: 53776 unneeded,3824 used [2] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [2] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 6 [2] Mat_CheckCompressedRow(): Found the ratio (num_zerorows 2560)/(num_localrows 3200) > 0.6. Use CompressedRow routines. [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [2] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [2] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [1] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [1] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [3] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [3] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [1] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [1] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [3] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [3] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [2] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [2] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [0] PCSetUp(): Setting up new PC [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PCSetUp(): Setting up new PC [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PCSetUp(): Setting up new PC [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc 
communicator 92 143 [3] PCSetUp(): Setting up new PC [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PCSetUp(): Setting up new PC [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] KSPDefaultConverged(): user has provided nonzero initial guess, computing 2-norm of preconditioned RHS [0] KSPDefaultConverged(): Linear solver has converged. Residual norm 1.00217e-05 is less than relative tolerance 1e-05 times initial right hand side norm 6.98447 at iteration 5 [0] MatStashScatterBegin_Private(): No of messages: 0 [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs. [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 1600 X 1600; storage space: 774 unneeded,13626 used [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 9 [0] Mat_CheckInode(): Found 1600 nodes out of 1600 rows. Not using Inode routines [1] MatStashScatterBegin_Private(): No of messages: 0 [1] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs. [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 1600 X 1600; storage space: 1016 unneeded,13384 used [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 9 [1] Mat_CheckInode(): Found 1600 nodes out of 1600 rows. Not using Inode routines [2] MatStashScatterBegin_Private(): No of messages: 0 [2] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs. [2] MatAssemblyEnd_SeqAIJ(): Matrix size: 1600 X 1600; storage space: 1016 unneeded,13384 used [2] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [2] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 9 [2] Mat_CheckInode(): Found 1600 nodes out of 1600 rows. Not using Inode routines [3] MatStashScatterBegin_Private(): No of messages: 0 [3] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs. [3] MatAssemblyEnd_SeqAIJ(): Matrix size: 1600 X 1600; storage space: 1016 unneeded,13384 used [3] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [3] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 9 [3] Mat_CheckInode(): Found 1600 nodes out of 1600 rows. 
Not using Inode routines [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] MatSetUpMultiply_MPIAIJ(): Using block index set to define scatter [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter [2] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter [2] MatSetOption_Inode(): Not using Inode routines due to MatSetOption(MAT_DO_NOT_USE_INODES [2] MatAssemblyEnd_SeqAIJ(): Matrix size: 1600 X 320; storage space: 13444 unneeded,956 used [2] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [0] VecScatterCreate(): General case: MPI to Seq [2] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 3 [2] Mat_CheckCompressedRow(): Found the ratio (num_zerorows 1280)/(num_localrows 1600) > 0.6. Use CompressedRow routines. [0] MatSetOption_Inode(): Not using Inode routines due to MatSetOption(MAT_DO_NOT_USE_INODES [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 1600 X 160; storage space: 13922 unneeded,478 used [0] MatSetOption_Inode(): Not using Inode routines due to MatSetOption(MAT_DO_NOT_USE_INODES [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 1600 X 160; storage space: 13922 unneeded,478 used [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 3 [0] Mat_CheckCompressedRow(): Found the ratio (num_zerorows 1440)/(num_localrows 1600) > 0.6. Use CompressedRow routines. [3] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter [3] MatSetOption_Inode(): Not using Inode routines due to MatSetOption(MAT_DO_NOT_USE_INODES [3] MatAssemblyEnd_SeqAIJ(): Matrix size: 1600 X 160; storage space: 13922 unneeded,478 used [3] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [3] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 3 [3] Mat_CheckCompressedRow(): Found the ratio (num_zerorows 1440)/(num_localrows 1600) > 0.6. Use CompressedRow routines. [1] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter [1] MatSetOption_Inode(): Not using Inode routines due to MatSetOption(MAT_DO_NOT_USE_INODES [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 1600 X 320; storage space: 13444 unneeded,956 used [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0 [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 3 [1] Mat_CheckCompressedRow(): Found the ratio (num_zerorows 1280)/(num_localrows 1600) > 0.6. Use CompressedRow routines. [1] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [3] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [1] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [3] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [2] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [2] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. 
[2] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [2] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [1] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [3] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs. [3] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [1] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs. [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PCSetUp(): Setting up new PC [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [3] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PCSetUp(): Setting up new PC [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [1] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PCSetUp(): Setting up new PC [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [2] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PCSetUp(): Setting up new PC [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PCSetUp(): Setting up new PC [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PCSetUp(): Setting up new PC [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] PetscCommDuplicate(): Using internal PETSc communicator 92 143 [0] KSPDefaultConverged(): Linear solver has converged. Residual norm 8.84097e-05 is less than relative tolerance 1e-05 times initial right hand side norm 8.96753 at iteration 212 1 1.000000000000000E-002 1.15678640520876 0.375502846664950 i saw some statements stating "seq". am i running in sequential or parallel mode? have i preallocated too much space? lastly, if Ax=b, A_sta and A_end from MatGetOwnershipRange and b_sta and b_end from VecGetOwnershipRange should always be the same value, right? Thank you. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dalcinl at gmail.com Thu Feb 8 10:50:17 2007 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Thu, 8 Feb 2007 13:50:17 -0300 Subject: understanding the output from -info In-Reply-To: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> Message-ID: On 2/8/07, Ben Tay wrote: > i'm trying to solve my cfd code using PETSc in parallel. Besides the linear > eqns for PETSc, other parts of the code has also been parallelized using > MPI. Finite elements or finite differences, or what? > however i find that the parallel version of the code running on 4 processors > is even slower than the sequential version. Can you monitor the convergence and iteration count of momentum and poisson steps? > in order to find out why, i've used the -info option to print out the > details. there are 2 linear equations being solved - momentum and poisson. > the momentum one is twice the size of the poisson. it is shown below: Can you use -log_summary command line option and send the output attached? > i saw some statements stating "seq". am i running in sequential or parallel > mode? have i preallocated too much space? It seems you are running in parallel. The "Seq" are related to local, internal objects. In PETSc, parallel matrices have inner sequential matrices. > lastly, if Ax=b, A_sta and A_end from MatGetOwnershipRange and b_sta and > b_end from VecGetOwnershipRange should always be the same value, right? I should. If not, you are likely going to get an runtime error. Regards, -- Lisandro Dalc?n --------------- Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) PTLC - G?emes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594 From zonexo at gmail.com Fri Feb 9 06:34:34 2007 From: zonexo at gmail.com (Ben Tay) Date: Fri, 9 Feb 2007 20:34:34 +0800 Subject: understanding the output from -info In-Reply-To: References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> Message-ID: <804ab5d40702090434w4f0674e6s1c936cb410f3744a@mail.gmail.com> Hi, I've tried to use log_summary but nothing came out? Did I miss out something? It worked when I used -info... On 2/9/07, Lisandro Dalcin wrote: > > On 2/8/07, Ben Tay wrote: > > i'm trying to solve my cfd code using PETSc in parallel. Besides the > linear > > eqns for PETSc, other parts of the code has also been parallelized using > > MPI. > > Finite elements or finite differences, or what? > > > however i find that the parallel version of the code running on 4 > processors > > is even slower than the sequential version. > > Can you monitor the convergence and iteration count of momentum and > poisson steps? > > > > in order to find out why, i've used the -info option to print out the > > details. there are 2 linear equations being solved - momentum and > poisson. > > the momentum one is twice the size of the poisson. it is shown below: > > Can you use -log_summary command line option and send the output attached? > > > i saw some statements stating "seq". am i running in sequential or > parallel > > mode? have i preallocated too much space? > > It seems you are running in parallel. The "Seq" are related to local, > internal objects. In PETSc, parallel matrices have inner sequential > matrices. 
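The "Seq" matrices mentioned just above are the two sequential blocks every MPIAIJ matrix keeps per process: a diagonal block for the locally owned columns and an off-diagonal block for columns owned by other processes. The "storage space: ... unneeded, ... used" lines in the -info output report how much of the preallocated space went unused. A rough sketch of preallocating such a matrix explicitly, with per-row estimates read loosely off the log above (3200 local rows, at most 18 nonzeros per row in the diagonal block and 6 in the off-diagonal block); illustrative only, not code from this thread:

/* Sketch: preallocate an MPIAIJ matrix so MatSetValues() causes no mallocs.
   d_nz/o_nz are per-row upper bounds for the diagonal and off-diagonal blocks;
   the numbers below are only rough guesses taken from the -info output. */
#include "petscmat.h"

int main(int argc,char **argv)
{
  Mat            A;
  PetscInt       mlocal = 3200;  /* locally owned rows/columns */
  PetscInt       d_nz   = 18;    /* nonzeros per row coupling local columns */
  PetscInt       o_nz   = 6;     /* nonzeros per row coupling off-process columns */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,PETSC_NULL,PETSC_NULL);CHKERRQ(ierr);
  ierr = MatCreateMPIAIJ(PETSC_COMM_WORLD,mlocal,mlocal,PETSC_DETERMINE,PETSC_DETERMINE,
                         d_nz,PETSC_NULL,o_nz,PETSC_NULL,&A);CHKERRQ(ierr);
  /* ... MatSetValues() / MatAssemblyBegin() / MatAssemblyEnd() as usual ... */
  ierr = MatDestroy(A);CHKERRQ(ierr);
  ierr = PetscFinalize();CHKERRQ(ierr);
  return 0;
}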
> > > lastly, if Ax=b, A_sta and A_end from MatGetOwnershipRange and b_sta > and > > b_end from VecGetOwnershipRange should always be the same value, right? > > I should. If not, you are likely going to get an runtime error. > > Regards, > > -- > Lisandro Dalc?n > --------------- > Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) > Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > Tel/Fax: +54-(0)342-451.1594 > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at mcs.anl.gov Fri Feb 9 08:01:09 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 9 Feb 2007 08:01:09 -0600 (CST) Subject: understanding the output from -info In-Reply-To: <804ab5d40702090434w4f0674e6s1c936cb410f3744a@mail.gmail.com> References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090434w4f0674e6s1c936cb410f3744a@mail.gmail.com> Message-ID: -log_summary On Fri, 9 Feb 2007, Ben Tay wrote: > Hi, > > I've tried to use log_summary but nothing came out? Did I miss out > something? It worked when I used -info... > > > On 2/9/07, Lisandro Dalcin wrote: > > > > On 2/8/07, Ben Tay wrote: > > > i'm trying to solve my cfd code using PETSc in parallel. Besides the > > linear > > > eqns for PETSc, other parts of the code has also been parallelized using > > > MPI. > > > > Finite elements or finite differences, or what? > > > > > however i find that the parallel version of the code running on 4 > > processors > > > is even slower than the sequential version. > > > > Can you monitor the convergence and iteration count of momentum and > > poisson steps? > > > > > > > in order to find out why, i've used the -info option to print out the > > > details. there are 2 linear equations being solved - momentum and > > poisson. > > > the momentum one is twice the size of the poisson. it is shown below: > > > > Can you use -log_summary command line option and send the output attached? > > > > > i saw some statements stating "seq". am i running in sequential or > > parallel > > > mode? have i preallocated too much space? > > > > It seems you are running in parallel. The "Seq" are related to local, > > internal objects. In PETSc, parallel matrices have inner sequential > > matrices. > > > > > lastly, if Ax=b, A_sta and A_end from MatGetOwnershipRange and b_sta > > and > > > b_end from VecGetOwnershipRange should always be the same value, right? > > > > I should. If not, you are likely going to get an runtime error. > > > > Regards, > > > > -- > > Lisandro Dalc??n > > --------------- > > Centro Internacional de M??todos Computacionales en Ingenier??a (CIMEC) > > Instituto de Desarrollo Tecnol??gico para la Industria Qu??mica (INTEC) > > Consejo Nacional de Investigaciones Cient??ficas y T??cnicas (CONICET) > > PTLC - G??emes 3450, (3000) Santa Fe, Argentina > > Tel/Fax: +54-(0)342-451.1594 > > > > > From zonexo at gmail.com Fri Feb 9 08:20:47 2007 From: zonexo at gmail.com (Ben Tay) Date: Fri, 9 Feb 2007 22:20:47 +0800 Subject: understanding the output from -info In-Reply-To: References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090434w4f0674e6s1c936cb410f3744a@mail.gmail.com> Message-ID: <804ab5d40702090620u5cf86c51s4e1b7b724eaf4f98@mail.gmail.com> ya, i did use -log_summary. but no output..... 
On 2/9/07, Barry Smith wrote: > > > -log_summary > > > On Fri, 9 Feb 2007, Ben Tay wrote: > > > Hi, > > > > I've tried to use log_summary but nothing came out? Did I miss out > > something? It worked when I used -info... > > > > > > On 2/9/07, Lisandro Dalcin wrote: > > > > > > On 2/8/07, Ben Tay wrote: > > > > i'm trying to solve my cfd code using PETSc in parallel. Besides the > > > linear > > > > eqns for PETSc, other parts of the code has also been parallelized > using > > > > MPI. > > > > > > Finite elements or finite differences, or what? > > > > > > > however i find that the parallel version of the code running on 4 > > > processors > > > > is even slower than the sequential version. > > > > > > Can you monitor the convergence and iteration count of momentum and > > > poisson steps? > > > > > > > > > > in order to find out why, i've used the -info option to print out > the > > > > details. there are 2 linear equations being solved - momentum and > > > poisson. > > > > the momentum one is twice the size of the poisson. it is shown > below: > > > > > > Can you use -log_summary command line option and send the output > attached? > > > > > > > i saw some statements stating "seq". am i running in sequential or > > > parallel > > > > mode? have i preallocated too much space? > > > > > > It seems you are running in parallel. The "Seq" are related to local, > > > internal objects. In PETSc, parallel matrices have inner sequential > > > matrices. > > > > > > > lastly, if Ax=b, A_sta and A_end from MatGetOwnershipRange and > b_sta > > > and > > > > b_end from VecGetOwnershipRange should always be the same value, > right? > > > > > > I should. If not, you are likely going to get an runtime error. > > > > > > Regards, > > > > > > -- > > > Lisandro Dalc?n > > > --------------- > > > Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) > > > Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) > > > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) > > > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > > > Tel/Fax: +54-(0)342-451.1594 > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Fri Feb 9 08:59:16 2007 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 9 Feb 2007 08:59:16 -0600 Subject: understanding the output from -info In-Reply-To: <804ab5d40702090620u5cf86c51s4e1b7b724eaf4f98@mail.gmail.com> References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090434w4f0674e6s1c936cb410f3744a@mail.gmail.com> <804ab5d40702090620u5cf86c51s4e1b7b724eaf4f98@mail.gmail.com> Message-ID: Impossible, please check the spelling, and make sure your command line was not truncated. Matt On 2/9/07, Ben Tay wrote: > > ya, i did use -log_summary. but no output..... > > On 2/9/07, Barry Smith wrote: > > > > > > -log_summary > > > > > > On Fri, 9 Feb 2007, Ben Tay wrote: > > > > > Hi, > > > > > > I've tried to use log_summary but nothing came out? Did I miss out > > > something? It worked when I used -info... > > > > > > > > > On 2/9/07, Lisandro Dalcin wrote: > > > > > > > > On 2/8/07, Ben Tay < zonexo at gmail.com> wrote: > > > > > i'm trying to solve my cfd code using PETSc in parallel. Besides > > the > > > > linear > > > > > eqns for PETSc, other parts of the code has also been parallelized > > using > > > > > MPI. > > > > > > > > Finite elements or finite differences, or what? 
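One more thing worth checking when -log_summary appears to print nothing: the summary table is produced inside PetscFinalize(), so a run that stops, aborts, or calls MPI_Finalize() before reaching that call will never show it, even though -info (which prints as it goes) still works. A minimal sketch of the required structure, in C for brevity (the Fortran interface follows the same pattern):

/* Sketch: -log_summary writes its table from inside PetscFinalize(), so the
   option only produces output if the program actually reaches this call. */
#include "petsc.h"

int main(int argc,char **argv)
{
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,PETSC_NULL,PETSC_NULL);CHKERRQ(ierr);
  /* ... assemble and solve as usual ... */
  ierr = PetscFinalize();CHKERRQ(ierr);  /* the -log_summary table is printed here */
  return 0;
}

Run as, for example, petscmpirun -n 4 ./code -log_summary, all on one line, so the option is not lost to a truncated command line (the point the next message suggests checking).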
> > > > > > > > > however i find that the parallel version of the code running on 4 > > > > processors > > > > > is even slower than the sequential version. > > > > > > > > Can you monitor the convergence and iteration count of momentum and > > > > poisson steps? > > > > > > > > > > > > > in order to find out why, i've used the -info option to print out > > the > > > > > details. there are 2 linear equations being solved - momentum and > > > > poisson. > > > > > the momentum one is twice the size of the poisson. it is shown > > below: > > > > > > > > Can you use -log_summary command line option and send the output > > attached? > > > > > > > > > i saw some statements stating "seq". am i running in sequential or > > > > parallel > > > > > mode? have i preallocated too much space? > > > > > > > > It seems you are running in parallel. The "Seq" are related to > > local, > > > > internal objects. In PETSc, parallel matrices have inner sequential > > > > matrices. > > > > > > > > > lastly, if Ax=b, A_sta and A_end from MatGetOwnershipRange and > > b_sta > > > > and > > > > > b_end from VecGetOwnershipRange should always be the same value, > > right? > > > > > > > > I should. If not, you are likely going to get an runtime error. > > > > > > > > Regards, > > > > > > > > -- > > > > Lisandro Dalc?n > > > > --------------- > > > > Centro Internacional de M?todos Computacionales en Ingenier?a > > (CIMEC) > > > > Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica > > (INTEC) > > > > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) > > > > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > > > > Tel/Fax: +54-(0)342-451.1594 > > > > > > > > > > > > > > -- One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. -- Drummond Rennie -------------- next part -------------- An HTML attachment was scrubbed... URL: From zonexo at gmail.com Fri Feb 9 09:24:07 2007 From: zonexo at gmail.com (Ben Tay) Date: Fri, 9 Feb 2007 23:24:07 +0800 Subject: understanding the output from -info In-Reply-To: References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090434w4f0674e6s1c936cb410f3744a@mail.gmail.com> <804ab5d40702090620u5cf86c51s4e1b7b724eaf4f98@mail.gmail.com> Message-ID: <804ab5d40702090724n73db6f8w574622903161eb4a@mail.gmail.com> Well, I don't know what's wrong. I did the same thing for -info and it worked. Anyway, is there any other way? Like I can use -mat_view or call matview( ... ) to view a matrix. Is there a similar subroutine for me to call? Thank you. On 2/9/07, Matthew Knepley wrote: > > Impossible, please check the spelling, and make sure your > command line was not truncated. > > Matt > > On 2/9/07, Ben Tay < zonexo at gmail.com> wrote: > > > > ya, i did use -log_summary. but no output..... > > > > On 2/9/07, Barry Smith wrote: > > > > > > > > > -log_summary > > > > > > > > > On Fri, 9 Feb 2007, Ben Tay wrote: > > > > > > > Hi, > > > > > > > > I've tried to use log_summary but nothing came out? 
Did I miss out > > > > something? It worked when I used -info... > > > > > > > > > > > > On 2/9/07, Lisandro Dalcin wrote: > > > > > > > > > > On 2/8/07, Ben Tay < zonexo at gmail.com> wrote: > > > > > > i'm trying to solve my cfd code using PETSc in parallel. Besides > > > the > > > > > linear > > > > > > eqns for PETSc, other parts of the code has also been > > > parallelized using > > > > > > MPI. > > > > > > > > > > Finite elements or finite differences, or what? > > > > > > > > > > > however i find that the parallel version of the code running on > > > 4 > > > > > processors > > > > > > is even slower than the sequential version. > > > > > > > > > > Can you monitor the convergence and iteration count of momentum > > > and > > > > > poisson steps? > > > > > > > > > > > > > > > > in order to find out why, i've used the -info option to print > > > out the > > > > > > details. there are 2 linear equations being solved - momentum > > > and > > > > > poisson. > > > > > > the momentum one is twice the size of the poisson. it is shown > > > below: > > > > > > > > > > Can you use -log_summary command line option and send the output > > > attached? > > > > > > > > > > > i saw some statements stating "seq". am i running in sequential > > > or > > > > > parallel > > > > > > mode? have i preallocated too much space? > > > > > > > > > > It seems you are running in parallel. The "Seq" are related to > > > local, > > > > > internal objects. In PETSc, parallel matrices have inner > > > sequential > > > > > matrices. > > > > > > > > > > > lastly, if Ax=b, A_sta and A_end from MatGetOwnershipRange and > > > b_sta > > > > > and > > > > > > b_end from VecGetOwnershipRange should always be the same value, > > > right? > > > > > > > > > > I should. If not, you are likely going to get an runtime error. > > > > > > > > > > Regards, > > > > > > > > > > -- > > > > > Lisandro Dalc?n > > > > > --------------- > > > > > Centro Internacional de M?todos Computacionales en Ingenier?a > > > (CIMEC) > > > > > Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica > > > (INTEC) > > > > > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas > > > (CONICET) > > > > > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > > > > > Tel/Fax: +54-(0)342-451.1594 > > > > > > > > > > > > > > > > > > > > > > > -- > One trouble is that despite this system, anyone who reads journals widely > and critically is forced to realize that there are scarcely any bars to > eventual > publication. There seems to be no study too fragmented, no hypothesis too > trivial, no literature citation too biased or too egotistical, no design > too > warped, no methodology too bungled, no presentation of results too > inaccurate, too obscure, and too contradictory, no analysis too > self-serving, > no argument too circular, no conclusions too trifling or too unjustified, > and > no grammar and syntax too offensive for a paper to end up in print. -- > Drummond Rennie -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From knepley at gmail.com Fri Feb 9 09:27:30 2007 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 9 Feb 2007 09:27:30 -0600 Subject: understanding the output from -info In-Reply-To: <804ab5d40702090724n73db6f8w574622903161eb4a@mail.gmail.com> References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090434w4f0674e6s1c936cb410f3744a@mail.gmail.com> <804ab5d40702090620u5cf86c51s4e1b7b724eaf4f98@mail.gmail.com> <804ab5d40702090724n73db6f8w574622903161eb4a@mail.gmail.com> Message-ID: Problems do not go away by ignoring them. Something is wrong here, and it may affect the rest of your program. Please try to run an example: cd src/ksp/ksp/examples/tutorials make ex2 ./ex2 -log_summary Matt On 2/9/07, Ben Tay wrote: > > Well, I don't know what's wrong. I did the same thing for -info and it > worked. Anyway, is there any other way? > > Like I can use -mat_view or call matview( ... ) to view a matrix. Is there > a similar subroutine for me to call? > > Thank you. > > > On 2/9/07, Matthew Knepley wrote: > > > > Impossible, please check the spelling, and make sure your > > command line was not truncated. > > > > Matt > > > > On 2/9/07, Ben Tay < zonexo at gmail.com> wrote: > > > > > > ya, i did use -log_summary. but no output..... > > > > > > On 2/9/07, Barry Smith wrote: > > > > > > > > > > > > -log_summary > > > > > > > > > > > > On Fri, 9 Feb 2007, Ben Tay wrote: > > > > > > > > > Hi, > > > > > > > > > > I've tried to use log_summary but nothing came out? Did I miss out > > > > > > > > > something? It worked when I used -info... > > > > > > > > > > > > > > > On 2/9/07, Lisandro Dalcin wrote: > > > > > > > > > > > > On 2/8/07, Ben Tay < zonexo at gmail.com> wrote: > > > > > > > i'm trying to solve my cfd code using PETSc in parallel. > > > > Besides the > > > > > > linear > > > > > > > eqns for PETSc, other parts of the code has also been > > > > parallelized using > > > > > > > MPI. > > > > > > > > > > > > Finite elements or finite differences, or what? > > > > > > > > > > > > > however i find that the parallel version of the code running > > > > on 4 > > > > > > processors > > > > > > > is even slower than the sequential version. > > > > > > > > > > > > Can you monitor the convergence and iteration count of momentum > > > > and > > > > > > poisson steps? > > > > > > > > > > > > > > > > > > > in order to find out why, i've used the -info option to print > > > > out the > > > > > > > details. there are 2 linear equations being solved - momentum > > > > and > > > > > > poisson. > > > > > > > the momentum one is twice the size of the poisson. it is shown > > > > below: > > > > > > > > > > > > Can you use -log_summary command line option and send the output > > > > attached? > > > > > > > > > > > > > i saw some statements stating "seq". am i running in > > > > sequential or > > > > > > parallel > > > > > > > mode? have i preallocated too much space? > > > > > > > > > > > > It seems you are running in parallel. The "Seq" are related to > > > > local, > > > > > > internal objects. In PETSc, parallel matrices have inner > > > > sequential > > > > > > matrices. > > > > > > > > > > > > > lastly, if Ax=b, A_sta and A_end from MatGetOwnershipRange > > > > and b_sta > > > > > > and > > > > > > > b_end from VecGetOwnershipRange should always be the same > > > > value, right? > > > > > > > > > > > > I should. If not, you are likely going to get an runtime error. 
> > > > > > > > > > > > Regards, > > > > > > > > > > > > -- > > > > > > Lisandro Dalc?n > > > > > > --------------- > > > > > > Centro Internacional de M?todos Computacionales en Ingenier?a > > > > (CIMEC) > > > > > > Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica > > > > (INTEC) > > > > > > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas > > > > (CONICET) > > > > > > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > > > > > > Tel/Fax: +54-(0)342-451.1594 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > One trouble is that despite this system, anyone who reads journals > > widely > > and critically is forced to realize that there are scarcely any bars to > > eventual > > publication. There seems to be no study too fragmented, no hypothesis > > too > > trivial, no literature citation too biased or too egotistical, no design > > too > > warped, no methodology too bungled, no presentation of results too > > inaccurate, too obscure, and too contradictory, no analysis too > > self-serving, > > no argument too circular, no conclusions too trifling or too > > unjustified, and > > no grammar and syntax too offensive for a paper to end up in print. -- > > Drummond Rennie > > > -- One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. -- Drummond Rennie -------------- next part -------------- An HTML attachment was scrubbed... URL: From zonexo at gmail.com Fri Feb 9 10:16:56 2007 From: zonexo at gmail.com (Ben Tay) Date: Sat, 10 Feb 2007 00:16:56 +0800 Subject: understanding the output from -info In-Reply-To: References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090434w4f0674e6s1c936cb410f3744a@mail.gmail.com> <804ab5d40702090620u5cf86c51s4e1b7b724eaf4f98@mail.gmail.com> <804ab5d40702090724n73db6f8w574622903161eb4a@mail.gmail.com> Message-ID: <804ab5d40702090816qb6d1325g1d311a0eb53eec26@mail.gmail.com> ops.... it worked for ex2 and ex2f ;-) so what could be wrong? is there some commands or subroutine which i must call? btw, i'm programming in fortran. thank you. On 2/9/07, Matthew Knepley wrote: > > Problems do not go away by ignoring them. Something is wrong here, and it > may > affect the rest of your program. Please try to run an example: > > cd src/ksp/ksp/examples/tutorials > make ex2 > ./ex2 -log_summary > > Matt > > On 2/9/07, Ben Tay wrote: > > > > Well, I don't know what's wrong. I did the same thing for -info and it > > worked. Anyway, is there any other way? > > > > Like I can use -mat_view or call matview( ... ) to view a matrix. Is > > there a similar subroutine for me to call? > > > > Thank you. > > > > > > On 2/9/07, Matthew Knepley wrote: > > > > > > Impossible, please check the spelling, and make sure your > > > command line was not truncated. > > > > > > Matt > > > > > > On 2/9/07, Ben Tay < zonexo at gmail.com> wrote: > > > > > > > > ya, i did use -log_summary. but no output..... 
> > > > > > > > On 2/9/07, Barry Smith wrote: > > > > > > > > > > > > > > > -log_summary > > > > > > > > > > > > > > > On Fri, 9 Feb 2007, Ben Tay wrote: > > > > > > > > > > > Hi, > > > > > > > > > > > > I've tried to use log_summary but nothing came out? Did I miss > > > > > out > > > > > > something? It worked when I used -info... > > > > > > > > > > > > > > > > > > On 2/9/07, Lisandro Dalcin wrote: > > > > > > > > > > > > > > On 2/8/07, Ben Tay < zonexo at gmail.com> wrote: > > > > > > > > i'm trying to solve my cfd code using PETSc in parallel. > > > > > Besides the > > > > > > > linear > > > > > > > > eqns for PETSc, other parts of the code has also been > > > > > parallelized using > > > > > > > > MPI. > > > > > > > > > > > > > > Finite elements or finite differences, or what? > > > > > > > > > > > > > > > however i find that the parallel version of the code running > > > > > on 4 > > > > > > > processors > > > > > > > > is even slower than the sequential version. > > > > > > > > > > > > > > Can you monitor the convergence and iteration count of > > > > > momentum and > > > > > > > poisson steps? > > > > > > > > > > > > > > > > > > > > > > in order to find out why, i've used the -info option to > > > > > print out the > > > > > > > > details. there are 2 linear equations being solved - > > > > > momentum and > > > > > > > poisson. > > > > > > > > the momentum one is twice the size of the poisson. it is > > > > > shown below: > > > > > > > > > > > > > > Can you use -log_summary command line option and send the > > > > > output attached? > > > > > > > > > > > > > > > i saw some statements stating "seq". am i running in > > > > > sequential or > > > > > > > parallel > > > > > > > > mode? have i preallocated too much space? > > > > > > > > > > > > > > It seems you are running in parallel. The "Seq" are related to > > > > > local, > > > > > > > internal objects. In PETSc, parallel matrices have inner > > > > > sequential > > > > > > > matrices. > > > > > > > > > > > > > > > lastly, if Ax=b, A_sta and A_end from MatGetOwnershipRange > > > > > and b_sta > > > > > > > and > > > > > > > > b_end from VecGetOwnershipRange should always be the same > > > > > value, right? > > > > > > > > > > > > > > I should. If not, you are likely going to get an runtime > > > > > error. > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > -- > > > > > > > Lisandro Dalc?n > > > > > > > --------------- > > > > > > > Centro Internacional de M?todos Computacionales en Ingenier?a > > > > > (CIMEC) > > > > > > > Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica > > > > > (INTEC) > > > > > > > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas > > > > > (CONICET) > > > > > > > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > > > > > > > Tel/Fax: +54-(0)342-451.1594 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > One trouble is that despite this system, anyone who reads journals > > > widely > > > and critically is forced to realize that there are scarcely any bars > > > to eventual > > > publication. 
There seems to be no study too fragmented, no hypothesis > > > too > > > trivial, no literature citation too biased or too egotistical, no > > > design too > > > warped, no methodology too bungled, no presentation of results too > > > inaccurate, too obscure, and too contradictory, no analysis too > > > self-serving, > > > no argument too circular, no conclusions too trifling or too > > > unjustified, and > > > no grammar and syntax too offensive for a paper to end up in print. -- > > > Drummond Rennie > > > > > > > > > -- > One trouble is that despite this system, anyone who reads journals widely > and critically is forced to realize that there are scarcely any bars to > eventual > publication. There seems to be no study too fragmented, no hypothesis too > trivial, no literature citation too biased or too egotistical, no design > too > warped, no methodology too bungled, no presentation of results too > inaccurate, too obscure, and too contradictory, no analysis too > self-serving, > no argument too circular, no conclusions too trifling or too unjustified, > and > no grammar and syntax too offensive for a paper to end up in print. -- > Drummond Rennie > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Fri Feb 9 10:20:13 2007 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 9 Feb 2007 10:20:13 -0600 Subject: understanding the output from -info In-Reply-To: <804ab5d40702090816qb6d1325g1d311a0eb53eec26@mail.gmail.com> References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090434w4f0674e6s1c936cb410f3744a@mail.gmail.com> <804ab5d40702090620u5cf86c51s4e1b7b724eaf4f98@mail.gmail.com> <804ab5d40702090724n73db6f8w574622903161eb4a@mail.gmail.com> <804ab5d40702090816qb6d1325g1d311a0eb53eec26@mail.gmail.com> Message-ID: On 2/9/07, Ben Tay wrote: > > ops.... it worked for ex2 and ex2f ;-) > > so what could be wrong? is there some commands or subroutine which i must > call? btw, i'm programming in fortran. > Yes, you must call PetscFinalize() in your code. Matt thank you. > > > On 2/9/07, Matthew Knepley wrote: > > > > Problems do not go away by ignoring them. Something is wrong here, and > > it may > > affect the rest of your program. Please try to run an example: > > > > cd src/ksp/ksp/examples/tutorials > > make ex2 > > ./ex2 -log_summary > > > > Matt > > > > On 2/9/07, Ben Tay wrote: > > > > > > Well, I don't know what's wrong. I did the same thing for -info and it > > > worked. Anyway, is there any other way? > > > > > > Like I can use -mat_view or call matview( ... ) to view a matrix. Is > > > there a similar subroutine for me to call? > > > > > > Thank you. > > > > > > > > > On 2/9/07, Matthew Knepley wrote: > > > > > > > > Impossible, please check the spelling, and make sure your > > > > command line was not truncated. > > > > > > > > Matt > > > > > > > > On 2/9/07, Ben Tay < zonexo at gmail.com> wrote: > > > > > > > > > > ya, i did use -log_summary. but no output..... > > > > > > > > > > On 2/9/07, Barry Smith wrote: > > > > > > > > > > > > > > > > > > -log_summary > > > > > > > > > > > > > > > > > > On Fri, 9 Feb 2007, Ben Tay wrote: > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > I've tried to use log_summary but nothing came out? Did I miss > > > > > > out > > > > > > > something? It worked when I used -info... 
> > > > > > > > > > > > > > > > > > > > > On 2/9/07, Lisandro Dalcin wrote: > > > > > > > > > > > > > > > > On 2/8/07, Ben Tay < zonexo at gmail.com> wrote: > > > > > > > > > i'm trying to solve my cfd code using PETSc in parallel. > > > > > > Besides the > > > > > > > > linear > > > > > > > > > eqns for PETSc, other parts of the code has also been > > > > > > parallelized using > > > > > > > > > MPI. > > > > > > > > > > > > > > > > Finite elements or finite differences, or what? > > > > > > > > > > > > > > > > > however i find that the parallel version of the code > > > > > > running on 4 > > > > > > > > processors > > > > > > > > > is even slower than the sequential version. > > > > > > > > > > > > > > > > Can you monitor the convergence and iteration count of > > > > > > momentum and > > > > > > > > poisson steps? > > > > > > > > > > > > > > > > > > > > > > > > > in order to find out why, i've used the -info option to > > > > > > print out the > > > > > > > > > details. there are 2 linear equations being solved - > > > > > > momentum and > > > > > > > > poisson. > > > > > > > > > the momentum one is twice the size of the poisson. it is > > > > > > shown below: > > > > > > > > > > > > > > > > Can you use -log_summary command line option and send the > > > > > > output attached? > > > > > > > > > > > > > > > > > i saw some statements stating "seq". am i running in > > > > > > sequential or > > > > > > > > parallel > > > > > > > > > mode? have i preallocated too much space? > > > > > > > > > > > > > > > > It seems you are running in parallel. The "Seq" are related > > > > > > to local, > > > > > > > > internal objects. In PETSc, parallel matrices have inner > > > > > > sequential > > > > > > > > matrices. > > > > > > > > > > > > > > > > > lastly, if Ax=b, A_sta and A_end > > > > > > from MatGetOwnershipRange and b_sta > > > > > > > > and > > > > > > > > > b_end from VecGetOwnershipRange should always be the same > > > > > > value, right? > > > > > > > > > > > > > > > > I should. If not, you are likely going to get an runtime > > > > > > error. > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > > > -- > > > > > > > > Lisandro Dalc?n > > > > > > > > --------------- > > > > > > > > Centro Internacional de M?todos Computacionales en > > > > > > Ingenier?a (CIMEC) > > > > > > > > Instituto de Desarrollo Tecnol?gico para la Industria > > > > > > Qu?mica (INTEC) > > > > > > > > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas > > > > > > (CONICET) > > > > > > > > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > > > > > > > > Tel/Fax: +54-(0)342-451.1594 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > One trouble is that despite this system, anyone who reads journals > > > > widely > > > > and critically is forced to realize that there are scarcely any bars > > > > to eventual > > > > publication. There seems to be no study too fragmented, no > > > > hypothesis too > > > > trivial, no literature citation too biased or too egotistical, no > > > > design too > > > > warped, no methodology too bungled, no presentation of results too > > > > inaccurate, too obscure, and too contradictory, no analysis too > > > > self-serving, > > > > no argument too circular, no conclusions too trifling or too > > > > unjustified, and > > > > no grammar and syntax too offensive for a paper to end up in print. 
> > > > -- Drummond Rennie > > > > > > > > > > > > > > > -- > > One trouble is that despite this system, anyone who reads journals > > widely > > and critically is forced to realize that there are scarcely any bars to > > eventual > > publication. There seems to be no study too fragmented, no hypothesis > > too > > trivial, no literature citation too biased or too egotistical, no design > > too > > warped, no methodology too bungled, no presentation of results too > > inaccurate, too obscure, and too contradictory, no analysis too > > self-serving, > > no argument too circular, no conclusions too trifling or too > > unjustified, and > > no grammar and syntax too offensive for a paper to end up in print. -- > > Drummond Rennie > > > > -- One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. -- Drummond Rennie -------------- next part -------------- An HTML attachment was scrubbed... URL: From dimitri.lecas at c-s.fr Fri Feb 9 11:33:40 2007 From: dimitri.lecas at c-s.fr (LECAS Dimitri) Date: Fri, 09 Feb 2007 18:33:40 +0100 Subject: Partitioning on a mpiaij matrix Message-ID: <38643393c5.393c538643@c-s.fr> Hello, I thinks i find a "bug". I try to use parmetis for partitioning a matrix created with MatCreateMPIAIJ. Here the output : [0]PETSC ERROR: No support for this operation for this object type! [0]PETSC ERROR: Mat type mpiadj! [0]PETSC ERROR: MatSetValues() line 825 in src/mat/interface/matrix.c [0]PETSC ERROR: MatConvert_Basic() line 34 in src/mat/utils/convert.c [0]PETSC ERROR: MatConvert() line 3134 in src/mat/interface/matrix.c [0]PETSC ERROR: MatPartitioningApply_Parmetis() line 47 in src/mat/partition/impls/pmetis/pmetis.c [0]PETSC ERROR: MatPartitioningApply() line 238 in src/mat/partition/partition.c If i understand correctly, MatPartitioningApply_Parmetis try to convert the matrix in format MPIAdj and failed because we can't use MatSetValues on a MPIAdj. It's possible to easily avoid this bug ? Best regards -- Dimitri Lecas From bsmith at mcs.anl.gov Fri Feb 9 13:09:17 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 9 Feb 2007 13:09:17 -0600 (CST) Subject: Partitioning on a mpiaij matrix In-Reply-To: <38643393c5.393c538643@c-s.fr> References: <38643393c5.393c538643@c-s.fr> Message-ID: MatConvert() checks for a variety of converts; from the code /* 3) See if a good general converter is registered for the desired class */ conv = B->ops->convertfrom; ierr = MatDestroy(B);CHKERRQ(ierr); if (conv) goto foundconv; now MATMPIADJ has a MatConvertFrom that SHOULD be listed in the function table so it should not fall into the default MatConvert_Basic(). What version of PETSc are you using? Maybe an older one that does not have this converter? If you are using 2.3.2 or petsc-dev you can put a breakpoint in MatConvert() and try to see why it is not picking up the convertfrom function? It is possible some bug that we are not aware of but I have difficulty seeing what could be going wrong. 
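
For reference, the user-side calling sequence involved here is roughly the sketch below. It is illustrative only: the helper name and the matrix A are placeholders, the partitioner is selected by its registered string name, and MatPartitioningDestroy takes a pointer argument in newer PETSc releases.

#include "petscmat.h"

/* Sketch: partition the rows of an assembled parallel (MPIAIJ) matrix A with
   ParMETIS through the MatPartitioning interface. PETSc converts A to MPIAdj
   internally inside MatPartitioningApply, which is where the error reported
   above is raised. */
PetscErrorCode PartitionRows(Mat A, IS *rowsToProc)
{
  MatPartitioning part;
  PetscErrorCode  ierr;

  PetscFunctionBegin;
  ierr = MatPartitioningCreate(PETSC_COMM_WORLD, &part);CHKERRQ(ierr);
  ierr = MatPartitioningSetAdjacency(part, A);CHKERRQ(ierr);
  ierr = MatPartitioningSetType(part, "parmetis");CHKERRQ(ierr);
  ierr = MatPartitioningSetFromOptions(part);CHKERRQ(ierr);
  /* On return, rowsToProc gives, for each locally owned row, the process
     it is assigned to by the partitioner. */
  ierr = MatPartitioningApply(part, rowsToProc);CHKERRQ(ierr);
  ierr = MatPartitioningDestroy(part);CHKERRQ(ierr); /* takes &part in newer versions */
  PetscFunctionReturn(0);
}
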
Good luck, Barry On Fri, 9 Feb 2007, LECAS Dimitri wrote: > Hello, > > I thinks i find a "bug". I try to use parmetis for partitioning a matrix > created with MatCreateMPIAIJ. > > Here the output : > > [0]PETSC ERROR: No support for this operation for this object type! > [0]PETSC ERROR: Mat type mpiadj! > [0]PETSC ERROR: MatSetValues() line 825 in src/mat/interface/matrix.c > [0]PETSC ERROR: MatConvert_Basic() line 34 in src/mat/utils/convert.c > [0]PETSC ERROR: MatConvert() line 3134 in src/mat/interface/matrix.c > [0]PETSC ERROR: MatPartitioningApply_Parmetis() line 47 in > src/mat/partition/impls/pmetis/pmetis.c > [0]PETSC ERROR: MatPartitioningApply() line 238 in > src/mat/partition/partition.c > > If i understand correctly, MatPartitioningApply_Parmetis try to convert > the matrix in format MPIAdj and failed because we can't use MatSetValues > on a MPIAdj. > > It's possible to easily avoid this bug ? > > Best regards > > From dimitri.lecas at c-s.fr Fri Feb 9 14:21:03 2007 From: dimitri.lecas at c-s.fr (LECAS Dimitri) Date: Fri, 09 Feb 2007 21:21:03 +0100 Subject: Partitioning on a mpiaij matrix Message-ID: <399fc3cc1b.3cc1b399fc@c-s.fr> ----- Original Message ----- From: Barry Smith Date: Friday, February 9, 2007 8:09 pm Subject: Re: Partitioning on a mpiaij matrix > > MatConvert() checks for a variety of converts; from the code > > /* 3) See if a good general converter is registered for the > desired class */ > conv = B->ops->convertfrom; > ierr = MatDestroy(B);CHKERRQ(ierr); > if (conv) goto foundconv; > > now MATMPIADJ has a MatConvertFrom that SHOULD be listed in the > function table > so it should not fall into the default MatConvert_Basic(). > > What version of PETSc are you using? Maybe an older one that > does not have > this converter? If you are using 2.3.2 or petsc-dev you can put a > breakpoint in MatConvert() and try to see why it is not picking up > the > convertfrom function? It is possible some bug that we are not > aware of > but I have difficulty seeing what could be going wrong. > > Good luck, > > Barry > > I used the 2.3.2-p8 from the lite package (the one without the documentation). I'm sorry i'm not longer at work so i can't test anything before monday. -- Dimitri Lecas From jinzishuai at yahoo.com Fri Feb 9 16:59:22 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 9 Feb 2007 14:59:22 -0800 (PST) Subject: A 3D example of KSPSolve? Message-ID: <930330.25934.qm@web36210.mail.mud.yahoo.com> Hi there, I am tuning our 3D FEM CFD code written with PETSc. The code doesn't scale very well. For example, with 8 processes on a linux cluster, the speedup we achieve with a fairly large problem size(million of elements) is only 3 to 4 using the Congugate gradient solver. We can achieve a speed up of a 6.5 using a GMRes solver but the wall clock time of a GMRes is longer than a CG solver which indicates that CG is the faster solver and it scales not as good as GMRes. Is this generally true? I then went to the examples and find a 2D example of KSPSolve (ex2.c). I let the code ran with a 1000x1000 mesh and get a linear scaling of the CG solver and a super linear scaling of the GMRes. These are both much better than our code. However, I think the 2D nature of the sample problem might help the scaling of the code. So I would like to try some 3D example using the KSPSolve. Unfortunately, I couldn't find such an example either in the src/ksp/ksp/examples/tutorials directory or by google search. 
There are a couple of 3D examples in the src/ksp/ksp/examples/tutorials but they are about the SNES not KSPSolve. If anyone can provide me with such an example, I would really appreciate it. Thanks a lot. Shi ____________________________________________________________________________________ Finding fabulous fares is fun. Let Yahoo! FareChase search your favorite travel sites to find flight and hotel bargains. http://farechase.yahoo.com/promo-generic-14795097 From bsmith at mcs.anl.gov Fri Feb 9 18:53:09 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 9 Feb 2007 18:53:09 -0600 (CST) Subject: A 3D example of KSPSolve? In-Reply-To: <930330.25934.qm@web36210.mail.mud.yahoo.com> References: <930330.25934.qm@web36210.mail.mud.yahoo.com> Message-ID: Shi, There is never a better test problem then your actual problem. Send the results from running on 1, 4, and 8 processes with the options -log_summary -ksp_view (use the optimized version of PETSc (running config/configure.py --with-debugging=0)) Barry On Fri, 9 Feb 2007, Shi Jin wrote: > Hi there, > > I am tuning our 3D FEM CFD code written with PETSc. > The code doesn't scale very well. For example, with 8 > processes on a linux cluster, the speedup we achieve > with a fairly large problem size(million of elements) > is only 3 to 4 using the Congugate gradient solver. We > can achieve a speed up of a 6.5 using a GMRes solver > but the wall clock time of a GMRes is longer than a CG > solver which indicates that CG is the faster solver > and it scales not as good as GMRes. Is this generally > true? > > I then went to the examples and find a 2D example of > KSPSolve (ex2.c). I let the code ran with a 1000x1000 > mesh and get a linear scaling of the CG solver and a > super linear scaling of the GMRes. These are both much > better than our code. However, I think the 2D nature > of the sample problem might help the scaling of the > code. So I would like to try some 3D example using the > KSPSolve. Unfortunately, I couldn't find such an > example either in the src/ksp/ksp/examples/tutorials > directory or by google search. There are a couple of > 3D examples in the src/ksp/ksp/examples/tutorials but > they are about the SNES not KSPSolve. If anyone can > provide me with such an example, I would really > appreciate it. > Thanks a lot. > > Shi > > > > ____________________________________________________________________________________ > Finding fabulous fares is fun. > Let Yahoo! FareChase search your favorite travel sites to find flight and hotel bargains. > http://farechase.yahoo.com/promo-generic-14795097 > > From zonexo at gmail.com Fri Feb 9 18:51:43 2007 From: zonexo at gmail.com (Ben Tay) Date: Sat, 10 Feb 2007 08:51:43 +0800 Subject: understanding the output from -info In-Reply-To: References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090434w4f0674e6s1c936cb410f3744a@mail.gmail.com> <804ab5d40702090620u5cf86c51s4e1b7b724eaf4f98@mail.gmail.com> <804ab5d40702090724n73db6f8w574622903161eb4a@mail.gmail.com> <804ab5d40702090816qb6d1325g1d311a0eb53eec26@mail.gmail.com> Message-ID: <804ab5d40702091651h6265a510jf5d4ca46cd526876@mail.gmail.com> Ya, that's the mistake. I changed part of the code resulting in PetscFinalize not being called. 
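
For anyone else hitting this: -log_summary collects its statistics during the run but only assembles and prints the report inside PetscFinalize(), so a code path that never reaches that call produces no summary at all. A minimal C skeleton of the required structure is sketched below; in a Fortran code the corresponding calls are call PetscInitialize(PETSC_NULL_CHARACTER,ierr) at the start and call PetscFinalize(ierr) at the end.

#include "petscksp.h"

/* Minimal skeleton: every PETSc program brackets its work with
   PetscInitialize() and PetscFinalize(); the -log_summary report is
   emitted from inside PetscFinalize(). */
int main(int argc, char **argv)
{
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, (char*)0, (char*)0);CHKERRQ(ierr);

  /* ... create Mat/Vec/KSP objects, assemble, solve, destroy them ... */

  ierr = PetscFinalize();CHKERRQ(ierr); /* -log_summary output appears here */
  return 0;
}
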
Here's the output: ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- /home/enduser/g0306332/ns2d/a.out on a linux-mpi named atlas00.nus.edu.sgwith 4 processors, by g0306332 Sat Feb 10 08:32:08 2007 Using Petsc Release Version 2.3.2, Patch 8, Tue Jan 2 14:33:59 PST 2007 HG revision: ebeddcedcc065e32fc252af32cf1d01ed4fc7a80 Max Max/Min Avg Total Time (sec): 2.826e+02 2.08192 1.725e+02 Objects: 1.110e+02 1.00000 1.110e+02 Flops: 6.282e+08 1.00736 6.267e+08 2.507e+09 Flops/sec: 4.624e+06 2.08008 4.015e+06 1.606e+07 Memory: 1.411e+07 1.01142 5.610e+07 MPI Messages: 8.287e+03 1.90156 6.322e+03 2.529e+04 MPI Message Lengths: 6.707e+07 1.11755 1.005e+04 2.542e+08 MPI Reductions: 3.112e+03 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 1.7247e+02 100.0% 2.5069e+09 100.0% 2.529e+04 100.0% 1.005e+04 100.0% 1.245e+04 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops/sec: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). %T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ ########################################################## # # # WARNING!!! # # # # This code was compiled with a debugging option, # # To get timing results run config/configure.py # # using --with-debugging=no, the performance will # # be generally two or three times faster. # # # ########################################################## ########################################################## ########################################################## # # # WARNING!!! # # # # This code was run without the PreLoadBegin() # # macros. To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. 
# ########################################################## Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 3927 1.0 2.4071e+01 1.3 6.14e+06 1.4 2.4e+04 1.3e+03 0.0e+00 12 18 93 12 0 12 18 93 12 0 19 MatSolve 3967 1.0 2.5914e+00 1.9 7.99e+07 1.9 0.0e+00 0.0e+00 0.0e+00 1 17 0 0 0 1 17 0 0 0 168 MatLUFactorNum 40 1.0 4.4779e-01 1.5 3.14e+07 1.5 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 0 2 0 0 0 85 MatILUFactorSym 2 1.0 3.1099e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatScale 20 1.0 1.1487e-01 8.7 8.73e+07 8.9 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 39 MatAssemblyBegin 40 1.0 7.8844e+00 1.3 0.00e+00 0.0 7.6e+02 2.8e+05 8.0e+01 4 0 3 83 1 4 0 3 83 1 0 MatAssemblyEnd 40 1.0 6.9408e+00 1.2 0.00e+00 0.0 1.2e+01 9.6e+02 6.4e+01 4 0 0 0 1 4 0 0 0 1 0 MatGetOrdering 2 1.0 8.0509e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatZeroEntries 21 1.0 1.4379e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecMDot 3792 1.0 4.7372e+01 1.4 5.20e+06 1.4 0.0e+00 0.0e+00 3.8e+03 24 29 0 0 30 24 29 0 0 30 15 VecNorm 3967 1.0 3.9513e+01 1.2 4.11e+05 1.2 0.0e+00 0.0e+00 4.0e+03 21 2 0 0 32 21 2 0 0 32 1 VecScale 3947 1.0 3.4941e-02 1.2 2.18e+08 1.2 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 738 VecCopy 155 1.0 1.0029e-0125.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 4142 1.0 3.4638e-01 6.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAXPY 290 1.0 5.9618e-03 1.2 2.14e+08 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 709 VecMAXPY 3947 1.0 1.5566e+00 1.3 1.64e+08 1.3 0.0e+00 0.0e+00 0.0e+00 1 31 0 0 0 1 31 0 0 0 498 VecAssemblyBegin 80 1.0 4.1793e+00 1.1 0.00e+00 0.0 9.6e+02 1.4e+04 2.4e+02 2 0 4 5 2 2 0 4 5 2 0 VecAssemblyEnd 80 1.0 2.0682e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecScatterBegin 3927 1.0 2.8672e-01 3.9 0.00e+00 0.0 2.4e+04 1.3e+03 0.0e+00 0 0 93 12 0 0 0 93 12 0 0 VecScatterEnd 3927 1.0 2.2135e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 11 0 0 0 0 11 0 0 0 0 0 VecNormalize 3947 1.0 3.9593e+01 1.2 6.11e+05 1.2 0.0e+00 0.0e+00 3.9e+03 21 3 0 0 32 21 3 0 0 32 2 KSPGMRESOrthog 3792 1.0 4.8670e+01 1.3 9.92e+06 1.3 0.0e+00 0.0e+00 3.8e+03 25 58 0 0 30 25 58 0 0 30 30 KSPSetup 80 1.0 2.0014e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+01 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 40 1.0 1.0660e+02 1.0 5.90e+06 1.0 2.4e+04 1.3e+03 1.2e+04 62100 93 12 97 62100 93 12 97 23 PCSetUp 80 1.0 4.5669e-01 1.5 3.05e+07 1.5 0.0e+00 0.0e+00 1.4e+01 0 2 0 0 0 0 2 0 0 0 83 PCSetUpOnBlocks 40 1.0 4.5418e-01 1.5 3.07e+07 1.5 0.0e+00 0.0e+00 1.0e+01 0 2 0 0 0 0 2 0 0 0 84 PCApply 3967 1.0 4.1737e+00 2.0 5.30e+07 2.0 0.0e+00 0.0e+00 4.0e+03 2 17 0 0 32 2 17 0 0 32 104 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. 
--- Event Stage 0: Main Stage Matrix 8 8 21136 0 Index Set 12 12 74952 0 Vec 81 81 1447476 0 Vec Scatter 2 2 0 0 Krylov Solver 4 4 33760 0 Preconditioner 4 4 392 0 ======================================================================================================================== Average time to get PetscTime(): 1.09673e-06 Average time for MPI_Barrier(): 3.90053e-05 Average time for zero size MPI_Send(): 1.65105e-05 OptionTable: -log_summary Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 sizeof(PetscScalar) 8 Configure run at: Thu Jan 18 12:23:31 2007 Configure options: --with-vendor-compilers=intel --with-x=0 --with-shared --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 --with-mpi-dir=/opt/mpich/myrinet/intel/ ----------------------------------------- Libraries compiled on Thu Jan 18 12:24:41 SGT 2007 on atlas1.nus.edu.sg Machine characteristics: Linux atlas1.nus.edu.sg 2.4.21-20.ELsmp #1 SMP Wed Sep 8 17:29:34 GMT 2004 i686 i686 i386 GNU/Linux Using PETSc directory: /nas/lsftmp/g0306332/petsc-2.3.2-p8 Using PETSc arch: linux-mpif90 ----------------------------------------- Using C compiler: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g Using Fortran compiler: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g -w90 -w ----------------------------------------- Using include paths: -I/nas/lsftmp/g0306332/petsc-2.3.2-p8-I/nas/lsftmp/g0306332/petsc- 2.3.2-p8/bmake/linux-mpif90 -I/nas/lsftmp/g0306332/petsc-2.3.2-p8/include -I/opt/mpich/myrinet/intel/include ------------------------------------------ Using C linker: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g Using Fortran linker: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g -w90 -w Using libraries: -Wl,-rpath,/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 -L/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -Wl,-rpath,/lsftmp/g0306332/inter/mkl/lib/32 -L/lsftmp/g0306332/inter/mkl/lib/32 -lmkl_lapack -lmkl_ia32 -lguide -lPEPCF90 -Wl,-rpath,/opt/intel/compiler70/ia32/lib -Wl,-rpath,/opt/mpich/myrinet/intel/lib -L/opt/mpich/myrinet/intel/lib -Wl,-rpath,-rpath -Wl,-rpath,-ldl -L-ldl -lmpich -Wl,-rpath,-L -lgm -lpthread -Wl,-rpath,/opt/intel/compiler70/ia32/lib -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib -Wl,-rpath,/usr/lib -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa -lunwind -ldl -lmpichf90 -Wl,-rpath,/opt/gm/lib -L/opt/gm/lib -lPEPCF90 -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib -Wl,-rpath,/usr/lib -L/usr/lib -lintrins -lIEPCF90 -lF90 -lm -Wl,-rpath,\ -Wl,-rpath,\ -L\ -ldl -lmpich -Wl,-rpath,\ -L\ -lgm -lpthread -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa -lunwind -ldl This is the result I get for running 20 steps. There are 2 matrix to be solved. I've only parallize the solving of linear equations and kept the rest of the code serial for this test. However, I found that it's much slower than the sequential version. From the ratio, it seems that MatScale and VecSet 's ratio are very high. I've done a scaling of 0.5 for momentum eqn. Is that the reason for the slowness? That is all I can decipher .... Thank you. On 2/10/07, Matthew Knepley wrote: > > On 2/9/07, Ben Tay wrote: > > > > ops.... it worked for ex2 and ex2f ;-) > > > > so what could be wrong? 
is there some commands or subroutine which i > > must call? btw, i'm programming in fortran. > > > > Yes, you must call PetscFinalize() in your code. > > Matt > > > thank you. > > > > > > On 2/9/07, Matthew Knepley wrote: > > > > > > Problems do not go away by ignoring them. Something is wrong here, and > > > it may > > > affect the rest of your program. Please try to run an example: > > > > > > cd src/ksp/ksp/examples/tutorials > > > make ex2 > > > ./ex2 -log_summary > > > > > > Matt > > > > > > On 2/9/07, Ben Tay wrote: > > > > > > > > Well, I don't know what's wrong. I did the same thing for -info and > > > > it worked. Anyway, is there any other way? > > > > > > > > Like I can use -mat_view or call matview( ... ) to view a matrix. Is > > > > there a similar subroutine for me to call? > > > > > > > > Thank you. > > > > > > > > > > > > On 2/9/07, Matthew Knepley wrote: > > > > > > > > > > Impossible, please check the spelling, and make sure your > > > > > command line was not truncated. > > > > > > > > > > Matt > > > > > > > > > > On 2/9/07, Ben Tay < zonexo at gmail.com> wrote: > > > > > > > > > > > > ya, i did use -log_summary. but no output..... > > > > > > > > > > > > On 2/9/07, Barry Smith wrote: > > > > > > > > > > > > > > > > > > > > > -log_summary > > > > > > > > > > > > > > > > > > > > > On Fri, 9 Feb 2007, Ben Tay wrote: > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > I've tried to use log_summary but nothing came out? Did I > > > > > > > miss out > > > > > > > > something? It worked when I used -info... > > > > > > > > > > > > > > > > > > > > > > > > On 2/9/07, Lisandro Dalcin wrote: > > > > > > > > > > > > > > > > > > On 2/8/07, Ben Tay < zonexo at gmail.com> wrote: > > > > > > > > > > i'm trying to solve my cfd code using PETSc in parallel. > > > > > > > Besides the > > > > > > > > > linear > > > > > > > > > > eqns for PETSc, other parts of the code has also been > > > > > > > parallelized using > > > > > > > > > > MPI. > > > > > > > > > > > > > > > > > > Finite elements or finite differences, or what? > > > > > > > > > > > > > > > > > > > however i find that the parallel version of the code > > > > > > > running on 4 > > > > > > > > > processors > > > > > > > > > > is even slower than the sequential version. > > > > > > > > > > > > > > > > > > Can you monitor the convergence and iteration count of > > > > > > > momentum and > > > > > > > > > poisson steps? > > > > > > > > > > > > > > > > > > > > > > > > > > > > in order to find out why, i've used the -info option to > > > > > > > print out the > > > > > > > > > > details. there are 2 linear equations being solved - > > > > > > > momentum and > > > > > > > > > poisson. > > > > > > > > > > the momentum one is twice the size of the poisson. it is > > > > > > > shown below: > > > > > > > > > > > > > > > > > > Can you use -log_summary command line option and send the > > > > > > > output attached? > > > > > > > > > > > > > > > > > > > i saw some statements stating "seq". am i running in > > > > > > > sequential or > > > > > > > > > parallel > > > > > > > > > > mode? have i preallocated too much space? > > > > > > > > > > > > > > > > > > It seems you are running in parallel. The "Seq" are > > > > > > > related to local, > > > > > > > > > internal objects. In PETSc, parallel matrices have inner > > > > > > > sequential > > > > > > > > > matrices. 
> > > > > > > > > > > > > > > > > > > lastly, if Ax=b, A_sta and A_end > > > > > > > from MatGetOwnershipRange and b_sta > > > > > > > > > and > > > > > > > > > > b_end from VecGetOwnershipRange should always be the > > > > > > > same value, right? > > > > > > > > > > > > > > > > > > I should. If not, you are likely going to get an runtime > > > > > > > error. > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Lisandro Dalc?n > > > > > > > > > --------------- > > > > > > > > > Centro Internacional de M?todos Computacionales en > > > > > > > Ingenier?a (CIMEC) > > > > > > > > > Instituto de Desarrollo Tecnol?gico para la Industria > > > > > > > Qu?mica (INTEC) > > > > > > > > > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas > > > > > > > (CONICET) > > > > > > > > > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > > > > > > > > > Tel/Fax: +54-(0)342-451.1594 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > One trouble is that despite this system, anyone who reads journals > > > > > widely > > > > > and critically is forced to realize that there are scarcely any > > > > > bars to eventual > > > > > publication. There seems to be no study too fragmented, no > > > > > hypothesis too > > > > > trivial, no literature citation too biased or too egotistical, no > > > > > design too > > > > > warped, no methodology too bungled, no presentation of results too > > > > > > > > > > inaccurate, too obscure, and too contradictory, no analysis too > > > > > self-serving, > > > > > no argument too circular, no conclusions too trifling or too > > > > > unjustified, and > > > > > no grammar and syntax too offensive for a paper to end up in > > > > > print. -- Drummond Rennie > > > > > > > > > > > > > > > > > > > > > -- > > > One trouble is that despite this system, anyone who reads journals > > > widely > > > and critically is forced to realize that there are scarcely any bars > > > to eventual > > > publication. There seems to be no study too fragmented, no hypothesis > > > too > > > trivial, no literature citation too biased or too egotistical, no > > > design too > > > warped, no methodology too bungled, no presentation of results too > > > inaccurate, too obscure, and too contradictory, no analysis too > > > self-serving, > > > no argument too circular, no conclusions too trifling or too > > > unjustified, and > > > no grammar and syntax too offensive for a paper to end up in print. -- > > > Drummond Rennie > > > > > > > > > > -- > One trouble is that despite this system, anyone who reads journals widely > and critically is forced to realize that there are scarcely any bars to > eventual > publication. There seems to be no study too fragmented, no hypothesis too > trivial, no literature citation too biased or too egotistical, no design > too > warped, no methodology too bungled, no presentation of results too > inaccurate, too obscure, and too contradictory, no analysis too > self-serving, > no argument too circular, no conclusions too trifling or too unjustified, > and > no grammar and syntax too offensive for a paper to end up in print. -- > Drummond Rennie > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From knepley at gmail.com Fri Feb 9 19:15:49 2007 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 9 Feb 2007 19:15:49 -0600 Subject: understanding the output from -info In-Reply-To: <804ab5d40702091651h6265a510jf5d4ca46cd526876@mail.gmail.com> References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090434w4f0674e6s1c936cb410f3744a@mail.gmail.com> <804ab5d40702090620u5cf86c51s4e1b7b724eaf4f98@mail.gmail.com> <804ab5d40702090724n73db6f8w574622903161eb4a@mail.gmail.com> <804ab5d40702090816qb6d1325g1d311a0eb53eec26@mail.gmail.com> <804ab5d40702091651h6265a510jf5d4ca46cd526876@mail.gmail.com> Message-ID: 1) These MFlop rates are terrible. It seems like your problem is way too small. 2) The load balance is not good. Matt On 2/9/07, Ben Tay wrote: > > Ya, that's the mistake. I changed part of the code resulting in > PetscFinalize not being called. > > Here's the output: > > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > /home/enduser/g0306332/ns2d/a.out on a linux-mpi named atlas00.nus.edu.sgwith 4 processors, by g0306332 Sat Feb 10 08:32:08 2007 > Using Petsc Release Version 2.3.2, Patch 8, Tue Jan 2 14:33:59 PST 2007 > HG revision: ebeddcedcc065e32fc252af32cf1d01ed4fc7a80 > > Max Max/Min Avg Total > Time (sec): 2.826e+02 2.08192 1.725e+02 > Objects: 1.110e+02 1.00000 1.110e+02 > Flops: 6.282e+08 1.00736 6.267e+08 2.507e+09 > Flops/sec: 4.624e+06 2.08008 4.015e+06 1.606e+07 > Memory: 1.411e+07 1.01142 5.610e+07 > MPI Messages: 8.287e+03 1.90156 6.322e+03 2.529e+04 > MPI Message Lengths: 6.707e+07 1.11755 1.005e+04 2.542e+08 > MPI Reductions: 3.112e+03 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N > --> 2N flops > and VecAXPY() for complex vectors of length N > --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages > --- -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts > %Total Avg %Total counts %Total > 0: Main Stage: 1.7247e+02 100.0% 2.5069e+09 100.0% 2.529e+04 > 100.0% 1.005e+04 100.0% 1.245e+04 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths > in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time > over all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! 
# > # # > # This code was compiled with a debugging option, # > # To get timing results run config/configure.py # > # using --with-debugging=no, the performance will # > # be generally two or three times faster. # > # # > ########################################################## > > > > > ########################################################## > > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. # > ########################################################## > > > Event Count Time (sec) > Flops/sec --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 3927 1.0 2.4071e+01 1.3 6.14e+06 1.4 2.4e+04 1.3e+03 > 0.0e+00 12 18 93 12 0 12 18 93 12 0 19 > MatSolve 3967 1.0 2.5914e+00 1.9 7.99e+07 1.9 0.0e+00 0.0e+00 > 0.0e+00 1 17 0 0 0 1 17 0 0 0 168 > MatLUFactorNum 40 1.0 4.4779e-01 1.5 3.14e+07 1.5 0.0e+00 0.0e+00 > 0.0e+00 0 2 0 0 0 0 2 0 0 0 85 > MatILUFactorSym 2 1.0 3.1099e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 4.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatScale 20 1.0 1.1487e-01 8.7 8.73e+07 8.9 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 39 > MatAssemblyBegin 40 1.0 7.8844e+00 1.3 0.00e+00 0.0 7.6e+02 2.8e+05 > 8.0e+01 4 0 3 83 1 4 0 3 83 1 0 > MatAssemblyEnd 40 1.0 6.9408e+00 1.2 0.00e+00 0.0 1.2e+01 9.6e+02 > 6.4e+01 4 0 0 0 1 4 0 0 0 1 0 > MatGetOrdering 2 1.0 8.0509e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatZeroEntries 21 1.0 1.4379e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 3792 1.0 4.7372e+01 1.4 5.20e+06 1.4 0.0e+00 0.0e+00 > 3.8e+03 24 29 0 0 30 24 29 0 0 30 15 > VecNorm 3967 1.0 3.9513e+01 1.2 4.11e+05 1.2 0.0e+00 0.0e+00 > 4.0e+03 21 2 0 0 32 21 2 0 0 32 1 > VecScale 3947 1.0 3.4941e-02 1.2 2.18e+08 1.2 0.0e+00 0.0e+00 > 0.0e+00 0 1 0 0 0 0 1 0 0 0 738 > VecCopy 155 1.0 1.0029e-0125.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 4142 1.0 3.4638e-01 6.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 290 1.0 5.9618e-03 1.2 2.14e+08 1.2 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 709 > VecMAXPY 3947 1.0 1.5566e+00 1.3 1.64e+08 1.3 0.0e+00 0.0e+00 > 0.0e+00 1 31 0 0 0 1 31 0 0 0 498 > VecAssemblyBegin 80 1.0 4.1793e+00 1.1 0.00e+00 0.0 9.6e+02 1.4e+04 > 2.4e+02 2 0 4 5 2 2 0 4 5 2 0 > VecAssemblyEnd 80 1.0 2.0682e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecScatterBegin 3927 1.0 2.8672e-01 3.9 0.00e+00 0.0 2.4e+04 1.3e+03 > 0.0e+00 0 0 93 12 0 0 0 93 12 0 0 > VecScatterEnd 3927 1.0 2.2135e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 11 0 0 0 0 11 0 0 0 0 0 > VecNormalize 3947 1.0 3.9593e+01 1.2 6.11e+05 1.2 0.0e+00 0.0e+00 > 3.9e+03 21 3 0 0 32 21 3 0 0 32 2 > KSPGMRESOrthog 3792 1.0 4.8670e+01 1.3 9.92e+06 1.3 0.0e+00 0.0e+00 > 3.8e+03 25 58 0 0 30 25 58 0 0 30 30 > KSPSetup 80 1.0 2.0014e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+01 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 40 1.0 1.0660e+02 1.0 5.90e+06 1.0 2.4e+04 1.3e+03 > 1.2e+04 62100 93 12 97 62100 93 12 97 23 > PCSetUp 80 1.0 4.5669e-01 1.5 3.05e+07 1.5 0.0e+00 0.0e+00 > 1.4e+01 0 2 0 0 0 0 2 0 0 0 83 > PCSetUpOnBlocks 40 1.0 4.5418e-01 
1.5 3.07e+07 1.5 0.0e+00 0.0e+00 > 1.0e+01 0 2 0 0 0 0 2 0 0 0 84 > PCApply 3967 1.0 4.1737e+00 2.0 5.30e+07 2.0 0.0e+00 0.0e+00 > 4.0e+03 2 17 0 0 32 2 17 0 0 32 104 > ------------------------------------------------------------------------------------------------------------------------ > > > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. > > --- Event Stage 0: Main Stage > > Matrix 8 8 21136 0 > Index Set 12 12 74952 0 > Vec 81 81 1447476 0 > Vec Scatter 2 2 0 0 > Krylov Solver 4 4 33760 0 > Preconditioner 4 4 392 0 > ======================================================================================================================== > > Average time to get PetscTime(): 1.09673e-06 > Average time for MPI_Barrier(): 3.90053e-05 > Average time for zero size MPI_Send(): 1.65105e-05 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 > sizeof(PetscScalar) 8 > Configure run at: Thu Jan 18 12:23:31 2007 > Configure options: --with-vendor-compilers=intel --with-x=0 --with-shared > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 > --with-mpi-dir=/opt/mpich/myrinet/intel/ > ----------------------------------------- > Libraries compiled on Thu Jan 18 12:24:41 SGT 2007 on atlas1.nus.edu.sg > Machine characteristics: Linux atlas1.nus.edu.sg 2.4.21-20.ELsmp #1 SMP > Wed Sep 8 17:29:34 GMT 2004 i686 i686 i386 GNU/Linux > Using PETSc directory: /nas/lsftmp/g0306332/petsc-2.3.2-p8 > Using PETSc arch: linux-mpif90 > ----------------------------------------- > Using C compiler: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g > Using Fortran compiler: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g > -w90 -w > ----------------------------------------- > Using include paths: -I/nas/lsftmp/g0306332/petsc- 2.3.2-p8-I/nas/lsftmp/g0306332/petsc- > 2.3.2-p8/bmake/linux-mpif90 -I/nas/lsftmp/g0306332/petsc-2.3.2-p8/include > -I/opt/mpich/myrinet/intel/include > ------------------------------------------ > Using C linker: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g > Using Fortran linker: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g > -w90 -w > Using libraries: -Wl,-rpath,/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 > -L/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 -lpetscts > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > -Wl,-rpath,/lsftmp/g0306332/inter/mkl/lib/32 > -L/lsftmp/g0306332/inter/mkl/lib/32 -lmkl_lapack -lmkl_ia32 -lguide > -lPEPCF90 -Wl,-rpath,/opt/intel/compiler70/ia32/lib > -Wl,-rpath,/opt/mpich/myrinet/intel/lib -L/opt/mpich/myrinet/intel/lib > -Wl,-rpath,-rpath -Wl,-rpath,-ldl -L-ldl -lmpich -Wl,-rpath,-L -lgm > -lpthread -Wl,-rpath,/opt/intel/compiler70/ia32/lib > -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib > -Wl,-rpath,/usr/lib -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa > -lunwind -ldl -lmpichf90 -Wl,-rpath,/opt/gm/lib -L/opt/gm/lib -lPEPCF90 > -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib > -Wl,-rpath,/usr/lib -L/usr/lib -lintrins -lIEPCF90 -lF90 -lm -Wl,-rpath,\ > -Wl,-rpath,\ -L\ -ldl -lmpich -Wl,-rpath,\ -L\ -lgm -lpthread > -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib > -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa -lunwind -ldl > > > This is the result I get for running 20 steps. There are 2 matrix to be > solved. 
I've only parallize the solving of linear equations and kept the > rest of the code serial for this test. However, I found that it's much > slower than the sequential version. > > From the ratio, it seems that MatScale and VecSet 's ratio are very high. > I've done a scaling of 0.5 for momentum eqn. Is that the reason for the > slowness? That is all I can decipher .... > > Thank you. > > > > > > On 2/10/07, Matthew Knepley wrote: > > > > On 2/9/07, Ben Tay < zonexo at gmail.com> wrote: > > > > > > ops.... it worked for ex2 and ex2f ;-) > > > > > > so what could be wrong? is there some commands or subroutine which i > > > must call? btw, i'm programming in fortran. > > > > > > > Yes, you must call PetscFinalize() in your code. > > > > Matt > > > > > > thank you. > > > > > > > > > On 2/9/07, Matthew Knepley wrote: > > > > > > > > Problems do not go away by ignoring them. Something is wrong here, > > > > and it may > > > > affect the rest of your program. Please try to run an example: > > > > > > > > cd src/ksp/ksp/examples/tutorials > > > > make ex2 > > > > ./ex2 -log_summary > > > > > > > > Matt > > > > > > > > On 2/9/07, Ben Tay wrote: > > > > > > > > > > Well, I don't know what's wrong. I did the same thing for -info > > > > > and it worked. Anyway, is there any other way? > > > > > > > > > > Like I can use -mat_view or call matview( ... ) to view a matrix. > > > > > Is there a similar subroutine for me to call? > > > > > > > > > > Thank you. > > > > > > > > > > > > > > > On 2/9/07, Matthew Knepley wrote: > > > > > > > > > > > > Impossible, please check the spelling, and make sure your > > > > > > command line was not truncated. > > > > > > > > > > > > Matt > > > > > > > > > > > > On 2/9/07, Ben Tay < zonexo at gmail.com> wrote: > > > > > > > > > > > > > > ya, i did use -log_summary. but no output..... > > > > > > > > > > > > > > On 2/9/07, Barry Smith wrote: > > > > > > > > > > > > > > > > > > > > > > > > -log_summary > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 9 Feb 2007, Ben Tay wrote: > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > I've tried to use log_summary but nothing came out? Did I > > > > > > > > miss out > > > > > > > > > something? It worked when I used -info... > > > > > > > > > > > > > > > > > > > > > > > > > > > On 2/9/07, Lisandro Dalcin wrote: > > > > > > > > > > > > > > > > > > > > On 2/8/07, Ben Tay < zonexo at gmail.com> wrote: > > > > > > > > > > > i'm trying to solve my cfd code using PETSc in > > > > > > > > parallel. Besides the > > > > > > > > > > linear > > > > > > > > > > > eqns for PETSc, other parts of the code has also been > > > > > > > > parallelized using > > > > > > > > > > > MPI. > > > > > > > > > > > > > > > > > > > > Finite elements or finite differences, or what? > > > > > > > > > > > > > > > > > > > > > however i find that the parallel version of the code > > > > > > > > running on 4 > > > > > > > > > > processors > > > > > > > > > > > is even slower than the sequential version. > > > > > > > > > > > > > > > > > > > > Can you monitor the convergence and iteration count of > > > > > > > > momentum and > > > > > > > > > > poisson steps? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > in order to find out why, i've used the -info option > > > > > > > > to print out the > > > > > > > > > > > details. there are 2 linear equations being solved - > > > > > > > > momentum and > > > > > > > > > > poisson. > > > > > > > > > > > the momentum one is twice the size of the poisson. 
it > > > > > > > > is shown below: > > > > > > > > > > > > > > > > > > > > Can you use -log_summary command line option and send > > > > > > > > the output attached? > > > > > > > > > > > > > > > > > > > > > i saw some statements stating "seq". am i running in > > > > > > > > sequential or > > > > > > > > > > parallel > > > > > > > > > > > mode? have i preallocated too much space? > > > > > > > > > > > > > > > > > > > > It seems you are running in parallel. The "Seq" are > > > > > > > > related to local, > > > > > > > > > > internal objects. In PETSc, parallel matrices have inner > > > > > > > > sequential > > > > > > > > > > matrices. > > > > > > > > > > > > > > > > > > > > > lastly, if Ax=b, A_sta and A_end > > > > > > > > from MatGetOwnershipRange and b_sta > > > > > > > > > > and > > > > > > > > > > > b_end from VecGetOwnershipRange should always be the > > > > > > > > same value, right? > > > > > > > > > > > > > > > > > > > > I should. If not, you are likely going to get an runtime > > > > > > > > error. > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Lisandro Dalc?n > > > > > > > > > > --------------- > > > > > > > > > > Centro Internacional de M?todos Computacionales en > > > > > > > > Ingenier?a (CIMEC) > > > > > > > > > > Instituto de Desarrollo Tecnol?gico para la Industria > > > > > > > > Qu?mica (INTEC) > > > > > > > > > > Consejo Nacional de Investigaciones Cient?ficas y > > > > > > > > T?cnicas (CONICET) > > > > > > > > > > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > > > > > > > > > > Tel/Fax: +54-(0)342-451.1594 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > One trouble is that despite this system, anyone who reads > > > > > > journals widely > > > > > > and critically is forced to realize that there are scarcely any > > > > > > bars to eventual > > > > > > publication. There seems to be no study too fragmented, no > > > > > > hypothesis too > > > > > > trivial, no literature citation too biased or too egotistical, > > > > > > no design too > > > > > > warped, no methodology too bungled, no presentation of results > > > > > > too > > > > > > inaccurate, too obscure, and too contradictory, no analysis too > > > > > > self-serving, > > > > > > no argument too circular, no conclusions too trifling or too > > > > > > unjustified, and > > > > > > no grammar and syntax too offensive for a paper to end up in > > > > > > print. -- Drummond Rennie > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > One trouble is that despite this system, anyone who reads journals > > > > widely > > > > and critically is forced to realize that there are scarcely any bars > > > > to eventual > > > > publication. There seems to be no study too fragmented, no > > > > hypothesis too > > > > trivial, no literature citation too biased or too egotistical, no > > > > design too > > > > warped, no methodology too bungled, no presentation of results too > > > > inaccurate, too obscure, and too contradictory, no analysis too > > > > self-serving, > > > > no argument too circular, no conclusions too trifling or too > > > > unjustified, and > > > > no grammar and syntax too offensive for a paper to end up in print. 
> > > > -- Drummond Rennie > > > > > > > > > > > > > > > > -- > > One trouble is that despite this system, anyone who reads journals > > widely > > and critically is forced to realize that there are scarcely any bars to > > eventual > > publication. There seems to be no study too fragmented, no hypothesis > > too > > trivial, no literature citation too biased or too egotistical, no design > > too > > warped, no methodology too bungled, no presentation of results too > > inaccurate, too obscure, and too contradictory, no analysis too > > self-serving, > > no argument too circular, no conclusions too trifling or too > > unjustified, and > > no grammar and syntax too offensive for a paper to end up in print. -- > > Drummond Rennie > > > > -- One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. -- Drummond Rennie -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at mcs.anl.gov Fri Feb 9 19:27:33 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 9 Feb 2007 19:27:33 -0600 (CST) Subject: understanding the output from -info In-Reply-To: References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090434w4f0674e6s1c936cb410f3744a@mail.gmail.com> <804ab5d40702090620u5cf86c51s4e1b7b724eaf4f98@mail.gmail.com> <804ab5d40702090724n73db6f8w574622903161eb4a@mail.gmail.com> <804ab5d40702090816qb6d1325g1d311a0eb53eec26@mail.gmail.com> <804ab5d40702091651h6265a510jf5d4ca46cd526876@mail.gmail.com> Message-ID: Ben, 1) > > > > > > ########################################################## > > # # > > # WARNING!!! # > > # # > > # This code was compiled with a debugging option, # > > # To get timing results run config/configure.py # > > # using --with-debugging=no, the performance will # > > # be generally two or three times faster. # > > # # > > ########################################################## 2) In general to get any decent parallel performance you need to have at least 10,000 unknowns per process. 3) It is important that each proces have roughly the same number of nonzeros in the matrix. > > Event Count Time (sec) > > Flops/sec --- Global --- --- Stage --- Total > > Max Ratio Max Ratio Max Ratio Mess Avg len > > MatSolve 3967 1.0 2.5914e+00 1.9 7.99e+07 1.9 0.0e+00 0.0e+00 ^^^^^^ One process is taking 1.9 times for the matsolves then the fastest one. Since the MatSolves are not parallel this likely means that the "slow process" has much more nonzeros thant the "fast process" Barry On Fri, 9 Feb 2007, Matthew Knepley wrote: > 1) These MFlop rates are terrible. It seems like your problem is way too > small. > > 2) The load balance is not good. > > Matt > > On 2/9/07, Ben Tay wrote: > > > > Ya, that's the mistake. I changed part of the code resulting in > > PetscFinalize not being called. 
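For anyone hitting the same "-log_summary prints nothing" symptom: the summary table is generated inside PetscFinalize(), so a code path that never reaches that call produces no output at all. A minimal sketch of the required structure is below, in C for brevity (the code in this thread is Fortran, where the same rule applies to the corresponding PetscInitialize/PetscFinalize calls); everything between the two calls is just a placeholder.

    /* Minimal program structure needed for -log_summary to produce output:
       the report is printed from within PetscFinalize(). */
    #include "petscksp.h"

    int main(int argc, char **argv)
    {
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);CHKERRQ(ierr);

      /* ... create Mat/Vec/KSP objects, assemble, call KSPSolve(), etc. ... */

      ierr = PetscFinalize();CHKERRQ(ierr);   /* -log_summary output appears here */
      return 0;
    }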
> > > > > -- Drummond Rennie > > > > > > > > > > > > > > > > > > > > > > -- > > > One trouble is that despite this system, anyone who reads journals > > > widely > > > and critically is forced to realize that there are scarcely any bars to > > > eventual > > > publication. There seems to be no study too fragmented, no hypothesis > > > too > > > trivial, no literature citation too biased or too egotistical, no design > > > too > > > warped, no methodology too bungled, no presentation of results too > > > inaccurate, too obscure, and too contradictory, no analysis too > > > self-serving, > > > no argument too circular, no conclusions too trifling or too > > > unjustified, and > > > no grammar and syntax too offensive for a paper to end up in print. -- > > > Drummond Rennie > > > > > > > > > > From balay at mcs.anl.gov Fri Feb 9 19:41:16 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Fri, 9 Feb 2007 19:41:16 -0600 (CST) Subject: understanding the output from -info In-Reply-To: References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090434w4f0674e6s1c936cb410f3744a@mail.gmail.com> <804ab5d40702090620u5cf86c51s4e1b7b724eaf4f98@mail.gmail.com> <804ab5d40702090724n73db6f8w574622903161eb4a@mail.gmail.com> <804ab5d40702090816qb6d1325g1d311a0eb53eec26@mail.gmail.com> <804ab5d40702091651h6265a510jf5d4ca46cd526876@mail.gmail.com> Message-ID: Looks like MatMult = 24sec Out of this the scatter time is: 22sec. Either something is wrong with your run - or MPI is really broken.. Satish > > > MatMult 3927 1.0 2.4071e+01 1.3 6.14e+06 1.4 2.4e+04 1.3e+03 > > > VecScatterBegin 3927 1.0 2.8672e-01 3.9 0.00e+00 0.0 2.4e+04 1.3e+03 > > > VecScatterEnd 3927 1.0 2.2135e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 From jinzishuai at yahoo.com Fri Feb 9 20:42:29 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 9 Feb 2007 18:42:29 -0800 (PST) Subject: A 3D example of KSPSolve? In-Reply-To: Message-ID: <790923.93477.qm@web36208.mail.mud.yahoo.com> Thank you. But my code has 10 calls to KSPSolve of three different linear systems at each time update. Should I strip it down to a single KSPSolve so that it is easier to analysis? I might have the code dump the Matrix and vector and write another code to read them into and call KSPSolve. I don't know whether this is worth doing or should I just send in the messy log file of the whole run. Thanks for any advice. Shi --- Barry Smith wrote: > > Shi, > > There is never a better test problem then your > actual problem. > Send the results from running on 1, 4, and 8 > processes with the options > -log_summary -ksp_view (use the optimized version of > PETSc (running > config/configure.py --with-debugging=0)) > > Barry > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > Hi there, > > > > I am tuning our 3D FEM CFD code written with > PETSc. > > The code doesn't scale very well. For example, > with 8 > > processes on a linux cluster, the speedup we > achieve > > with a fairly large problem size(million of > elements) > > is only 3 to 4 using the Congugate gradient > solver. We > > can achieve a speed up of a 6.5 using a GMRes > solver > > but the wall clock time of a GMRes is longer than > a CG > > solver which indicates that CG is the faster > solver > > and it scales not as good as GMRes. Is this > generally > > true? > > > > I then went to the examples and find a 2D example > of > > KSPSolve (ex2.c). I let the code ran with a > 1000x1000 > > mesh and get a linear scaling of the CG solver and > a > > super linear scaling of the GMRes. 
These are both > much > > better than our code. However, I think the 2D > nature > > of the sample problem might help the scaling of > the > > code. So I would like to try some 3D example using > the > > KSPSolve. Unfortunately, I couldn't find such an > > example either in the > src/ksp/ksp/examples/tutorials > > directory or by google search. There are a couple > of > > 3D examples in the src/ksp/ksp/examples/tutorials > but > > they are about the SNES not KSPSolve. If anyone > can > > provide me with such an example, I would really > > appreciate it. > > Thanks a lot. > > > > Shi > > > > > > > > > ____________________________________________________________________________________ > > Finding fabulous fares is fun. > > Let Yahoo! FareChase search your favorite travel > sites to find flight and hotel bargains. > > http://farechase.yahoo.com/promo-generic-14795097 > > > > > > ____________________________________________________________________________________ 8:00? 8:25? 8:40? Find a flick in no time with the Yahoo! Search movie showtime shortcut. http://tools.search.yahoo.com/shortcuts/#news From bsmith at mcs.anl.gov Fri Feb 9 20:47:17 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 9 Feb 2007 20:47:17 -0600 (CST) Subject: A 3D example of KSPSolve? In-Reply-To: <790923.93477.qm@web36208.mail.mud.yahoo.com> References: <790923.93477.qm@web36208.mail.mud.yahoo.com> Message-ID: NO, NO, don't spend time stripping your code! Unproductive See the manul pages for PetscLogStageRegister(), PetscLogStagePush() and PetscLogStagePop(). All you need to do is maintain a seperate stage for each of your KSPSolves; in your case you'll create 3 stages. Barry On Fri, 9 Feb 2007, Shi Jin wrote: > Thank you. > But my code has 10 calls to KSPSolve of three > different linear systems at each time update. Should I > strip it down to a single KSPSolve so that it is > easier to analysis? I might have the code dump the > Matrix and vector and write another code to read them > into and call KSPSolve. I don't know whether this is > worth doing or should I just send in the messy log > file of the whole run. > Thanks for any advice. > > Shi > > --- Barry Smith wrote: > > > > > Shi, > > > > There is never a better test problem then your > > actual problem. > > Send the results from running on 1, 4, and 8 > > processes with the options > > -log_summary -ksp_view (use the optimized version of > > PETSc (running > > config/configure.py --with-debugging=0)) > > > > Barry > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > Hi there, > > > > > > I am tuning our 3D FEM CFD code written with > > PETSc. > > > The code doesn't scale very well. For example, > > with 8 > > > processes on a linux cluster, the speedup we > > achieve > > > with a fairly large problem size(million of > > elements) > > > is only 3 to 4 using the Congugate gradient > > solver. We > > > can achieve a speed up of a 6.5 using a GMRes > > solver > > > but the wall clock time of a GMRes is longer than > > a CG > > > solver which indicates that CG is the faster > > solver > > > and it scales not as good as GMRes. Is this > > generally > > > true? > > > > > > I then went to the examples and find a 2D example > > of > > > KSPSolve (ex2.c). I let the code ran with a > > 1000x1000 > > > mesh and get a linear scaling of the CG solver and > > a > > > super linear scaling of the GMRes. These are both > > much > > > better than our code. However, I think the 2D > > nature > > > of the sample problem might help the scaling of > > the > > > code. 
So I would like to try some 3D example using > > the > > > KSPSolve. Unfortunately, I couldn't find such an > > > example either in the > > src/ksp/ksp/examples/tutorials > > > directory or by google search. There are a couple > > of > > > 3D examples in the src/ksp/ksp/examples/tutorials > > but > > > they are about the SNES not KSPSolve. If anyone > > can > > > provide me with such an example, I would really > > > appreciate it. > > > Thanks a lot. > > > > > > Shi > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Finding fabulous fares is fun. > > > Let Yahoo! FareChase search your favorite travel > > sites to find flight and hotel bargains. > > > http://farechase.yahoo.com/promo-generic-14795097 > > > > > > > > > > > > > > > ____________________________________________________________________________________ > 8:00? 8:25? 8:40? Find a flick in no time > with the Yahoo! Search movie showtime shortcut. > http://tools.search.yahoo.com/shortcuts/#news > > From jinzishuai at yahoo.com Fri Feb 9 21:01:09 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 9 Feb 2007 19:01:09 -0800 (PST) Subject: A 3D example of KSPSolve? In-Reply-To: Message-ID: <867640.48509.qm@web36210.mail.mud.yahoo.com> Dear Barry, Thank you. I actually have done the staging already. I summarized the timing of the runs in google online spreadsheets. I have two runs. 1. with 400,000 finite elements: http://spreadsheets.google.com/pub?key=pZHoqlL60quZeDZlucTjEIA 2. with 1,600,000 finite elements: http://spreadsheets.google.com/pub?key=pZHoqlL60quZcCVLAqmzqQQ If you can take a look at them and give me some advice, I will be deeply grateful. Shi --- Barry Smith wrote: > > NO, NO, don't spend time stripping your code! > Unproductive > > See the manul pages for PetscLogStageRegister(), > PetscLogStagePush() and > PetscLogStagePop(). All you need to do is maintain a > seperate stage for each > of your KSPSolves; in your case you'll create 3 > stages. > > Barry > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > Thank you. > > But my code has 10 calls to KSPSolve of three > > different linear systems at each time update. > Should I > > strip it down to a single KSPSolve so that it is > > easier to analysis? I might have the code dump the > > Matrix and vector and write another code to read > them > > into and call KSPSolve. I don't know whether this > is > > worth doing or should I just send in the messy > log > > file of the whole run. > > Thanks for any advice. > > > > Shi > > > > --- Barry Smith wrote: > > > > > > > > Shi, > > > > > > There is never a better test problem then > your > > > actual problem. > > > Send the results from running on 1, 4, and 8 > > > processes with the options > > > -log_summary -ksp_view (use the optimized > version of > > > PETSc (running > > > config/configure.py --with-debugging=0)) > > > > > > Barry > > > > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > Hi there, > > > > > > > > I am tuning our 3D FEM CFD code written with > > > PETSc. > > > > The code doesn't scale very well. For example, > > > with 8 > > > > processes on a linux cluster, the speedup we > > > achieve > > > > with a fairly large problem size(million of > > > elements) > > > > is only 3 to 4 using the Congugate gradient > > > solver. 
We > > > > can achieve a speed up of a 6.5 using a GMRes > > > solver > > > > but the wall clock time of a GMRes is longer > than > > > a CG > > > > solver which indicates that CG is the faster > > > solver > > > > and it scales not as good as GMRes. Is this > > > generally > > > > true? > > > > > > > > I then went to the examples and find a 2D > example > > > of > > > > KSPSolve (ex2.c). I let the code ran with a > > > 1000x1000 > > > > mesh and get a linear scaling of the CG solver > and > > > a > > > > super linear scaling of the GMRes. These are > both > > > much > > > > better than our code. However, I think the 2D > > > nature > > > > of the sample problem might help the scaling > of > > > the > > > > code. So I would like to try some 3D example > using > > > the > > > > KSPSolve. Unfortunately, I couldn't find such > an > > > > example either in the > > > src/ksp/ksp/examples/tutorials > > > > directory or by google search. There are a > couple > > > of > > > > 3D examples in the > src/ksp/ksp/examples/tutorials > > > but > > > > they are about the SNES not KSPSolve. If > anyone > > > can > > > > provide me with such an example, I would > really > > > > appreciate it. > > > > Thanks a lot. > > > > > > > > Shi > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > Finding fabulous fares is fun. > > > > Let Yahoo! FareChase search your favorite > travel > > > sites to find flight and hotel bargains. > > > > > http://farechase.yahoo.com/promo-generic-14795097 > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > 8:00? 8:25? 8:40? Find a flick in no time > > with the Yahoo! Search movie showtime shortcut. > > http://tools.search.yahoo.com/shortcuts/#news > > > > > > ____________________________________________________________________________________ Looking for earth-friendly autos? Browse Top Cars by "Green Rating" at Yahoo! Autos' Green Center. http://autos.yahoo.com/green_center/ From knepley at gmail.com Fri Feb 9 21:06:43 2007 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 9 Feb 2007 21:06:43 -0600 Subject: A 3D example of KSPSolve? In-Reply-To: <867640.48509.qm@web36210.mail.mud.yahoo.com> References: <867640.48509.qm@web36210.mail.mud.yahoo.com> Message-ID: You really have to give us the log summary output. None of the relevant numbers are in your summary. Thanks, Matt On 2/9/07, Shi Jin wrote: > > Dear Barry, > > Thank you. > I actually have done the staging already. > I summarized the timing of the runs in google online > spreadsheets. I have two runs. > 1. with 400,000 finite elements: > http://spreadsheets.google.com/pub?key=pZHoqlL60quZeDZlucTjEIA > 2. with 1,600,000 finite elements: > http://spreadsheets.google.com/pub?key=pZHoqlL60quZcCVLAqmzqQQ > > If you can take a look at them and give me some > advice, I will be deeply grateful. > > Shi > --- Barry Smith wrote: > > > > > NO, NO, don't spend time stripping your code! > > Unproductive > > > > See the manul pages for PetscLogStageRegister(), > > PetscLogStagePush() and > > PetscLogStagePop(). All you need to do is maintain a > > seperate stage for each > > of your KSPSolves; in your case you'll create 3 > > stages. > > > > Barry > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > Thank you. > > > But my code has 10 calls to KSPSolve of three > > > different linear systems at each time update. 
> > Should I > > > strip it down to a single KSPSolve so that it is > > > easier to analysis? I might have the code dump the > > > Matrix and vector and write another code to read > > them > > > into and call KSPSolve. I don't know whether this > > is > > > worth doing or should I just send in the messy > > log > > > file of the whole run. > > > Thanks for any advice. > > > > > > Shi > > > > > > --- Barry Smith wrote: > > > > > > > > > > > Shi, > > > > > > > > There is never a better test problem then > > your > > > > actual problem. > > > > Send the results from running on 1, 4, and 8 > > > > processes with the options > > > > -log_summary -ksp_view (use the optimized > > version of > > > > PETSc (running > > > > config/configure.py --with-debugging=0)) > > > > > > > > Barry > > > > > > > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > > > Hi there, > > > > > > > > > > I am tuning our 3D FEM CFD code written with > > > > PETSc. > > > > > The code doesn't scale very well. For example, > > > > with 8 > > > > > processes on a linux cluster, the speedup we > > > > achieve > > > > > with a fairly large problem size(million of > > > > elements) > > > > > is only 3 to 4 using the Congugate gradient > > > > solver. We > > > > > can achieve a speed up of a 6.5 using a GMRes > > > > solver > > > > > but the wall clock time of a GMRes is longer > > than > > > > a CG > > > > > solver which indicates that CG is the faster > > > > solver > > > > > and it scales not as good as GMRes. Is this > > > > generally > > > > > true? > > > > > > > > > > I then went to the examples and find a 2D > > example > > > > of > > > > > KSPSolve (ex2.c). I let the code ran with a > > > > 1000x1000 > > > > > mesh and get a linear scaling of the CG solver > > and > > > > a > > > > > super linear scaling of the GMRes. These are > > both > > > > much > > > > > better than our code. However, I think the 2D > > > > nature > > > > > of the sample problem might help the scaling > > of > > > > the > > > > > code. So I would like to try some 3D example > > using > > > > the > > > > > KSPSolve. Unfortunately, I couldn't find such > > an > > > > > example either in the > > > > src/ksp/ksp/examples/tutorials > > > > > directory or by google search. There are a > > couple > > > > of > > > > > 3D examples in the > > src/ksp/ksp/examples/tutorials > > > > but > > > > > they are about the SNES not KSPSolve. If > > anyone > > > > can > > > > > provide me with such an example, I would > > really > > > > > appreciate it. > > > > > Thanks a lot. > > > > > > > > > > Shi > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > > Finding fabulous fares is fun. > > > > > Let Yahoo! FareChase search your favorite > > travel > > > > sites to find flight and hotel bargains. > > > > > > > http://farechase.yahoo.com/promo-generic-14795097 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > 8:00? 8:25? 8:40? Find a flick in no time > > > with the Yahoo! Search movie showtime shortcut. > > > http://tools.search.yahoo.com/shortcuts/#news > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > Looking for earth-friendly autos? > Browse Top Cars by "Green Rating" at Yahoo! Autos' Green Center. 
> http://autos.yahoo.com/green_center/ > > -- One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. -- Drummond Rennie -------------- next part -------------- An HTML attachment was scrubbed... URL: From jinzishuai at yahoo.com Fri Feb 9 21:23:10 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 9 Feb 2007 19:23:10 -0800 (PST) Subject: A 3D example of KSPSolve? In-Reply-To: Message-ID: <292553.57576.qm@web36210.mail.mud.yahoo.com> Sorry that is not informative. So I decide to attach the 5 files for NP=1,2,4,8,16 for the 400,000 finite element case. Please note that the simulation runs over 100 steps. The 1st step is first order update, named as stage 1. The rest 99 steps are second order updates. Within that, stage 2-9 are created for the 8 stages of a second order update. We should concentrate on the second order updates. So four calls to KSPSolve in the log file are important, in stage 4,5,6,and 8 separately. Pleaes let me know if you need any other information or explanation. Thank you very much. Shi --- Matthew Knepley wrote: > You really have to give us the log summary output. > None of the relevant > numbers are in your summary. > > Thanks, > > Matt > > On 2/9/07, Shi Jin wrote: > > > > Dear Barry, > > > > Thank you. > > I actually have done the staging already. > > I summarized the timing of the runs in google > online > > spreadsheets. I have two runs. > > 1. with 400,000 finite elements: > > > http://spreadsheets.google.com/pub?key=pZHoqlL60quZeDZlucTjEIA > > 2. with 1,600,000 finite elements: > > > http://spreadsheets.google.com/pub?key=pZHoqlL60quZcCVLAqmzqQQ > > > > If you can take a look at them and give me some > > advice, I will be deeply grateful. > > > > Shi > > --- Barry Smith wrote: > > > > > > > > NO, NO, don't spend time stripping your code! > > > Unproductive > > > > > > See the manul pages for > PetscLogStageRegister(), > > > PetscLogStagePush() and > > > PetscLogStagePop(). All you need to do is > maintain a > > > seperate stage for each > > > of your KSPSolves; in your case you'll create 3 > > > stages. > > > > > > Barry > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > Thank you. > > > > But my code has 10 calls to KSPSolve of three > > > > different linear systems at each time update. > > > Should I > > > > strip it down to a single KSPSolve so that it > is > > > > easier to analysis? I might have the code dump > the > > > > Matrix and vector and write another code to > read > > > them > > > > into and call KSPSolve. I don't know whether > this > > > is > > > > worth doing or should I just send in the > messy > > > log > > > > file of the whole run. > > > > Thanks for any advice. > > > > > > > > Shi > > > > > > > > --- Barry Smith wrote: > > > > > > > > > > > > > > Shi, > > > > > > > > > > There is never a better test problem then > > > your > > > > > actual problem. 
> > > > > Send the results from running on 1, 4, and 8 > > > > > processes with the options > > > > > -log_summary -ksp_view (use the optimized > > > version of > > > > > PETSc (running > > > > > config/configure.py --with-debugging=0)) > > > > > > > > > > Barry > > > > > > > > > > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > > > > > Hi there, > > > > > > > > > > > > I am tuning our 3D FEM CFD code written > with > > > > > PETSc. > > > > > > The code doesn't scale very well. For > example, > > > > > with 8 > > > > > > processes on a linux cluster, the speedup > we > > > > > achieve > > > > > > with a fairly large problem size(million > of > > > > > elements) > > > > > > is only 3 to 4 using the Congugate > gradient > > > > > solver. We > > > > > > can achieve a speed up of a 6.5 using a > GMRes > > > > > solver > > > > > > but the wall clock time of a GMRes is > longer > > > than > > > > > a CG > > > > > > solver which indicates that CG is the > faster > > > > > solver > > > > > > and it scales not as good as GMRes. Is > this > > > > > generally > > > > > > true? > > > > > > > > > > > > I then went to the examples and find a 2D > > > example > > > > > of > > > > > > KSPSolve (ex2.c). I let the code ran with > a > > > > > 1000x1000 > > > > > > mesh and get a linear scaling of the CG > solver > > > and > > > > > a > > > > > > super linear scaling of the GMRes. These > are > > > both > > > > > much > > > > > > better than our code. However, I think the > 2D > > > > > nature > > > > > > of the sample problem might help the > scaling > > > of > > > > > the > > > > > > code. So I would like to try some 3D > example > > > using > > > > > the > > > > > > KSPSolve. Unfortunately, I couldn't find > such > > > an > > > > > > example either in the > > > > > src/ksp/ksp/examples/tutorials > > > > > > directory or by google search. There are a > > > couple > > > > > of > > > > > > 3D examples in the > > > src/ksp/ksp/examples/tutorials > > > > > but > > > > > > they are about the SNES not KSPSolve. If > > > anyone > > > > > can > > > > > > provide me with such an example, I would > > > really > > > > > > appreciate it. > > > > > > Thanks a lot. > > > > > > > > > > > > Shi > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > > > Finding fabulous fares is fun. > > > > > > Let Yahoo! FareChase search your favorite > > > travel > > > > > sites to find flight and hotel bargains. > > > > > > > > > > http://farechase.yahoo.com/promo-generic-14795097 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > 8:00? 8:25? 8:40? Find a flick in no time > > > > with the Yahoo! Search movie showtime > shortcut. > > > > http://tools.search.yahoo.com/shortcuts/#news > > > > > === message truncated === ____________________________________________________________________________________ Finding fabulous fares is fun. Let Yahoo! FareChase search your favorite travel sites to find flight and hotel bargains. http://farechase.yahoo.com/promo-generic-14795097 -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: log-1.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: log-2.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: log-4.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: log-8.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: log-16.txt URL: From bsmith at mcs.anl.gov Fri Feb 9 22:37:18 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 9 Feb 2007 22:37:18 -0600 (CST) Subject: A 3D example of KSPSolve? In-Reply-To: <292553.57576.qm@web36210.mail.mud.yahoo.com> References: <292553.57576.qm@web36210.mail.mud.yahoo.com> Message-ID: What are all the calls for MatGetRow() for? They are consuming a great deal of time. Is there anyway to get rid of them? Barry On Fri, 9 Feb 2007, Shi Jin wrote: > Sorry that is not informative. > So I decide to attach the 5 files for NP=1,2,4,8,16 > for > the 400,000 finite element case. > > Please note that the simulation runs over 100 steps. > The 1st step is first order update, named as stage 1. > The rest 99 steps are second order updates. Within > that, stage 2-9 are created for the 8 stages of a > second order update. We should concentrate on the > second order updates. So four calls to KSPSolve in the > log file are important, in stage 4,5,6,and 8 > separately. > Pleaes let me know if you need any other information > or explanation. > Thank you very much. > > Shi > --- Matthew Knepley wrote: > > > You really have to give us the log summary output. > > None of the relevant > > numbers are in your summary. > > > > Thanks, > > > > Matt > > > > On 2/9/07, Shi Jin wrote: > > > > > > Dear Barry, > > > > > > Thank you. > > > I actually have done the staging already. > > > I summarized the timing of the runs in google > > online > > > spreadsheets. I have two runs. > > > 1. with 400,000 finite elements: > > > > > > http://spreadsheets.google.com/pub?key=pZHoqlL60quZeDZlucTjEIA > > > 2. with 1,600,000 finite elements: > > > > > > http://spreadsheets.google.com/pub?key=pZHoqlL60quZcCVLAqmzqQQ > > > > > > If you can take a look at them and give me some > > > advice, I will be deeply grateful. > > > > > > Shi > > > --- Barry Smith wrote: > > > > > > > > > > > NO, NO, don't spend time stripping your code! > > > > Unproductive > > > > > > > > See the manul pages for > > PetscLogStageRegister(), > > > > PetscLogStagePush() and > > > > PetscLogStagePop(). All you need to do is > > maintain a > > > > seperate stage for each > > > > of your KSPSolves; in your case you'll create 3 > > > > stages. > > > > > > > > Barry > > > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > > > Thank you. > > > > > But my code has 10 calls to KSPSolve of three > > > > > different linear systems at each time update. > > > > Should I > > > > > strip it down to a single KSPSolve so that it > > is > > > > > easier to analysis? I might have the code dump > > the > > > > > Matrix and vector and write another code to > > read > > > > them > > > > > into and call KSPSolve. I don't know whether > > this > > > > is > > > > > worth doing or should I just send in the > > messy > > > > log > > > > > file of the whole run. > > > > > Thanks for any advice. > > > > > > > > > > Shi > > > > > > > > > > --- Barry Smith wrote: > > > > > > > > > > > > > > > > > Shi, > > > > > > > > > > > > There is never a better test problem then > > > > your > > > > > > actual problem. 
> > > > > > Send the results from running on 1, 4, and 8 > > > > > > processes with the options > > > > > > -log_summary -ksp_view (use the optimized > > > > version of > > > > > > PETSc (running > > > > > > config/configure.py --with-debugging=0)) > > > > > > > > > > > > Barry > > > > > > > > > > > > > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > > > > > > > Hi there, > > > > > > > > > > > > > > I am tuning our 3D FEM CFD code written > > with > > > > > > PETSc. > > > > > > > The code doesn't scale very well. For > > example, > > > > > > with 8 > > > > > > > processes on a linux cluster, the speedup > > we > > > > > > achieve > > > > > > > with a fairly large problem size(million > > of > > > > > > elements) > > > > > > > is only 3 to 4 using the Congugate > > gradient > > > > > > solver. We > > > > > > > can achieve a speed up of a 6.5 using a > > GMRes > > > > > > solver > > > > > > > but the wall clock time of a GMRes is > > longer > > > > than > > > > > > a CG > > > > > > > solver which indicates that CG is the > > faster > > > > > > solver > > > > > > > and it scales not as good as GMRes. Is > > this > > > > > > generally > > > > > > > true? > > > > > > > > > > > > > > I then went to the examples and find a 2D > > > > example > > > > > > of > > > > > > > KSPSolve (ex2.c). I let the code ran with > > a > > > > > > 1000x1000 > > > > > > > mesh and get a linear scaling of the CG > > solver > > > > and > > > > > > a > > > > > > > super linear scaling of the GMRes. These > > are > > > > both > > > > > > much > > > > > > > better than our code. However, I think the > > 2D > > > > > > nature > > > > > > > of the sample problem might help the > > scaling > > > > of > > > > > > the > > > > > > > code. So I would like to try some 3D > > example > > > > using > > > > > > the > > > > > > > KSPSolve. Unfortunately, I couldn't find > > such > > > > an > > > > > > > example either in the > > > > > > src/ksp/ksp/examples/tutorials > > > > > > > directory or by google search. There are a > > > > couple > > > > > > of > > > > > > > 3D examples in the > > > > src/ksp/ksp/examples/tutorials > > > > > > but > > > > > > > they are about the SNES not KSPSolve. If > > > > anyone > > > > > > can > > > > > > > provide me with such an example, I would > > > > really > > > > > > > appreciate it. > > > > > > > Thanks a lot. > > > > > > > > > > > > > > Shi > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > > > > Finding fabulous fares is fun. > > > > > > > Let Yahoo! FareChase search your favorite > > > > travel > > > > > > sites to find flight and hotel bargains. > > > > > > > > > > > > > http://farechase.yahoo.com/promo-generic-14795097 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > > 8:00? 8:25? 8:40? Find a flick in no time > > > > > with the Yahoo! Search movie showtime > > shortcut. > > > > > http://tools.search.yahoo.com/shortcuts/#news > > > > > > > > === message truncated === > > > > > ____________________________________________________________________________________ > Finding fabulous fares is fun. > Let Yahoo! FareChase search your favorite travel sites to find flight and hotel bargains. 
> http://farechase.yahoo.com/promo-generic-14795097 From jinzishuai at yahoo.com Fri Feb 9 22:56:02 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 9 Feb 2007 20:56:02 -0800 (PST) Subject: A 3D example of KSPSolve? In-Reply-To: Message-ID: <902001.46633.qm@web36203.mail.mud.yahoo.com> MatGetRow are used to build the right hand side vector. We use it in order to get the number of nonzero cols, global col indices and values in a row. The reason it is time consuming is that it is called for each row of the matrix. I am not sure how I can get away without it. Thanks. Shi --- Barry Smith wrote: > > What are all the calls for MatGetRow() for? They > are consuming a > great deal of time. Is there anyway to get rid of > them? > > Barry > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > Sorry that is not informative. > > So I decide to attach the 5 files for > NP=1,2,4,8,16 > > for > > the 400,000 finite element case. > > > > Please note that the simulation runs over 100 > steps. > > The 1st step is first order update, named as stage > 1. > > The rest 99 steps are second order updates. Within > > that, stage 2-9 are created for the 8 stages of a > > second order update. We should concentrate on the > > second order updates. So four calls to KSPSolve in > the > > log file are important, in stage 4,5,6,and 8 > > separately. > > Pleaes let me know if you need any other > information > > or explanation. > > Thank you very much. > > > > Shi > > --- Matthew Knepley wrote: > > > > > You really have to give us the log summary > output. > > > None of the relevant > > > numbers are in your summary. > > > > > > Thanks, > > > > > > Matt > > > > > > On 2/9/07, Shi Jin wrote: > > > > > > > > Dear Barry, > > > > > > > > Thank you. > > > > I actually have done the staging already. > > > > I summarized the timing of the runs in google > > > online > > > > spreadsheets. I have two runs. > > > > 1. with 400,000 finite elements: > > > > > > > > > > http://spreadsheets.google.com/pub?key=pZHoqlL60quZeDZlucTjEIA > > > > 2. with 1,600,000 finite elements: > > > > > > > > > > http://spreadsheets.google.com/pub?key=pZHoqlL60quZcCVLAqmzqQQ > > > > > > > > If you can take a look at them and give me > some > > > > advice, I will be deeply grateful. > > > > > > > > Shi > > > > --- Barry Smith wrote: > > > > > > > > > > > > > > NO, NO, don't spend time stripping your > code! > > > > > Unproductive > > > > > > > > > > See the manul pages for > > > PetscLogStageRegister(), > > > > > PetscLogStagePush() and > > > > > PetscLogStagePop(). All you need to do is > > > maintain a > > > > > seperate stage for each > > > > > of your KSPSolves; in your case you'll > create 3 > > > > > stages. > > > > > > > > > > Barry > > > > > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > > > > > Thank you. > > > > > > But my code has 10 calls to KSPSolve of > three > > > > > > different linear systems at each time > update. > > > > > Should I > > > > > > strip it down to a single KSPSolve so that > it > > > is > > > > > > easier to analysis? I might have the code > dump > > > the > > > > > > Matrix and vector and write another code > to > > > read > > > > > them > > > > > > into and call KSPSolve. I don't know > whether > > > this > > > > > is > > > > > > worth doing or should I just send in the > > > messy > > > > > log > > > > > > file of the whole run. > > > > > > Thanks for any advice. 
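A guess at the loop being described here — visiting every locally owned row and pulling out its column indices and values — looks roughly like the sketch below. The matrix A, the per-row computation, and the use of the entries to assemble the right-hand side are assumptions, and the const qualifiers on the returned arrays vary between PETSc releases; what does not vary is that every MatGetRow() must be matched by a MatRestoreRow():

    PetscInt          rstart, rend, row, ncols;
    const PetscInt    *cols;
    const PetscScalar *vals;

    MatGetOwnershipRange(A, &rstart, &rend);
    for (row = rstart; row < rend; row++) {
      MatGetRow(A, row, &ncols, &cols, &vals);
      /* ... combine cols/vals with some known data to accumulate
             the right-hand-side entry for this row ... */
      MatRestoreRow(A, row, &ncols, &cols, &vals);
    }

Executed once per matrix row at every time step, this adds up, which is the cost being asked about above.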
> > > > > > > > > > > > Shi > > > > > > > > > > > > --- Barry Smith > wrote: > > > > > > > > > > > > > > > > > > > > Shi, > > > > > > > > > > > > > > There is never a better test problem > then > > > > > your > > > > > > > actual problem. > > > > > > > Send the results from running on 1, 4, > and 8 > > > > > > > processes with the options > > > > > > > -log_summary -ksp_view (use the > optimized > > > > > version of > > > > > > > PETSc (running > > > > > > > config/configure.py --with-debugging=0)) > > > > > > > > > > > > > > Barry > > > > > > > > > > > > > > > > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > > > > > > > > > Hi there, > > > > > > > > > > > > > > > > I am tuning our 3D FEM CFD code > written > > > with > > > > > > > PETSc. > > > > > > > > The code doesn't scale very well. For > > > example, > > > > > > > with 8 > > > > > > > > processes on a linux cluster, the > speedup > > > we > > > > > > > achieve > > > > > > > > with a fairly large problem > size(million > > > of > > > > > > > elements) > > > > > > > > is only 3 to 4 using the Congugate > > > gradient > > > > > > > solver. We > > > > > > > > can achieve a speed up of a 6.5 using > a > > > GMRes > > > > > > > solver > > > > > > > > but the wall clock time of a GMRes is > > > longer > > > > > than > > > > > > > a CG > > > > > > > > solver which indicates that CG is the > > > faster > > > > > > > solver > > > > > > > > and it scales not as good as GMRes. Is > > > this > > > > > > > generally > > > > > > > > true? > > > > > > > > > > > > > > > > I then went to the examples and find a > 2D > > > > > example > > > > > > > of > > > > > > > > KSPSolve (ex2.c). I let the code ran > with > > > a > > > > > > > 1000x1000 > > > > > > > > mesh and get a linear scaling of the > CG > > > solver > > > > > and > > > > > > > a > > > > > > > > super linear scaling of the GMRes. > These > > > are > > > > > both > > > > > > > much > > > > > > > > better than our code. However, I think > the > > > 2D > > > > > > > nature > === message truncated === ____________________________________________________________________________________ Cheap talk? Check out Yahoo! Messenger's low PC-to-Phone call rates. http://voice.yahoo.com From balay at mcs.anl.gov Fri Feb 9 23:02:14 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Fri, 9 Feb 2007 23:02:14 -0600 (CST) Subject: A 3D example of KSPSolve? In-Reply-To: References: <292553.57576.qm@web36210.mail.mud.yahoo.com> Message-ID: Just looking at 8 proc run [diffusion stage] we have: MatMult : 79 sec MatMultAdd : 2 sec VecScatterBegin: 17 sec VecScatterEnd : 51 sec So basically the communication in MatMult/Add is represented by VecScatters. Here out of 81 sec total - 68 seconds are used for communication [with a load imbalance of 11 for vecscaterend] So - I think MPI performance is reducing scalability here.. 
Things to try: * -vecstatter_rr etc options I sugested earlier * install mpich with '--with-device=ch3:ssm' and see if it makes a difference Satish --- Event Stage 4: Diffusion [x]rhsLtP 297 1.0 1.1017e+02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 7 0 0 0 0 39 0 0 0 0 0 [x]rhsGravity 99 1.0 4.2582e+0083.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0 VecDot 4657 1.0 2.5748e+01 3.2 7.60e+07 3.2 0.0e+00 0.0e+00 4.7e+03 1 1 0 0 6 5 3 0 0 65 191 VecNorm 2477 1.0 2.2109e+01 2.2 3.22e+07 2.2 0.0e+00 0.0e+00 2.5e+03 1 0 0 0 3 5 2 0 0 35 118 VecScale 594 1.0 2.9330e-02 1.5 2.61e+08 1.5 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1361 VecCopy 594 1.0 2.7552e-01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 3665 1.0 6.0793e-01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAXPY 5251 1.0 2.5892e+00 1.2 3.31e+08 1.2 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 1 4 0 0 0 2137 VecAYPX 1883 1.0 8.6419e-01 1.3 3.62e+08 1.3 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 1 0 0 0 2296 VecScatterBegin 2873 1.0 1.7569e+01 3.0 0.00e+00 0.0 3.8e+04 1.6e+05 0.0e+00 1 0 10 20 0 5 0100100 0 0 VecScatterEnd 2774 1.0 5.1519e+0110.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 7 0 0 0 0 0 MatMult 2477 1.0 7.9186e+01 2.4 2.34e+08 2.4 3.5e+04 1.7e+05 0.0e+00 3 11 9 20 0 20 48 91 98 0 850 MatMultAdd 297 1.0 2.8161e+00 5.4 4.46e+07 2.2 3.6e+03 3.4e+04 0.0e+00 0 0 1 0 0 0 0 9 2 0 125 MatSolve 2477 1.0 6.2245e+01 1.2 1.41e+08 1.2 0.0e+00 0.0e+00 0.0e+00 4 10 0 0 0 22 41 0 0 0 926 MatLUFactorNum 3 1.0 2.7686e-01 1.1 2.79e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2016 MatGetRow 19560420 1.0 5.5195e+01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 20 0 0 0 0 0 KSPSetup 6 1.0 3.0756e-05 2.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 297 1.0 1.3142e+02 1.0 1.31e+08 1.1 3.1e+04 1.7e+05 7.1e+03 8 22 8 18 9 50 93 80 86100 1001 PCSetUp 6 1.0 2.7700e-01 1.1 2.78e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2015 PCSetUpOnBlocks 297 1.0 2.7794e-01 1.1 2.78e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2008 PCApply 2477 1.0 6.2772e+01 1.2 1.39e+08 1.2 0.0e+00 0.0e+00 0.0e+00 4 10 0 0 0 23 41 0 0 0 918 From bsmith at mcs.anl.gov Fri Feb 9 23:07:11 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 9 Feb 2007 23:07:11 -0600 (CST) Subject: A 3D example of KSPSolve? In-Reply-To: <292553.57576.qm@web36210.mail.mud.yahoo.com> References: <292553.57576.qm@web36210.mail.mud.yahoo.com> Message-ID: Shi, The lack of good scaling is coming from two important sources. 1) The MPI on this system is terrible Average time to get PetscTime(): 1.71661e-06 Average time for MPI_Barrier(): 0.008253 Average time for zero size MPI_Send(): 0.000279441 you want to see numbers like 1.e-5 to 1.e-6 instead of 1e-3 to 1e-4 2) The number of iterations for the linear systems is growing too rapidly with more processes. For example in stage 8 it goes from 1782 iterations on 1 process to 3267 on 16 processors. 3) a lessor effect is from a slight inbalance in work between processes, for example in stage 8 the slowest MatSolve is 1.3 times the fastest. Initial suggestions. 0) Get rid of the MatGetRows() 1) it appears your matrices are symmetric? If so, you can use MATMPISBAIJ instead of AIJ, then you can use (incomplete) Cholesky on the blocks. 2) Try using ASM instead of block Jacobi as the preconditioner. 
Use -pc_type asm -pc_asm_type basic -sub_pc_type icc this will decrease the number of iterations in parallel at the cost of more expensive iterations so it may help or may not. 3) Try using hypre's boomeramg for some (poisson?) (all?) of the solves. config/configure.py PETSc with --download-hypre and run with -pc_type hypre -pc_hypre_type boomeramg (if you run this with -help it will show a large number of tuneable options that can really speed things up.) Final note: I would not expect to EVER see more than a speed up of more then say 10 to 12 on this machine, no matter how good the linear solver; due to the slowness of the network. But on a really good network you "might" be able to get 13 or 14 with hypre boomeramg. Barry On Fri, 9 Feb 2007, Shi Jin wrote: > Sorry that is not informative. > So I decide to attach the 5 files for NP=1,2,4,8,16 > for > the 400,000 finite element case. > > Please note that the simulation runs over 100 steps. > The 1st step is first order update, named as stage 1. > The rest 99 steps are second order updates. Within > that, stage 2-9 are created for the 8 stages of a > second order update. We should concentrate on the > second order updates. So four calls to KSPSolve in the > log file are important, in stage 4,5,6,and 8 > separately. > Pleaes let me know if you need any other information > or explanation. > Thank you very much. > > Shi > --- Matthew Knepley wrote: > > > You really have to give us the log summary output. > > None of the relevant > > numbers are in your summary. > > > > Thanks, > > > > Matt > > > > On 2/9/07, Shi Jin wrote: > > > > > > Dear Barry, > > > > > > Thank you. > > > I actually have done the staging already. > > > I summarized the timing of the runs in google > > online > > > spreadsheets. I have two runs. > > > 1. with 400,000 finite elements: > > > > > > http://spreadsheets.google.com/pub?key=pZHoqlL60quZeDZlucTjEIA > > > 2. with 1,600,000 finite elements: > > > > > > http://spreadsheets.google.com/pub?key=pZHoqlL60quZcCVLAqmzqQQ > > > > > > If you can take a look at them and give me some > > > advice, I will be deeply grateful. > > > > > > Shi > > > --- Barry Smith wrote: > > > > > > > > > > > NO, NO, don't spend time stripping your code! > > > > Unproductive > > > > > > > > See the manul pages for > > PetscLogStageRegister(), > > > > PetscLogStagePush() and > > > > PetscLogStagePop(). All you need to do is > > maintain a > > > > seperate stage for each > > > > of your KSPSolves; in your case you'll create 3 > > > > stages. > > > > > > > > Barry > > > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > > > Thank you. > > > > > But my code has 10 calls to KSPSolve of three > > > > > different linear systems at each time update. > > > > Should I > > > > > strip it down to a single KSPSolve so that it > > is > > > > > easier to analysis? I might have the code dump > > the > > > > > Matrix and vector and write another code to > > read > > > > them > > > > > into and call KSPSolve. I don't know whether > > this > > > > is > > > > > worth doing or should I just send in the > > messy > > > > log > > > > > file of the whole run. > > > > > Thanks for any advice. > > > > > > > > > > Shi > > > > > > > > > > --- Barry Smith wrote: > > > > > > > > > > > > > > > > > Shi, > > > > > > > > > > > > There is never a better test problem then > > > > your > > > > > > actual problem. 
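The preconditioner experiments suggested above need no source changes as long as the solver is configured from the options database. A minimal sketch of the calls involved (object names are placeholders, error checking is omitted, and the MatStructure argument of KSPSetOperators() was dropped in later PETSc releases):

    #include "petscksp.h"

    KSP ksp;
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);
    KSPSetFromOptions(ksp);   /* picks up -ksp_type, -pc_type, -sub_pc_type, ... */
    KSPSolve(ksp, b, x);

    /* then, purely from the command line, e.g.
         <launcher> -np 8 ./mysolver -log_summary -ksp_view \
             -pc_type asm -pc_asm_type basic -sub_pc_type icc
       or, with a hypre-enabled build (--download-hypre),
         <launcher> -np 8 ./mysolver -log_summary -ksp_view \
             -pc_type hypre -pc_hypre_type boomeramg
       where ./mysolver and the launcher name are placeholders for the
       actual executable and MPI launch command. */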
> > > > > > Send the results from running on 1, 4, and 8 > > > > > > processes with the options > > > > > > -log_summary -ksp_view (use the optimized > > > > version of > > > > > > PETSc (running > > > > > > config/configure.py --with-debugging=0)) > > > > > > > > > > > > Barry > > > > > > > > > > > > > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > > > > > > > Hi there, > > > > > > > > > > > > > > I am tuning our 3D FEM CFD code written > > with > > > > > > PETSc. > > > > > > > The code doesn't scale very well. For > > example, > > > > > > with 8 > > > > > > > processes on a linux cluster, the speedup > > we > > > > > > achieve > > > > > > > with a fairly large problem size(million > > of > > > > > > elements) > > > > > > > is only 3 to 4 using the Congugate > > gradient > > > > > > solver. We > > > > > > > can achieve a speed up of a 6.5 using a > > GMRes > > > > > > solver > > > > > > > but the wall clock time of a GMRes is > > longer > > > > than > > > > > > a CG > > > > > > > solver which indicates that CG is the > > faster > > > > > > solver > > > > > > > and it scales not as good as GMRes. Is > > this > > > > > > generally > > > > > > > true? > > > > > > > > > > > > > > I then went to the examples and find a 2D > > > > example > > > > > > of > > > > > > > KSPSolve (ex2.c). I let the code ran with > > a > > > > > > 1000x1000 > > > > > > > mesh and get a linear scaling of the CG > > solver > > > > and > > > > > > a > > > > > > > super linear scaling of the GMRes. These > > are > > > > both > > > > > > much > > > > > > > better than our code. However, I think the > > 2D > > > > > > nature > > > > > > > of the sample problem might help the > > scaling > > > > of > > > > > > the > > > > > > > code. So I would like to try some 3D > > example > > > > using > > > > > > the > > > > > > > KSPSolve. Unfortunately, I couldn't find > > such > > > > an > > > > > > > example either in the > > > > > > src/ksp/ksp/examples/tutorials > > > > > > > directory or by google search. There are a > > > > couple > > > > > > of > > > > > > > 3D examples in the > > > > src/ksp/ksp/examples/tutorials > > > > > > but > > > > > > > they are about the SNES not KSPSolve. If > > > > anyone > > > > > > can > > > > > > > provide me with such an example, I would > > > > really > > > > > > > appreciate it. > > > > > > > Thanks a lot. > > > > > > > > > > > > > > Shi > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > > > > Finding fabulous fares is fun. > > > > > > > Let Yahoo! FareChase search your favorite > > > > travel > > > > > > sites to find flight and hotel bargains. > > > > > > > > > > > > > http://farechase.yahoo.com/promo-generic-14795097 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > > 8:00? 8:25? 8:40? Find a flick in no time > > > > > with the Yahoo! Search movie showtime > > shortcut. > > > > > http://tools.search.yahoo.com/shortcuts/#news > > > > > > > > === message truncated === > > > > > ____________________________________________________________________________________ > Finding fabulous fares is fun. > Let Yahoo! FareChase search your favorite travel sites to find flight and hotel bargains. 
> http://farechase.yahoo.com/promo-generic-14795097 From bsmith at mcs.anl.gov Fri Feb 9 23:09:50 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 9 Feb 2007 23:09:50 -0600 (CST) Subject: A 3D example of KSPSolve? In-Reply-To: <902001.46633.qm@web36203.mail.mud.yahoo.com> References: <902001.46633.qm@web36203.mail.mud.yahoo.com> Message-ID: On Fri, 9 Feb 2007, Shi Jin wrote: > MatGetRow are used to build the right hand side > vector. ^^^^^^ Huh? > We use it in order to get the number of nonzero cols, > global col indices and values in a row. Huh? What do you do with all this information? Maybe we can do what you do with this information much more efficiently? Without all the calls to MatGetRow(). Barry > > The reason it is time consuming is that it is called > for each row of the matrix. I am not sure how I can > get away without it. > Thanks. > > Shi > --- Barry Smith wrote: > > > > > What are all the calls for MatGetRow() for? They > > are consuming a > > great deal of time. Is there anyway to get rid of > > them? > > > > Barry > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > Sorry that is not informative. > > > So I decide to attach the 5 files for > > NP=1,2,4,8,16 > > > for > > > the 400,000 finite element case. > > > > > > Please note that the simulation runs over 100 > > steps. > > > The 1st step is first order update, named as stage > > 1. > > > The rest 99 steps are second order updates. Within > > > that, stage 2-9 are created for the 8 stages of a > > > second order update. We should concentrate on the > > > second order updates. So four calls to KSPSolve in > > the > > > log file are important, in stage 4,5,6,and 8 > > > separately. > > > Pleaes let me know if you need any other > > information > > > or explanation. > > > Thank you very much. > > > > > > Shi > > > --- Matthew Knepley wrote: > > > > > > > You really have to give us the log summary > > output. > > > > None of the relevant > > > > numbers are in your summary. > > > > > > > > Thanks, > > > > > > > > Matt > > > > > > > > On 2/9/07, Shi Jin wrote: > > > > > > > > > > Dear Barry, > > > > > > > > > > Thank you. > > > > > I actually have done the staging already. > > > > > I summarized the timing of the runs in google > > > > online > > > > > spreadsheets. I have two runs. > > > > > 1. with 400,000 finite elements: > > > > > > > > > > > > > > > http://spreadsheets.google.com/pub?key=pZHoqlL60quZeDZlucTjEIA > > > > > 2. with 1,600,000 finite elements: > > > > > > > > > > > > > > > http://spreadsheets.google.com/pub?key=pZHoqlL60quZcCVLAqmzqQQ > > > > > > > > > > If you can take a look at them and give me > > some > > > > > advice, I will be deeply grateful. > > > > > > > > > > Shi > > > > > --- Barry Smith wrote: > > > > > > > > > > > > > > > > > NO, NO, don't spend time stripping your > > code! > > > > > > Unproductive > > > > > > > > > > > > See the manul pages for > > > > PetscLogStageRegister(), > > > > > > PetscLogStagePush() and > > > > > > PetscLogStagePop(). All you need to do is > > > > maintain a > > > > > > seperate stage for each > > > > > > of your KSPSolves; in your case you'll > > create 3 > > > > > > stages. > > > > > > > > > > > > Barry > > > > > > > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > > > > > > > Thank you. > > > > > > > But my code has 10 calls to KSPSolve of > > three > > > > > > > different linear systems at each time > > update. > > > > > > Should I > > > > > > > strip it down to a single KSPSolve so that > > it > > > > is > > > > > > > easier to analysis? 
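If the row-by-row accumulation is, in effect, multiplying the assembled matrix against a known vector to build the right-hand side, the whole loop collapses into one library call, which is presumably the kind of replacement being hinted at here. A sketch under that assumption (A, u_known, b0 and b are placeholder names):

    /* b = A * u_known, done by the library with no per-row traffic */
    MatMult(A, u_known, b);

    /* or, if a constant part b0 is added in as well: b = b0 + A * u_known */
    MatMultAdd(A, u_known, b0, b);

Whether this applies depends on what is actually done with the row entries, which is exactly the question being asked.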
I might have the code > > dump > > > > the > > > > > > > Matrix and vector and write another code > > to > > > > read > > > > > > them > > > > > > > into and call KSPSolve. I don't know > > whether > > > > this > > > > > > is > > > > > > > worth doing or should I just send in the > > > > messy > > > > > > log > > > > > > > file of the whole run. > > > > > > > Thanks for any advice. > > > > > > > > > > > > > > Shi > > > > > > > > > > > > > > --- Barry Smith > > wrote: > > > > > > > > > > > > > > > > > > > > > > > Shi, > > > > > > > > > > > > > > > > There is never a better test problem > > then > > > > > > your > > > > > > > > actual problem. > > > > > > > > Send the results from running on 1, 4, > > and 8 > > > > > > > > processes with the options > > > > > > > > -log_summary -ksp_view (use the > > optimized > > > > > > version of > > > > > > > > PETSc (running > > > > > > > > config/configure.py --with-debugging=0)) > > > > > > > > > > > > > > > > Barry > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > > > > > > > > > > > Hi there, > > > > > > > > > > > > > > > > > > I am tuning our 3D FEM CFD code > > written > > > > with > > > > > > > > PETSc. > > > > > > > > > The code doesn't scale very well. For > > > > example, > > > > > > > > with 8 > > > > > > > > > processes on a linux cluster, the > > speedup > > > > we > > > > > > > > achieve > > > > > > > > > with a fairly large problem > > size(million > > > > of > > > > > > > > elements) > > > > > > > > > is only 3 to 4 using the Congugate > > > > gradient > > > > > > > > solver. We > > > > > > > > > can achieve a speed up of a 6.5 using > > a > > > > GMRes > > > > > > > > solver > > > > > > > > > but the wall clock time of a GMRes is > > > > longer > > > > > > than > > > > > > > > a CG > > > > > > > > > solver which indicates that CG is the > > > > faster > > > > > > > > solver > > > > > > > > > and it scales not as good as GMRes. Is > > > > this > > > > > > > > generally > > > > > > > > > true? > > > > > > > > > > > > > > > > > > I then went to the examples and find a > > 2D > > > > > > example > > > > > > > > of > > > > > > > > > KSPSolve (ex2.c). I let the code ran > > with > > > > a > > > > > > > > 1000x1000 > > > > > > > > > mesh and get a linear scaling of the > > CG > > > > solver > > > > > > and > > > > > > > > a > > > > > > > > > super linear scaling of the GMRes. > > These > > > > are > > > > > > both > > > > > > > > much > > > > > > > > > better than our code. However, I think > > the > > > > 2D > > > > > > > > nature > > > === message truncated === > > > > > ____________________________________________________________________________________ > Cheap talk? > Check out Yahoo! Messenger's low PC-to-Phone call rates. > http://voice.yahoo.com > > From zonexo at gmail.com Sat Feb 10 02:28:51 2007 From: zonexo at gmail.com (Ben Tay) Date: Sat, 10 Feb 2007 16:28:51 +0800 Subject: understanding the output from -info In-Reply-To: References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090724n73db6f8w574622903161eb4a@mail.gmail.com> <804ab5d40702090816qb6d1325g1d311a0eb53eec26@mail.gmail.com> <804ab5d40702091651h6265a510jf5d4ca46cd526876@mail.gmail.com> Message-ID: <804ab5d40702100028sf595a2apae8aba2fda9251f3@mail.gmail.com> Hi, I tried to use ex2f.F as a test code. I've changed the number n,m from 3 to 500 each. I ran the code using 1 processor and then with 4 processor. 
I then repeat the same with the following modification: do i=1,10 call KSPSolve(ksp,b,x,ierr) end do I've added to do loop to make the solving repeat 10 times. In both cases, the serial code is faster, e.g. 1 taking 2.4 min while the other 3.3 min. Here's the log_summary: ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./ex2f on a linux-mpi named atlas12.nus.edu.sg with 4 processors, by g0306332 Sat Feb 10 16:21:36 2007 Using Petsc Release Version 2.3.2, Patch 8, Tue Jan 2 14:33:59 PST 2007 HG revision: ebeddcedcc065e32fc252af32cf1d01ed4fc7a80 Max Max/Min Avg Total Time (sec): 2.213e+02 1.00051 2.212e+02 Objects: 5.500e+01 1.00000 5.500e+01 Flops: 4.718e+09 1.00019 4.718e+09 1.887e+10 Flops/sec: 2.134e+07 1.00070 2.133e+07 8.531e+07 Memory: 3.186e+07 1.00069 1.274e+08 MPI Messages: 1.832e+03 2.00000 1.374e+03 5.496e+03 MPI Message Lengths: 7.324e+06 2.00000 3.998e+03 2.197e+07 MPI Reductions: 7.112e+02 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 2.2120e+02 100.0% 1.8871e+10 100.0% 5.496e+03 100.0% 3.998e+03 100.0% 2.845e+03 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops/sec: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). %T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ ########################################################## # # # WARNING!!! # # # # This code was compiled with a debugging option, # # To get timing results run config/configure.py # # using --with-debugging=no, the performance will # # be generally two or three times faster. # # # ########################################################## ########################################################## # # # WARNING!!! # # # # This code was run without the PreLoadBegin() # # macros. To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. 
# ########################################################## Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 915 1.0 4.4291e+01 1.3 1.50e+07 1.3 5.5e+03 4.0e+03 0.0e+00 18 11100100 0 18 11100100 0 46 MatSolve 915 1.0 1.5684e+01 1.1 3.56e+07 1.1 0.0e+00 0.0e+00 0.0e+00 7 11 0 0 0 7 11 0 0 0 131 MatLUFactorNum 1 1.0 5.1654e-02 1.4 1.48e+07 1.4 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 43 MatILUFactorSym 1 1.0 1.6838e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyBegin 1 1.0 3.2428e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyEnd 1 1.0 1.3120e+00 1.1 0.00e+00 0.0 6.0e+00 2.0e+03 1.3e+01 1 0 0 0 0 1 0 0 0 0 0 MatGetOrdering 1 1.0 4.1590e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecMDot 885 1.0 8.5091e+01 1.1 2.27e+07 1.1 0.0e+00 0.0e+00 8.8e+02 36 36 0 0 31 36 36 0 0 31 80 VecNorm 916 1.0 6.6747e+01 1.1 1.81e+06 1.1 0.0e+00 0.0e+00 9.2e+02 29 2 0 0 32 29 2 0 0 32 7 VecScale 915 1.0 1.1430e+00 2.2 1.12e+08 2.2 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 200 VecCopy 30 1.0 1.2816e-01 5.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 947 1.0 7.8979e-01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAXPY 60 1.0 5.5332e-02 1.1 1.51e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 542 VecMAXPY 915 1.0 1.5004e+01 1.3 1.54e+08 1.3 0.0e+00 0.0e+00 0.0e+00 6 38 0 0 0 6 38 0 0 0 483 VecScatterBegin 915 1.0 9.0358e-02 1.4 0.00e+00 0.0 5.5e+03 4.0e+03 0.0e+00 0 0100100 0 0 0100100 0 0 VecScatterEnd 915 1.0 3.5136e+01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 0 VecNormalize 915 1.0 6.7272e+01 1.0 2.68e+06 1.0 0.0e+00 0.0e+00 9.2e+02 30 4 0 0 32 30 4 0 0 32 10 KSPGMRESOrthog 885 1.0 9.8478e+01 1.1 3.87e+07 1.1 0.0e+00 0.0e+00 8.8e+02 42 72 0 0 31 42 72 0 0 31 138 KSPSetup 2 1.0 6.1918e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 1 1.0 2.1892e+02 1.0 2.15e+07 1.0 5.5e+03 4.0e+03 2.8e+03 99100100100 99 99100100100 99 86 PCSetUp 2 1.0 7.3292e-02 1.3 9.84e+06 1.3 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 30 PCSetUpOnBlocks 1 1.0 7.2706e-02 1.3 9.97e+06 1.3 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 31 PCApply 915 1.0 1.6508e+01 1.1 3.27e+07 1.1 0.0e+00 0.0e+00 9.2e+02 7 11 0 0 32 7 11 0 0 32 124 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. 
--- Event Stage 0: Main Stage Matrix 4 4 252008 0 Index Set 5 5 753096 0 Vec 41 41 18519984 0 Vec Scatter 1 1 0 0 Krylov Solver 2 2 16880 0 Preconditioner 2 2 196 0 ======================================================================================================================== Average time to get PetscTime(): 1.09673e-06 Average time for MPI_Barrier(): 4.18186e-05 Average time for zero size MPI_Send(): 2.62856e-05 OptionTable: -log_summary Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 sizeof(PetscScalar) 8 Configure run at: Thu Jan 18 12:23:31 2007 Configure options: --with-vendor-compilers=intel --with-x=0 --with-shared --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 --with-mpi-dir=/opt/mpich/myrinet/intel/ ----------------------------------------- Libraries compiled on Thu Jan 18 12:24:41 SGT 2007 on atlas1.nus.edu.sg Machine characteristics: Linux atlas1.nus.edu.sg 2.4.21-20.ELsmp #1 SMP Wed Sep 8 17:29:34 GMT 2004 i686 i686 i386 GNU/Linux Using PETSc directory: /nas/lsftmp/g0306332/petsc-2.3.2-p8 Using PETSc arch: linux-mpif90 ----------------------------------------- Using C compiler: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g Using Fortran compiler: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g -w90 -w ----------------------------------------- Using include paths: -I/nas/lsftmp/g0306332/petsc-2.3.2-p8-I/nas/lsftmp/g0306332/petsc- 2.3.2-p8/bmake/linux-mpif90 -I/nas/lsftmp/g0306332/petsc-2.3.2-p8/include -I/opt/mpich/myrinet/intel/include ------------------------------------------ Using C linker: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g Using Fortran linker: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g -w90 -w Using libraries: -Wl,-rpath,/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 -L/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -Wl,-rpath,/lsftmp/g0306332/inter/mkl/lib/32 -L/lsftmp/g0306332/inter/mkl/lib/32 -lmkl_lapack -lmkl_ia32 -lguide -lPEPCF90 -Wl,-rpath,/opt/intel/compiler70/ia32/lib -Wl,-rpath,/opt/mpich/myrinet/intel/lib -L/opt/mpich/myrinet/intel/lib -Wl,-rpath,-rpath -Wl,-rpath,-ldl -L-ldl -lmpich -Wl,-rpath,-L -lgm -lpthread -Wl,-rpath,/opt/intel/compiler70/ia32/lib -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib -Wl,-rpath,/usr/lib -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa -lunwind -ldl -lmpichf90 -Wl,-rpath,/opt/gm/lib -L/opt/gm/lib -lPEPCF90 -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib -Wl,-rpath,/usr/lib -L/usr/lib -lintrins -lIEPCF90 -lF90 -lm -Wl,-rpath,\ -Wl,-rpath,\ -L\ -ldl -lmpich -Wl,-rpath,\ -L\ -lgm -lpthread -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa -lunwind -ldl ------------------------------------------ So is there something wrong with the server's mpi implementation? Thank you. On 2/10/07, Satish Balay wrote: > > Looks like MatMult = 24sec Out of this the scatter time is: 22sec. > Either something is wrong with your run - or MPI is really broken.. > > Satish > > > > > MatMult 3927 1.0 2.4071e+01 1.3 6.14e+06 1.4 2.4e+04 > 1.3e+03 > > > > VecScatterBegin 3927 1.0 2.8672e-01 3.9 0.00e+00 0.0 2.4e+04 > 1.3e+03 > > > > VecScatterEnd 3927 1.0 2.2135e+01 1.5 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From zonexo at gmail.com Sat Feb 10 03:17:52 2007 From: zonexo at gmail.com (Ben Tay) Date: Sat, 10 Feb 2007 17:17:52 +0800 Subject: understanding the output from -info In-Reply-To: <804ab5d40702100028sf595a2apae8aba2fda9251f3@mail.gmail.com> References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090724n73db6f8w574622903161eb4a@mail.gmail.com> <804ab5d40702090816qb6d1325g1d311a0eb53eec26@mail.gmail.com> <804ab5d40702091651h6265a510jf5d4ca46cd526876@mail.gmail.com> <804ab5d40702100028sf595a2apae8aba2fda9251f3@mail.gmail.com> Message-ID: <804ab5d40702100117i5977f5bh9b161c026f16a32a@mail.gmail.com> Hi, I've repeated the test with n,m = 800. Now serial takes around 11mins while parallel with 4 processors took 6mins. Does it mean that the problem must be pretty large before it is more superior to use parallel? Moreover 800x800 means there's 640000 unknowns. My problem is a 2D CFD code which typically has 200x80=16000 unknowns. Does it mean that I won't be able to benefit from running in parallel? Btw, this is the parallel's log_summary: Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 1265 1.0 7.0615e+01 1.2 3.22e+07 1.2 7.6e+03 6.4e+03 0.0e+00 16 11100100 0 16 11100100 0 103 MatSolve 1265 1.0 4.7820e+01 1.2 4.60e+07 1.2 0.0e+00 0.0e+00 0.0e+00 11 11 0 0 0 11 11 0 0 0 152 MatLUFactorNum 1 1.0 2.5703e-01 2.3 1.27e+07 2.3 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 22 MatILUFactorSym 1 1.0 1.8933e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyBegin 1 1.0 4.2153e-01 3.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyEnd 1 1.0 3.6475e-01 1.5 0.00e+00 0.0 6.0e+00 3.2e+03 1.3e+01 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 1 1.0 1.2088e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecMDot 1224 1.0 1.5314e+02 1.2 4.63e+07 1.2 0.0e+00 0.0e+00 1.2e+03 36 36 0 0 31 36 36 0 0 31 158 VecNorm 1266 1.0 1.0215e+02 1.1 4.31e+06 1.1 0.0e+00 0.0e+00 1.3e+03 24 2 0 0 33 24 2 0 0 33 16 VecScale 1265 1.0 3.7467e+00 1.5 8.34e+07 1.5 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1 1 0 0 0 216 VecCopy 41 1.0 2.5530e-01 2.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 1308 1.0 3.2717e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 VecAXPY 82 1.0 5.3338e-01 2.8 1.40e+08 2.8 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 197 VecMAXPY 1265 1.0 4.6234e+01 1.2 1.74e+08 1.2 0.0e+00 0.0e+00 0.0e+00 10 38 0 0 0 10 38 0 0 0 557 VecScatterBegin 1265 1.0 1.5684e-01 1.6 0.00e+00 0.0 7.6e+03 6.4e+03 0.0e+00 0 0100100 0 0 0100100 0 0 VecScatterEnd 1265 1.0 4.3167e+01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 VecNormalize 1265 1.0 1.0459e+02 1.1 6.21e+06 1.1 0.0e+00 0.0e+00 1.3e+03 25 4 0 0 32 25 4 0 0 32 23 KSPGMRESOrthog 1224 1.0 1.9035e+02 1.1 7.00e+07 1.1 0.0e+00 0.0e+00 1.2e+03 45 72 0 0 31 45 72 0 0 31 254 KSPSetup 2 1.0 5.1674e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 1 1.0 4.0269e+02 1.0 4.16e+07 1.0 7.6e+03 6.4e+03 3.9e+03 99100100100 99 99100100100 99 166 PCSetUp 2 1.0 4.5924e-01 2.6 8.23e+06 2.6 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 12 PCSetUpOnBlocks 1 1.0 4.5847e-01 2.6 8.26e+06 2.6 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 13 PCApply 1265 1.0 5.0990e+01 1.2 4.33e+07 1.2 
0.0e+00 0.0e+00 1.3e+03 12 11 0 0 32 12 11 0 0 32 143 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. --- Event Stage 0: Main Stage Matrix 4 4 643208 0 Index Set 5 5 1924296 0 Vec 41 41 47379984 0 Vec Scatter 1 1 0 0 Krylov Solver 2 2 16880 0 Preconditioner 2 2 196 0 ======================================================================================================================== Average time to get PetscTime(): 1.00136e-06 Average time for MPI_Barrier(): 4.00066e-05 Average time for zero size MPI_Send(): 1.70469e-05 OptionTable: -log_summary Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 sizeof(PetscScalar) 8 Configure run at: Thu Jan 18 12:23:31 2007 Configure options: --with-vendor-compilers=intel --with-x=0 --with-shared --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 --with-mpi-dir=/opt/mpich/myrinet/intel/ ----------------------------------------- On 2/10/07, Ben Tay wrote: > > Hi, > > I tried to use ex2f.F as a test code. I've changed the number n,m from 3 > to 500 each. I ran the code using 1 processor and then with 4 processor. I > then repeat the same with the following modification: > > > do i=1,10 > > call KSPSolve(ksp,b,x,ierr) > > end do > I've added to do loop to make the solving repeat 10 times. > > In both cases, the serial code is faster, e.g. 1 taking 2.4 min while the > other 3.3 min. > > Here's the log_summary: > > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./ex2f on a linux-mpi named atlas12.nus.edu.sg with 4 processors, by > g0306332 Sat Feb 10 16:21:36 2007 > Using Petsc Release Version 2.3.2, Patch 8, Tue Jan 2 14:33:59 PST 2007 > HG revision: ebeddcedcc065e32fc252af32cf1d01ed4fc7a80 > > Max Max/Min Avg Total > Time (sec): 2.213e+02 1.00051 2.212e+02 > Objects: 5.500e+01 1.00000 5.500e+01 > Flops: 4.718e+09 1.00019 4.718e+09 1.887e+10 > Flops/sec: 2.134e+07 1.00070 2.133e+07 8.531e+07 > > Memory: 3.186e+07 1.00069 1.274e+08 > MPI Messages: 1.832e+03 2.00000 1.374e+03 5.496e+03 > MPI Message Lengths: 7.324e+06 2.00000 3.998e+03 2.197e+07 > MPI Reductions: 7.112e+02 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N > --> 2N flops > and VecAXPY() for complex vectors of length N > --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages > --- -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts > %Total Avg %Total counts %Total > 0: Main Stage: 2.2120e+02 100.0% 1.8871e+10 100.0% 5.496e+03 > 100.0% 3.998e+03 100.0% 2.845e+03 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). 
> %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths > in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time > over all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > ########################################################## > # # > # WARNING!!! # > # # > # This code was compiled with a debugging option, # > # To get timing results run config/configure.py # > # using --with-debugging=no, the performance will # > # be generally two or three times faster. # > # # > ########################################################## > > > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. # > ########################################################## > > > Event Count Time (sec) > Flops/sec --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 915 1.0 4.4291e+01 1.3 1.50e+07 1.3 5.5e+03 4.0e+03 > 0.0e+00 18 11100100 0 18 11100100 0 46 > MatSolve 915 1.0 1.5684e+01 1.1 3.56e+07 1.1 0.0e+00 0.0e+00 > 0.0e+00 7 11 0 0 0 7 11 0 0 0 131 > MatLUFactorNum 1 1.0 5.1654e-02 1.4 1.48e+07 1.4 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 43 > MatILUFactorSym 1 1.0 1.6838e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 3.2428e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 1.3120e+00 1.1 0.00e+00 0.0 6.0e+00 2.0e+03 > 1.3e+01 1 0 0 0 0 1 0 0 0 0 0 > MatGetOrdering 1 1.0 4.1590e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 885 1.0 8.5091e+01 1.1 2.27e+07 1.1 0.0e+00 0.0e+00 > 8.8e+02 36 36 0 0 31 36 36 0 0 31 80 > VecNorm 916 1.0 6.6747e+01 1.1 1.81e+06 1.1 0.0e+00 0.0e+00 > 9.2e+02 29 2 0 0 32 29 2 0 0 32 7 > VecScale 915 1.0 1.1430e+00 2.2 1.12e+08 2.2 0.0e+00 0.0e+00 > 0.0e+00 0 1 0 0 0 0 1 0 0 0 200 > VecCopy 30 1.0 1.2816e-01 5.7 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 947 1.0 7.8979e-01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 60 1.0 5.5332e-02 1.1 1.51e+08 1.1 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 542 > VecMAXPY 915 1.0 1.5004e+01 1.3 1.54e+08 1.3 0.0e+00 0.0e+00 > 0.0e+00 6 38 0 0 0 6 38 0 0 0 483 > VecScatterBegin 915 1.0 9.0358e-02 1.4 0.00e+00 0.0 5.5e+03 4.0e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0 > VecScatterEnd 915 1.0 3.5136e+01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 14 0 0 0 0 14 0 0 0 0 0 > VecNormalize 915 1.0 6.7272e+01 1.0 2.68e+06 1.0 0.0e+00 0.0e+00 > 9.2e+02 30 4 0 0 32 30 4 0 0 32 10 > KSPGMRESOrthog 885 1.0 9.8478e+01 1.1 3.87e+07 1.1 0.0e+00 0.0e+00 > 8.8e+02 42 72 0 0 31 42 72 0 0 31 138 > KSPSetup 2 1.0 6.1918e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+01 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 2.1892e+02 1.0 2.15e+07 1.0 5.5e+03 4.0e+03 > 2.8e+03 99100100100 99 99100100100 99 86 > PCSetUp 2 1.0 7.3292e-02 1.3 9.84e+06 1.3 0.0e+00 0.0e+00 > 6.0e+00 0 0 0 0 0 0 0 0 0 0 30 > PCSetUpOnBlocks 1 1.0 
7.2706e-02 1.3 9.97e+06 1.3 0.0e+00 0.0e+00 > 4.0e+00 0 0 0 0 0 0 0 0 0 0 31 > PCApply 915 1.0 1.6508e+01 1.1 3.27e+07 1.1 0.0e+00 0.0e+00 > 9.2e+02 7 11 0 0 32 7 11 0 0 32 124 > ------------------------------------------------------------------------------------------------------------------------ > > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. > > --- Event Stage 0: Main Stage > > Matrix 4 4 252008 0 > Index Set 5 5 753096 0 > Vec 41 41 18519984 0 > Vec Scatter 1 1 0 0 > Krylov Solver 2 2 16880 0 > Preconditioner 2 2 196 0 > ======================================================================================================================== > > Average time to get PetscTime(): 1.09673e-06 > Average time for MPI_Barrier(): 4.18186e-05 > Average time for zero size MPI_Send(): 2.62856e-05 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 > sizeof(PetscScalar) 8 > Configure run at: Thu Jan 18 12:23:31 2007 > Configure options: --with-vendor-compilers=intel --with-x=0 --with-shared > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 > --with-mpi-dir=/opt/mpich/myrinet/intel/ > ----------------------------------------- > Libraries compiled on Thu Jan 18 12:24:41 SGT 2007 on atlas1.nus.edu.sg > Machine characteristics: Linux atlas1.nus.edu.sg 2.4.21-20.ELsmp #1 SMP > Wed Sep 8 17:29:34 GMT 2004 i686 i686 i386 GNU/Linux > Using PETSc directory: /nas/lsftmp/g0306332/petsc-2.3.2-p8 > Using PETSc arch: linux-mpif90 > ----------------------------------------- > Using C compiler: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g > Using Fortran compiler: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g > -w90 -w > ----------------------------------------- > Using include paths: -I/nas/lsftmp/g0306332/petsc- 2.3.2-p8-I/nas/lsftmp/g0306332/petsc- > 2.3.2-p8/bmake/linux-mpif90 -I/nas/lsftmp/g0306332/petsc-2.3.2-p8/include > -I/opt/mpich/myrinet/intel/include > ------------------------------------------ > Using C linker: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g > Using Fortran linker: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g > -w90 -w > Using libraries: -Wl,-rpath,/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 > -L/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 -lpetscts > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > -Wl,-rpath,/lsftmp/g0306332/inter/mkl/lib/32 > -L/lsftmp/g0306332/inter/mkl/lib/32 -lmkl_lapack -lmkl_ia32 -lguide > -lPEPCF90 -Wl,-rpath,/opt/intel/compiler70/ia32/lib > -Wl,-rpath,/opt/mpich/myrinet/intel/lib -L/opt/mpich/myrinet/intel/lib > -Wl,-rpath,-rpath -Wl,-rpath,-ldl -L-ldl -lmpich -Wl,-rpath,-L -lgm > -lpthread -Wl,-rpath,/opt/intel/compiler70/ia32/lib > -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib > -Wl,-rpath,/usr/lib -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa > -lunwind -ldl -lmpichf90 -Wl,-rpath,/opt/gm/lib -L/opt/gm/lib -lPEPCF90 > -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib > -Wl,-rpath,/usr/lib -L/usr/lib -lintrins -lIEPCF90 -lF90 -lm -Wl,-rpath,\ > -Wl,-rpath,\ -L\ -ldl -lmpich -Wl,-rpath,\ -L\ -lgm -lpthread > -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib > -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa -lunwind -ldl > ------------------------------------------ > > So is there something wrong with the server's mpi implementation? 
> > Thank you. > > > > On 2/10/07, Satish Balay wrote: > > > > Looks like MatMult = 24sec Out of this the scatter time is: 22sec. > > Either something is wrong with your run - or MPI is really broken.. > > > > Satish > > > > > > > MatMult 3927 1.0 2.4071e+01 1.3 6.14e+06 1.4 2.4e+04 > > 1.3e+03 > > > > > VecScatterBegin 3927 1.0 2.8672e-01 3.9 0.00e+00 0.0 2.4e+04 > > 1.3e+03 > > > > > VecScatterEnd 3927 1.0 2.2135e+01 1.5 0.00e+00 0.0 0.0e+00 > > 0.0e+00 > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at mcs.anl.gov Sat Feb 10 13:06:15 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Sat, 10 Feb 2007 13:06:15 -0600 (CST) Subject: understanding the output from -info In-Reply-To: <804ab5d40702100117i5977f5bh9b161c026f16a32a@mail.gmail.com> References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090724n73db6f8w574622903161eb4a@mail.gmail.com> <804ab5d40702090816qb6d1325g1d311a0eb53eec26@mail.gmail.com> <804ab5d40702091651h6265a510jf5d4ca46cd526876@mail.gmail.com> <804ab5d40702100028sf595a2apae8aba2fda9251f3@mail.gmail.com> <804ab5d40702100117i5977f5bh9b161c026f16a32a@mail.gmail.com> Message-ID: On Sat, 10 Feb 2007, Ben Tay wrote: > Hi, > > I've repeated the test with n,m = 800. Now serial takes around 11mins while > parallel with 4 processors took 6mins. Does it mean that the problem must be > pretty large before it is more superior to use parallel? Moreover 800x800 > means there's 640000 unknowns. My problem is a 2D CFD code which typically > has 200x80=16000 unknowns. Does it mean that I won't be able to benefit from ^^^^^^^^^^^ You'll never get much performance past 2 processors; its not even worth all the work of having a parallel code in this case. I'd just optimize the heck out of the serial code. Barry > running in parallel? 
> > Btw, this is the parallel's log_summary: > > > Event Count Time (sec) > Flops/sec --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 1265 1.0 7.0615e+01 1.2 3.22e+07 1.2 7.6e+03 6.4e+03 > 0.0e+00 16 11100100 0 16 11100100 0 103 > MatSolve 1265 1.0 4.7820e+01 1.2 4.60e+07 1.2 0.0e+00 0.0e+00 > 0.0e+00 11 11 0 0 0 11 11 0 0 0 152 > MatLUFactorNum 1 1.0 2.5703e-01 2.3 1.27e+07 2.3 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 22 > MatILUFactorSym 1 1.0 1.8933e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 4.2153e-01 3.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 3.6475e-01 1.5 0.00e+00 0.0 6.0e+00 3.2e+03 > 1.3e+01 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 1.2088e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 1224 1.0 1.5314e+02 1.2 4.63e+07 1.2 0.0e+00 0.0e+00 > 1.2e+03 36 36 0 0 31 36 36 0 0 31 158 > VecNorm 1266 1.0 1.0215e+02 1.1 4.31e+06 1.1 0.0e+00 0.0e+00 > 1.3e+03 24 2 0 0 33 24 2 0 0 33 16 > VecScale 1265 1.0 3.7467e+00 1.5 8.34e+07 1.5 0.0e+00 0.0e+00 > 0.0e+00 1 1 0 0 0 1 1 0 0 0 216 > VecCopy 41 1.0 2.5530e-01 2.8 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 1308 1.0 3.2717e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecAXPY 82 1.0 5.3338e-01 2.8 1.40e+08 2.8 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 197 > VecMAXPY 1265 1.0 4.6234e+01 1.2 1.74e+08 1.2 0.0e+00 0.0e+00 > 0.0e+00 10 38 0 0 0 10 38 0 0 0 557 > VecScatterBegin 1265 1.0 1.5684e-01 1.6 0.00e+00 0.0 7.6e+03 6.4e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0 > VecScatterEnd 1265 1.0 4.3167e+01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 > VecNormalize 1265 1.0 1.0459e+02 1.1 6.21e+06 1.1 0.0e+00 0.0e+00 > 1.3e+03 25 4 0 0 32 25 4 0 0 32 23 > KSPGMRESOrthog 1224 1.0 1.9035e+02 1.1 7.00e+07 1.1 0.0e+00 0.0e+00 > 1.2e+03 45 72 0 0 31 45 72 0 0 31 254 > KSPSetup 2 1.0 5.1674e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+01 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 4.0269e+02 1.0 4.16e+07 1.0 7.6e+03 6.4e+03 > 3.9e+03 99100100100 99 99100100100 99 166 > PCSetUp 2 1.0 4.5924e-01 2.6 8.23e+06 2.6 0.0e+00 0.0e+00 > 6.0e+00 0 0 0 0 0 0 0 0 0 0 12 > PCSetUpOnBlocks 1 1.0 4.5847e-01 2.6 8.26e+06 2.6 0.0e+00 0.0e+00 > 4.0e+00 0 0 0 0 0 0 0 0 0 0 13 > PCApply 1265 1.0 5.0990e+01 1.2 4.33e+07 1.2 0.0e+00 0.0e+00 > 1.3e+03 12 11 0 0 32 12 11 0 0 32 143 > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. 
> > --- Event Stage 0: Main Stage > > Matrix 4 4 643208 0 > Index Set 5 5 1924296 0 > Vec 41 41 47379984 0 > Vec Scatter 1 1 0 0 > Krylov Solver 2 2 16880 0 > Preconditioner 2 2 196 0 > ======================================================================================================================== > Average time to get PetscTime(): 1.00136e-06 > Average time for MPI_Barrier(): 4.00066e-05 > Average time for zero size MPI_Send(): 1.70469e-05 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 > sizeof(PetscScalar) 8 > Configure run at: Thu Jan 18 12:23:31 2007 > Configure options: --with-vendor-compilers=intel --with-x=0 --with-shared > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 > --with-mpi-dir=/opt/mpich/myrinet/intel/ > ----------------------------------------- > > > > > > > > On 2/10/07, Ben Tay wrote: > > > > Hi, > > > > I tried to use ex2f.F as a test code. I've changed the number n,m from 3 > > to 500 each. I ran the code using 1 processor and then with 4 processor. I > > then repeat the same with the following modification: > > > > > > do i=1,10 > > > > call KSPSolve(ksp,b,x,ierr) > > > > end do > > I've added to do loop to make the solving repeat 10 times. > > > > In both cases, the serial code is faster, e.g. 1 taking 2.4 min while the > > other 3.3 min. > > > > Here's the log_summary: > > > > > > ---------------------------------------------- PETSc Performance Summary: > > ---------------------------------------------- > > > > ./ex2f on a linux-mpi named atlas12.nus.edu.sg with 4 processors, by > > g0306332 Sat Feb 10 16:21:36 2007 > > Using Petsc Release Version 2.3.2, Patch 8, Tue Jan 2 14:33:59 PST 2007 > > HG revision: ebeddcedcc065e32fc252af32cf1d01ed4fc7a80 > > > > Max Max/Min Avg Total > > Time (sec): 2.213e+02 1.00051 2.212e+02 > > Objects: 5.500e+01 1.00000 5.500e+01 > > Flops: 4.718e+09 1.00019 4.718e+09 1.887e+10 > > Flops/sec: 2.134e+07 1.00070 2.133e+07 8.531e+07 > > > > Memory: 3.186e+07 1.00069 1.274e+08 > > MPI Messages: 1.832e+03 2.00000 1.374e+03 5.496e+03 > > MPI Message Lengths: 7.324e+06 2.00000 3.998e+03 2.197e+07 > > MPI Reductions: 7.112e+02 1.00000 > > > > Flop counting convention: 1 flop = 1 real number operation of type > > (multiply/divide/add/subtract) > > e.g., VecAXPY() for real vectors of length N > > --> 2N flops > > and VecAXPY() for complex vectors of length N > > --> 8N flops > > > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages > > --- -- Message Lengths -- -- Reductions -- > > Avg %Total Avg %Total counts > > %Total Avg %Total counts %Total > > 0: Main Stage: 2.2120e+02 100.0% 1.8871e+10 100.0% 5.496e+03 > > 100.0% 3.998e+03 100.0% 2.845e+03 100.0% > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > See the 'Profiling' chapter of the users' manual for details on > > interpreting output. > > Phase summary info: > > Count: number of times phase was executed > > Time and Flops/sec: Max - maximum over all processors > > Ratio - ratio of maximum to minimum over all > > processors > > Mess: number of messages sent > > Avg. len: average message length > > Reduct: number of global reductions > > Global: entire computation > > Stage: stages of a computation. Set stages with PetscLogStagePush() and > > PetscLogStagePop(). 
> > %T - percent time in this phase %F - percent flops in this > > phase > > %M - percent messages in this phase %L - percent message lengths > > in this phase > > %R - percent reductions in this phase > > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time > > over all processors) > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > ########################################################## > > # # > > # WARNING!!! # > > # # > > # This code was compiled with a debugging option, # > > # To get timing results run config/configure.py # > > # using --with-debugging=no, the performance will # > > # be generally two or three times faster. # > > # # > > ########################################################## > > > > > > > > > > ########################################################## > > # # > > # WARNING!!! # > > # # > > # This code was run without the PreLoadBegin() # > > # macros. To get timing results we always recommend # > > # preloading. otherwise timing numbers may be # > > # meaningless. # > > ########################################################## > > > > > > Event Count Time (sec) > > Flops/sec --- Global --- --- Stage --- Total > > Max Ratio Max Ratio Max Ratio Mess Avg len > > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > --- Event Stage 0: Main Stage > > > > MatMult 915 1.0 4.4291e+01 1.3 1.50e+07 1.3 5.5e+03 4.0e+03 > > 0.0e+00 18 11100100 0 18 11100100 0 46 > > MatSolve 915 1.0 1.5684e+01 1.1 3.56e+07 1.1 0.0e+00 0.0e+00 > > 0.0e+00 7 11 0 0 0 7 11 0 0 0 131 > > MatLUFactorNum 1 1.0 5.1654e-02 1.4 1.48e+07 1.4 0.0e+00 0.0e+00 > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 43 > > MatILUFactorSym 1 1.0 1.6838e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatAssemblyBegin 1 1.0 3.2428e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatAssemblyEnd 1 1.0 1.3120e+00 1.1 0.00e+00 0.0 6.0e+00 2.0e+03 > > 1.3e+01 1 0 0 0 0 1 0 0 0 0 0 > > MatGetOrdering 1 1.0 4.1590e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > VecMDot 885 1.0 8.5091e+01 1.1 2.27e+07 1.1 0.0e+00 0.0e+00 > > 8.8e+02 36 36 0 0 31 36 36 0 0 31 80 > > VecNorm 916 1.0 6.6747e+01 1.1 1.81e+06 1.1 0.0e+00 0.0e+00 > > 9.2e+02 29 2 0 0 32 29 2 0 0 32 7 > > VecScale 915 1.0 1.1430e+00 2.2 1.12e+08 2.2 0.0e+00 0.0e+00 > > 0.0e+00 0 1 0 0 0 0 1 0 0 0 200 > > VecCopy 30 1.0 1.2816e-01 5.7 0.00e+00 0.0 0.0e+00 0.0e+00 > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > VecSet 947 1.0 7.8979e-01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > VecAXPY 60 1.0 5.5332e-02 1.1 1.51e+08 1.1 0.0e+00 0.0e+00 > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 542 > > VecMAXPY 915 1.0 1.5004e+01 1.3 1.54e+08 1.3 0.0e+00 0.0e+00 > > 0.0e+00 6 38 0 0 0 6 38 0 0 0 483 > > VecScatterBegin 915 1.0 9.0358e-02 1.4 0.00e+00 0.0 5.5e+03 4.0e+03 > > 0.0e+00 0 0100100 0 0 0100100 0 0 > > VecScatterEnd 915 1.0 3.5136e+01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 > > 0.0e+00 14 0 0 0 0 14 0 0 0 0 0 > > VecNormalize 915 1.0 6.7272e+01 1.0 2.68e+06 1.0 0.0e+00 0.0e+00 > > 9.2e+02 30 4 0 0 32 30 4 0 0 32 10 > > KSPGMRESOrthog 885 1.0 9.8478e+01 1.1 3.87e+07 1.1 0.0e+00 0.0e+00 > > 8.8e+02 42 72 0 0 31 42 72 0 0 31 138 > > KSPSetup 2 1.0 6.1918e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > > 1.0e+01 0 0 0 0 0 0 0 0 0 0 0 > > KSPSolve 1 1.0 2.1892e+02 1.0 2.15e+07 1.0 
5.5e+03 4.0e+03 > > 2.8e+03 99100100100 99 99100100100 99 86 > > PCSetUp 2 1.0 7.3292e-02 1.3 9.84e+06 1.3 0.0e+00 0.0e+00 > > 6.0e+00 0 0 0 0 0 0 0 0 0 0 30 > > PCSetUpOnBlocks 1 1.0 7.2706e-02 1.3 9.97e+06 1.3 0.0e+00 0.0e+00 > > 4.0e+00 0 0 0 0 0 0 0 0 0 0 31 > > PCApply 915 1.0 1.6508e+01 1.1 3.27e+07 1.1 0.0e+00 0.0e+00 > > 9.2e+02 7 11 0 0 32 7 11 0 0 32 124 > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > Memory usage is given in bytes: > > > > Object Type Creations Destructions Memory Descendants' Mem. > > > > --- Event Stage 0: Main Stage > > > > Matrix 4 4 252008 0 > > Index Set 5 5 753096 0 > > Vec 41 41 18519984 0 > > Vec Scatter 1 1 0 0 > > Krylov Solver 2 2 16880 0 > > Preconditioner 2 2 196 0 > > ======================================================================================================================== > > > > Average time to get PetscTime(): 1.09673e-06 > > Average time for MPI_Barrier(): 4.18186e-05 > > Average time for zero size MPI_Send(): 2.62856e-05 > > OptionTable: -log_summary > > Compiled without FORTRAN kernels > > Compiled with full precision matrices (default) > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 > > sizeof(PetscScalar) 8 > > Configure run at: Thu Jan 18 12:23:31 2007 > > Configure options: --with-vendor-compilers=intel --with-x=0 --with-shared > > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 > > --with-mpi-dir=/opt/mpich/myrinet/intel/ > > ----------------------------------------- > > Libraries compiled on Thu Jan 18 12:24:41 SGT 2007 on atlas1.nus.edu.sg > > Machine characteristics: Linux atlas1.nus.edu.sg 2.4.21-20.ELsmp #1 SMP > > Wed Sep 8 17:29:34 GMT 2004 i686 i686 i386 GNU/Linux > > Using PETSc directory: /nas/lsftmp/g0306332/petsc-2.3.2-p8 > > Using PETSc arch: linux-mpif90 > > ----------------------------------------- > > Using C compiler: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g > > Using Fortran compiler: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g > > -w90 -w > > ----------------------------------------- > > Using include paths: -I/nas/lsftmp/g0306332/petsc- > > 2.3.2-p8-I/nas/lsftmp/g0306332/petsc- > > 2.3.2-p8/bmake/linux-mpif90 -I/nas/lsftmp/g0306332/petsc-2.3.2-p8/include > > -I/opt/mpich/myrinet/intel/include > > ------------------------------------------ > > Using C linker: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g > > Using Fortran linker: /opt/mpich/myrinet/intel/bin/mpif90 -I. 
-fPIC -g > > -w90 -w > > Using libraries: > > -Wl,-rpath,/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 > > -L/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 -lpetscts > > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > > -Wl,-rpath,/lsftmp/g0306332/inter/mkl/lib/32 > > -L/lsftmp/g0306332/inter/mkl/lib/32 -lmkl_lapack -lmkl_ia32 -lguide > > -lPEPCF90 -Wl,-rpath,/opt/intel/compiler70/ia32/lib > > -Wl,-rpath,/opt/mpich/myrinet/intel/lib -L/opt/mpich/myrinet/intel/lib > > -Wl,-rpath,-rpath -Wl,-rpath,-ldl -L-ldl -lmpich -Wl,-rpath,-L -lgm > > -lpthread -Wl,-rpath,/opt/intel/compiler70/ia32/lib > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib > > -Wl,-rpath,/usr/lib -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa > > -lunwind -ldl -lmpichf90 -Wl,-rpath,/opt/gm/lib -L/opt/gm/lib -lPEPCF90 > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib > > -Wl,-rpath,/usr/lib -L/usr/lib -lintrins -lIEPCF90 -lF90 -lm -Wl,-rpath,\ > > -Wl,-rpath,\ -L\ -ldl -lmpich -Wl,-rpath,\ -L\ -lgm -lpthread > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib -L/opt/intel/compiler70/ia32/lib > > -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa -lunwind -ldl > > ------------------------------------------ > > > > So is there something wrong with the server's mpi implementation? > > > > Thank you. > > > > > > > > On 2/10/07, Satish Balay wrote: > > > > > > Looks like MatMult = 24sec Out of this the scatter time is: 22sec. > > > Either something is wrong with your run - or MPI is really broken.. > > > > > > Satish > > > > > > > > > MatMult 3927 1.0 2.4071e+01 1.3 6.14e+06 1.4 2.4e+04 > > > 1.3e+03 > > > > > > VecScatterBegin 3927 1.0 2.8672e-01 3.9 0.00e+00 0.0 2.4e+04 > > > 1.3e+03 > > > > > > VecScatterEnd 3927 1.0 2.2135e+01 1.5 0.00e+00 0.0 0.0e+00 > > > 0.0e+00 > > > > > > > > > From balay at mcs.anl.gov Sat Feb 10 13:11:03 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Sat, 10 Feb 2007 13:11:03 -0600 (CST) Subject: A 3D example of KSPSolve? In-Reply-To: References: <292553.57576.qm@web36210.mail.mud.yahoo.com> Message-ID: Can you send the optupt from the following runs. You can do this with src/ksp/ksp/examples/tutorials/ex2.c - to keep things simple. petscmpirun -n 2 taskset -c 0,2 ./ex2 -log_summary | egrep \(MPI_Send\|MPI_Barrier\) petscmpirun -n 2 taskset -c 0,4 ./ex2 -log_summary | egrep \(MPI_Send\|MPI_Barrier\) petscmpirun -n 2 taskset -c 0,6 ./ex2 -log_summary | egrep \(MPI_Send\|MPI_Barrier\) petscmpirun -n 2 taskset -c 0,8 ./ex2 -log_summary | egrep \(MPI_Send\|MPI_Barrier\) petscmpirun -n 2 taskset -c 0,12 ./ex2 -log_summary | egrep \(MPI_Send\|MPI_Barrier\) petscmpirun -n 2 taskset -c 0,14 ./ex2 -log_summary | egrep \(MPI_Send\|MPI_Barrier\) Satish From billy at dem.uminho.pt Sat Feb 10 12:50:50 2007 From: billy at dem.uminho.pt (billy at dem.uminho.pt) Date: Sat, 10 Feb 2007 18:50:50 +0000 Subject: A 3D example of KSPSolve? In-Reply-To: References: <902001.46633.qm@web36203.mail.mud.yahoo.com> Message-ID: <1171133450.45ce140ad527c@serv-g1.ccom.uminho.pt> Hi, Lately I was using 2D examples and when I changed to 3D I noticed bad performance. When I checked the code it was not allocating enough memory for 3D. Instead of 6 nonzeros in each it row it had 7 and performance went down very significantly as mentioned in PETSc manual. Billy. Quoting Barry Smith : > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > MatGetRow are used to build the right hand side > > vector. > ^^^^^^ > > Huh? 
> > > We use it in order to get the number of nonzero cols, > > global col indices and values in a row. > > Huh? What do you do with all this information? Maybe > we can do what you do with this information much more efficiently? > Without all the calls to MatGetRow(). > > Barry > > > > > The reason it is time consuming is that it is called > > for each row of the matrix. I am not sure how I can > > get away without it. > > Thanks. > > > > Shi > > --- Barry Smith wrote: > > > > > > > > What are all the calls for MatGetRow() for? They > > > are consuming a > > > great deal of time. Is there anyway to get rid of > > > them? > > > > > > Barry > > > > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > Sorry that is not informative. > > > > So I decide to attach the 5 files for > > > NP=1,2,4,8,16 > > > > for > > > > the 400,000 finite element case. > > > > > > > > Please note that the simulation runs over 100 > > > steps. > > > > The 1st step is first order update, named as stage > > > 1. > > > > The rest 99 steps are second order updates. Within > > > > that, stage 2-9 are created for the 8 stages of a > > > > second order update. We should concentrate on the > > > > second order updates. So four calls to KSPSolve in > > > the > > > > log file are important, in stage 4,5,6,and 8 > > > > separately. > > > > Pleaes let me know if you need any other > > > information > > > > or explanation. > > > > Thank you very much. > > > > > > > > Shi > > > > --- Matthew Knepley wrote: > > > > > > > > > You really have to give us the log summary > > > output. > > > > > None of the relevant > > > > > numbers are in your summary. > > > > > > > > > > Thanks, > > > > > > > > > > Matt > > > > > > > > > > On 2/9/07, Shi Jin wrote: > > > > > > > > > > > > Dear Barry, > > > > > > > > > > > > Thank you. > > > > > > I actually have done the staging already. > > > > > > I summarized the timing of the runs in google > > > > > online > > > > > > spreadsheets. I have two runs. > > > > > > 1. with 400,000 finite elements: > > > > > > > > > > > > > > > > > > > > http://spreadsheets.google.com/pub?key=pZHoqlL60quZeDZlucTjEIA > > > > > > 2. with 1,600,000 finite elements: > > > > > > > > > > > > > > > > > > > > http://spreadsheets.google.com/pub?key=pZHoqlL60quZcCVLAqmzqQQ > > > > > > > > > > > > If you can take a look at them and give me > > > some > > > > > > advice, I will be deeply grateful. > > > > > > > > > > > > Shi > > > > > > --- Barry Smith wrote: > > > > > > > > > > > > > > > > > > > > NO, NO, don't spend time stripping your > > > code! > > > > > > > Unproductive > > > > > > > > > > > > > > See the manul pages for > > > > > PetscLogStageRegister(), > > > > > > > PetscLogStagePush() and > > > > > > > PetscLogStagePop(). All you need to do is > > > > > maintain a > > > > > > > seperate stage for each > > > > > > > of your KSPSolves; in your case you'll > > > create 3 > > > > > > > stages. > > > > > > > > > > > > > > Barry > > > > > > > > > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > > > > > > > > > Thank you. > > > > > > > > But my code has 10 calls to KSPSolve of > > > three > > > > > > > > different linear systems at each time > > > update. > > > > > > > Should I > > > > > > > > strip it down to a single KSPSolve so that > > > it > > > > > is > > > > > > > > easier to analysis? I might have the code > > > dump > > > > > the > > > > > > > > Matrix and vector and write another code > > > to > > > > > read > > > > > > > them > > > > > > > > into and call KSPSolve. 
I don't know > > > whether > > > > > this > > > > > > > is > > > > > > > > worth doing or should I just send in the > > > > > messy > > > > > > > log > > > > > > > > file of the whole run. > > > > > > > > Thanks for any advice. > > > > > > > > > > > > > > > > Shi > > > > > > > > > > > > > > > > --- Barry Smith > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > Shi, > > > > > > > > > > > > > > > > > > There is never a better test problem > > > then > > > > > > > your > > > > > > > > > actual problem. > > > > > > > > > Send the results from running on 1, 4, > > > and 8 > > > > > > > > > processes with the options > > > > > > > > > -log_summary -ksp_view (use the > > > optimized > > > > > > > version of > > > > > > > > > PETSc (running > > > > > > > > > config/configure.py --with-debugging=0)) > > > > > > > > > > > > > > > > > > Barry > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 9 Feb 2007, Shi Jin wrote: > > > > > > > > > > > > > > > > > > > Hi there, > > > > > > > > > > > > > > > > > > > > I am tuning our 3D FEM CFD code > > > written > > > > > with > > > > > > > > > PETSc. > > > > > > > > > > The code doesn't scale very well. For > > > > > example, > > > > > > > > > with 8 > > > > > > > > > > processes on a linux cluster, the > > > speedup > > > > > we > > > > > > > > > achieve > > > > > > > > > > with a fairly large problem > > > size(million > > > > > of > > > > > > > > > elements) > > > > > > > > > > is only 3 to 4 using the Congugate > > > > > gradient > > > > > > > > > solver. We > > > > > > > > > > can achieve a speed up of a 6.5 using > > > a > > > > > GMRes > > > > > > > > > solver > > > > > > > > > > but the wall clock time of a GMRes is > > > > > longer > > > > > > > than > > > > > > > > > a CG > > > > > > > > > > solver which indicates that CG is the > > > > > faster > > > > > > > > > solver > > > > > > > > > > and it scales not as good as GMRes. Is > > > > > this > > > > > > > > > generally > > > > > > > > > > true? > > > > > > > > > > > > > > > > > > > > I then went to the examples and find a > > > 2D > > > > > > > example > > > > > > > > > of > > > > > > > > > > KSPSolve (ex2.c). I let the code ran > > > with > > > > > a > > > > > > > > > 1000x1000 > > > > > > > > > > mesh and get a linear scaling of the > > > CG > > > > > solver > > > > > > > and > > > > > > > > > a > > > > > > > > > > super linear scaling of the GMRes. > > > These > > > > > are > > > > > > > both > > > > > > > > > much > > > > > > > > > > better than our code. However, I think > > > the > > > > > 2D > > > > > > > > > nature > > > > > === message truncated === > > > > > > > > > > > ____________________________________________________________________________________ > > Cheap talk? > > Check out Yahoo! Messenger's low PC-to-Phone call rates. > > http://voice.yahoo.com > > > > > > From jinzishuai at yahoo.com Sat Feb 10 16:45:29 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Sat, 10 Feb 2007 14:45:29 -0800 (PST) Subject: A 3D example of KSPSolve? In-Reply-To: Message-ID: <221628.34725.qm@web36210.mail.mud.yahoo.com> Yes. The results follow. --- Satish Balay wrote: > Can you send the optupt from the following runs. You > can do this with > src/ksp/ksp/examples/tutorials/ex2.c - to keep > things simple. 
> > petscmpirun -n 2 taskset -c 0,2 ./ex2 -log_summary | > egrep \(MPI_Send\|MPI_Barrier\) Average time for MPI_Barrier(): 1.81198e-06 Average time for zero size MPI_Send(): 5.00679e-06 > petscmpirun -n 2 taskset -c 0,4 ./ex2 -log_summary | > egrep \(MPI_Send\|MPI_Barrier\) Average time for MPI_Barrier(): 2.00272e-06 Average time for zero size MPI_Send(): 4.05312e-06 > petscmpirun -n 2 taskset -c 0,6 ./ex2 -log_summary | > egrep \(MPI_Send\|MPI_Barrier\) Average time for MPI_Barrier(): 1.7643e-06 Average time for zero size MPI_Send(): 4.05312e-06 > petscmpirun -n 2 taskset -c 0,8 ./ex2 -log_summary | > egrep \(MPI_Send\|MPI_Barrier\) Average time for MPI_Barrier(): 2.00272e-06 Average time for zero size MPI_Send(): 4.05312e-06 > petscmpirun -n 2 taskset -c 0,12 ./ex2 -log_summary > | egrep \(MPI_Send\|MPI_Barrier\) Average time for MPI_Barrier(): 1.57356e-06 Average time for zero size MPI_Send(): 5.48363e-06 > petscmpirun -n 2 taskset -c 0,14 ./ex2 -log_summary > | egrep \(MPI_Send\|MPI_Barrier\) Average time for MPI_Barrier(): 2.00272e-06 Average time for zero size MPI_Send(): 4.52995e-06 I also did petscmpirun -n 2 taskset -c 0,10 ./ex2 -log_summary | egrep \(MPI_Send\|MPI_Barrier\) Average time for MPI_Barrier(): 5.00679e-06 Average time for zero size MPI_Send(): 3.93391e-06 The results are not so different from each other. Also please note, the timing is not exact, some times I got O(1e-5) timings for all cases. I assume these numbers are pretty good, right? Does it indicate that the MPI communication on a SMP machine is very fast? I will do a similar test on a cluster and report it back to the list. Shi ____________________________________________________________________________________ Need Mail bonding? Go to the Yahoo! Mail Q&A for great tips from Yahoo! Answers users. http://answers.yahoo.com/dir/?link=list&sid=396546091 From jinzishuai at yahoo.com Sat Feb 10 17:01:19 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Sat, 10 Feb 2007 15:01:19 -0800 (PST) Subject: A 3D example of KSPSolve? In-Reply-To: <221628.34725.qm@web36210.mail.mud.yahoo.com> Message-ID: <539403.87888.qm@web36205.mail.mud.yahoo.com> Here is the test on a linux cluster with gigabit ethernet interconnect. MPI2/output:Average time for MPI_Barrier(): 6.00338e-05 MPI2/output:Average time for zero size MPI_Send(): 5.40018e-05 MPI4/output:Average time for MPI_Barrier(): 0.00806541 MPI4/output:Average time for zero size MPI_Send(): 6.07371e-05 MPI8/output:Average time for MPI_Barrier(): 0.00805483 MPI8/output:Average time for zero size MPI_Send(): 6.97374e-05 Note MPI indicates the run using N processes. It seems that the MPI_Barrier takes a much longer time do finish than one a SMP machine. Is this a load balance issue or is it merely the show of slow communication speed? Thanks. Shi --- Shi Jin wrote: > Yes. The results follow. > --- Satish Balay wrote: > > > Can you send the optupt from the following runs. > You > > can do this with > > src/ksp/ksp/examples/tutorials/ex2.c - to keep > > things simple. 
> > > > petscmpirun -n 2 taskset -c 0,2 ./ex2 -log_summary > | > > egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 1.81198e-06 > Average time for zero size MPI_Send(): 5.00679e-06 > > petscmpirun -n 2 taskset -c 0,4 ./ex2 -log_summary > | > > egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 2.00272e-06 > Average time for zero size MPI_Send(): 4.05312e-06 > > petscmpirun -n 2 taskset -c 0,6 ./ex2 -log_summary > | > > egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 1.7643e-06 > Average time for zero size MPI_Send(): 4.05312e-06 > > petscmpirun -n 2 taskset -c 0,8 ./ex2 -log_summary > | > > egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 2.00272e-06 > Average time for zero size MPI_Send(): 4.05312e-06 > > petscmpirun -n 2 taskset -c 0,12 ./ex2 > -log_summary > > | egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 1.57356e-06 > Average time for zero size MPI_Send(): 5.48363e-06 > > petscmpirun -n 2 taskset -c 0,14 ./ex2 > -log_summary > > | egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 2.00272e-06 > Average time for zero size MPI_Send(): 4.52995e-06 > I also did > petscmpirun -n 2 taskset -c 0,10 ./ex2 -log_summary > | > egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 5.00679e-06 > Average time for zero size MPI_Send(): 3.93391e-06 > > > The results are not so different from each other. > Also > please note, the timing is not exact, some times I > got > O(1e-5) timings for all cases. > I assume these numbers are pretty good, right? Does > it > indicate that the MPI communication on a SMP machine > is very fast? > I will do a similar test on a cluster and report it > back to the list. > > Shi > > > > > > ____________________________________________________________________________________ > Need Mail bonding? > Go to the Yahoo! Mail Q&A for great tips from Yahoo! > Answers users. > http://answers.yahoo.com/dir/?link=list&sid=396546091 > > ____________________________________________________________________________________ We won't tell. Get more on shows you hate to love (and love to hate): Yahoo! TV's Guilty Pleasures list. http://tv.yahoo.com/collections/265 From bsmith at mcs.anl.gov Sat Feb 10 17:03:49 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Sat, 10 Feb 2007 17:03:49 -0600 (CST) Subject: A 3D example of KSPSolve? In-Reply-To: <539403.87888.qm@web36205.mail.mud.yahoo.com> References: <539403.87888.qm@web36205.mail.mud.yahoo.com> Message-ID: gigabit ethernet has huge latencies; it is not good enough for a cluster. Barry On Sat, 10 Feb 2007, Shi Jin wrote: > Here is the test on a linux cluster with gigabit > ethernet interconnect. > MPI2/output:Average time for MPI_Barrier(): > 6.00338e-05 > MPI2/output:Average time for zero size MPI_Send(): > 5.40018e-05 > MPI4/output:Average time for MPI_Barrier(): 0.00806541 > MPI4/output:Average time for zero size MPI_Send(): > 6.07371e-05 > MPI8/output:Average time for MPI_Barrier(): 0.00805483 > MPI8/output:Average time for zero size MPI_Send(): > 6.97374e-05 > > Note MPI indicates the run using N processes. > It seems that the MPI_Barrier takes a much longer time > do finish than one a SMP machine. Is this a load > balance issue or is it merely the show of slow > communication speed? > Thanks. > Shi > --- Shi Jin wrote: > > > Yes. The results follow. > > --- Satish Balay wrote: > > > > > Can you send the optupt from the following runs. 
> > You > > > can do this with > > > src/ksp/ksp/examples/tutorials/ex2.c - to keep > > > things simple. > > > > > > petscmpirun -n 2 taskset -c 0,2 ./ex2 -log_summary > > | > > > egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 1.81198e-06 > > Average time for zero size MPI_Send(): 5.00679e-06 > > > petscmpirun -n 2 taskset -c 0,4 ./ex2 -log_summary > > | > > > egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 2.00272e-06 > > Average time for zero size MPI_Send(): 4.05312e-06 > > > petscmpirun -n 2 taskset -c 0,6 ./ex2 -log_summary > > | > > > egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 1.7643e-06 > > Average time for zero size MPI_Send(): 4.05312e-06 > > > petscmpirun -n 2 taskset -c 0,8 ./ex2 -log_summary > > | > > > egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 2.00272e-06 > > Average time for zero size MPI_Send(): 4.05312e-06 > > > petscmpirun -n 2 taskset -c 0,12 ./ex2 > > -log_summary > > > | egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 1.57356e-06 > > Average time for zero size MPI_Send(): 5.48363e-06 > > > petscmpirun -n 2 taskset -c 0,14 ./ex2 > > -log_summary > > > | egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 2.00272e-06 > > Average time for zero size MPI_Send(): 4.52995e-06 > > I also did > > petscmpirun -n 2 taskset -c 0,10 ./ex2 -log_summary > > | > > egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 5.00679e-06 > > Average time for zero size MPI_Send(): 3.93391e-06 > > > > > > The results are not so different from each other. > > Also > > please note, the timing is not exact, some times I > > got > > O(1e-5) timings for all cases. > > I assume these numbers are pretty good, right? Does > > it > > indicate that the MPI communication on a SMP machine > > is very fast? > > I will do a similar test on a cluster and report it > > back to the list. > > > > Shi > > > > > > > > > > > > > ____________________________________________________________________________________ > > Need Mail bonding? > > Go to the Yahoo! Mail Q&A for great tips from Yahoo! > > Answers users. > > > http://answers.yahoo.com/dir/?link=list&sid=396546091 > > > > > > > > > ____________________________________________________________________________________ > We won't tell. Get more on shows you hate to love > (and love to hate): Yahoo! TV's Guilty Pleasures list. > http://tv.yahoo.com/collections/265 > > From jinzishuai at yahoo.com Sat Feb 10 17:10:22 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Sat, 10 Feb 2007 15:10:22 -0800 (PST) Subject: A 3D example of KSPSolve? In-Reply-To: <221628.34725.qm@web36210.mail.mud.yahoo.com> Message-ID: <918742.70603.qm@web36202.mail.mud.yahoo.com> Furthermore, I did a multi-process test on the SMP. 
petscmpirun -n 3 taskset -c 0,2,4 ./ex2 -ksp_type cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\) Average time for MPI_Barrier(): 4.19617e-06 Average time for zero size MPI_Send(): 3.65575e-06 petscmpirun -n 4 taskset -c 0,2,4,6 ./ex2 -ksp_type cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\) Average time for MPI_Barrier(): 1.75953e-05 Average time for zero size MPI_Send(): 2.44975e-05 petscmpirun -n 5 taskset -c 0,2,4,6,8 ./ex2 -ksp_type cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\) Average time for MPI_Barrier(): 4.22001e-05 Average time for zero size MPI_Send(): 2.54154e-05 petscmpirun -n 6 taskset -c 0,2,4,6,8,10 ./ex2 -ksp_type cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\) Average time for MPI_Barrier(): 4.87804e-05 Average time for zero size MPI_Send(): 1.83185e-05 petscmpirun -n 7 taskset -c 0,2,4,6,8,10,12 ./ex2 -ksp_type cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\) Average time for MPI_Barrier(): 2.37942e-05 Average time for zero size MPI_Send(): 5.00679e-06 petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 ./ex2 -ksp_type cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\) Average time for MPI_Barrier(): 1.35899e-05 Average time for zero size MPI_Send(): 6.73532e-06 They all seem quite fast. Shi --- Shi Jin wrote: > Yes. The results follow. > --- Satish Balay wrote: > > > Can you send the optupt from the following runs. > You > > can do this with > > src/ksp/ksp/examples/tutorials/ex2.c - to keep > > things simple. > > > > petscmpirun -n 2 taskset -c 0,2 ./ex2 -log_summary > | > > egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 1.81198e-06 > Average time for zero size MPI_Send(): 5.00679e-06 > > petscmpirun -n 2 taskset -c 0,4 ./ex2 -log_summary > | > > egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 2.00272e-06 > Average time for zero size MPI_Send(): 4.05312e-06 > > petscmpirun -n 2 taskset -c 0,6 ./ex2 -log_summary > | > > egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 1.7643e-06 > Average time for zero size MPI_Send(): 4.05312e-06 > > petscmpirun -n 2 taskset -c 0,8 ./ex2 -log_summary > | > > egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 2.00272e-06 > Average time for zero size MPI_Send(): 4.05312e-06 > > petscmpirun -n 2 taskset -c 0,12 ./ex2 > -log_summary > > | egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 1.57356e-06 > Average time for zero size MPI_Send(): 5.48363e-06 > > petscmpirun -n 2 taskset -c 0,14 ./ex2 > -log_summary > > | egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 2.00272e-06 > Average time for zero size MPI_Send(): 4.52995e-06 > I also did > petscmpirun -n 2 taskset -c 0,10 ./ex2 -log_summary > | > egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 5.00679e-06 > Average time for zero size MPI_Send(): 3.93391e-06 > > > The results are not so different from each other. > Also > please note, the timing is not exact, some times I > got > O(1e-5) timings for all cases. > I assume these numbers are pretty good, right? Does > it > indicate that the MPI communication on a SMP machine > is very fast? > I will do a similar test on a cluster and report it > back to the list. > > Shi > > > > > > ____________________________________________________________________________________ > Need Mail bonding? > Go to the Yahoo! Mail Q&A for great tips from Yahoo! > Answers users. > http://answers.yahoo.com/dir/?link=list&sid=396546091 > > ____________________________________________________________________________________ Yahoo! 
Music Unlimited Access over 1 million songs. http://music.yahoo.com/unlimited From jinzishuai at yahoo.com Sat Feb 10 17:54:53 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Sat, 10 Feb 2007 15:54:53 -0800 (PST) Subject: A 3D example of KSPSolve? In-Reply-To: Message-ID: <37521.92324.qm@web36202.mail.mud.yahoo.com> I understand but this is our reality. I did the same test on a cluster with infiniband: MPI2/output:Average time for MPI_Barrier(): 9.58443e-06 MPI2/output:Average time for zero size MPI_Send(): 8.9407e-06 MPI4/output:Average time for MPI_Barrier(): 1.93596e-05 MPI4/output:Average time for zero size MPI_Send(): 1.0252e-05 MPI8/output:Average time for MPI_Barrier(): 3.33786e-05 MPI8/output:Average time for zero size MPI_Send(): 1.01328e-05 MPI16/output:Average time for MPI_Barrier(): 4.53949e-05 MPI16/output:Average time for zero size MPI_Send(): 9.87947e-06 The MPI_Barrier problem becomes much better. However, when our code is tested on both clusters (gigabit and infiniband), we don't see much difference in their performance. I attach the log file for a run with 4 processes on this infiniband cluster. Shi --- Barry Smith wrote: > > gigabit ethernet has huge latencies; it is not > good enough for a cluster. > > Barry > > > On Sat, 10 Feb 2007, Shi Jin wrote: > > > Here is the test on a linux cluster with gigabit > > ethernet interconnect. > > MPI2/output:Average time for MPI_Barrier(): > > 6.00338e-05 > > MPI2/output:Average time for zero size MPI_Send(): > > 5.40018e-05 > > MPI4/output:Average time for MPI_Barrier(): > 0.00806541 > > MPI4/output:Average time for zero size MPI_Send(): > > 6.07371e-05 > > MPI8/output:Average time for MPI_Barrier(): > 0.00805483 > > MPI8/output:Average time for zero size MPI_Send(): > > 6.97374e-05 > > > > Note MPI indicates the run using N processes. > > It seems that the MPI_Barrier takes a much longer > time > > do finish than one a SMP machine. Is this a load > > balance issue or is it merely the show of slow > > communication speed? > > Thanks. > > Shi > > --- Shi Jin wrote: > > > > > Yes. The results follow. > > > --- Satish Balay wrote: > > > > > > > Can you send the optupt from the following > runs. > > > You > > > > can do this with > > > > src/ksp/ksp/examples/tutorials/ex2.c - to keep > > > > things simple. 
> > > > > > > > petscmpirun -n 2 taskset -c 0,2 ./ex2 > -log_summary > > > | > > > > egrep \(MPI_Send\|MPI_Barrier\) > > > Average time for MPI_Barrier(): 1.81198e-06 > > > Average time for zero size MPI_Send(): > 5.00679e-06 > > > > petscmpirun -n 2 taskset -c 0,4 ./ex2 > -log_summary > > > | > > > > egrep \(MPI_Send\|MPI_Barrier\) > > > Average time for MPI_Barrier(): 2.00272e-06 > > > Average time for zero size MPI_Send(): > 4.05312e-06 > > > > petscmpirun -n 2 taskset -c 0,6 ./ex2 > -log_summary > > > | > > > > egrep \(MPI_Send\|MPI_Barrier\) > > > Average time for MPI_Barrier(): 1.7643e-06 > > > Average time for zero size MPI_Send(): > 4.05312e-06 > > > > petscmpirun -n 2 taskset -c 0,8 ./ex2 > -log_summary > > > | > > > > egrep \(MPI_Send\|MPI_Barrier\) > > > Average time for MPI_Barrier(): 2.00272e-06 > > > Average time for zero size MPI_Send(): > 4.05312e-06 > > > > petscmpirun -n 2 taskset -c 0,12 ./ex2 > > > -log_summary > > > > | egrep \(MPI_Send\|MPI_Barrier\) > > > Average time for MPI_Barrier(): 1.57356e-06 > > > Average time for zero size MPI_Send(): > 5.48363e-06 > > > > petscmpirun -n 2 taskset -c 0,14 ./ex2 > > > -log_summary > > > > | egrep \(MPI_Send\|MPI_Barrier\) > > > Average time for MPI_Barrier(): 2.00272e-06 > > > Average time for zero size MPI_Send(): > 4.52995e-06 > > > I also did > > > petscmpirun -n 2 taskset -c 0,10 ./ex2 > -log_summary > > > | > > > egrep \(MPI_Send\|MPI_Barrier\) > > > Average time for MPI_Barrier(): 5.00679e-06 > > > Average time for zero size MPI_Send(): > 3.93391e-06 > > > > > > > > > The results are not so different from each > other. > > > Also > > > please note, the timing is not exact, some times > I > > > got > > > O(1e-5) timings for all cases. > > > I assume these numbers are pretty good, right? > Does > > > it > > > indicate that the MPI communication on a SMP > machine > > > is very fast? > > > I will do a similar test on a cluster and report > it > > > back to the list. > > > > > > Shi > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Need Mail bonding? > > > Go to the Yahoo! Mail Q&A for great tips from > Yahoo! > > > Answers users. > > > > > > http://answers.yahoo.com/dir/?link=list&sid=396546091 > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > We won't tell. Get more on shows you hate to love > > (and love to hate): Yahoo! TV's Guilty Pleasures > list. > > http://tv.yahoo.com/collections/265 > > > > > > ____________________________________________________________________________________ Any questions? Get answers on any topic at www.Answers.yahoo.com. Try it now. -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: log-4-infiniband.txt URL: From jinzishuai at yahoo.com Sat Feb 10 18:36:13 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Sat, 10 Feb 2007 16:36:13 -0800 (PST) Subject: A 3D example of KSPSolve? In-Reply-To: Message-ID: <497379.41444.qm@web36203.mail.mud.yahoo.com> Hi, I am a bit confused at how to interpret the log_summary results. In my previous log files, I logged everything in that solving staging, including constructing the matrix and vector and the KSPSolve. I then specifically change the code so that each KSPSolve() function is tightly included within the PetscLogStagePush() and PetscLogStagePop() pair so that we exclude the other timings and concentrate on the linear solver. 
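In outline, the wrapping looks like the following (a minimal sketch rather than the actual application code; it assumes an already configured KSP object ksp with right-hand side b and solution x, the stage name is arbitrary, and the argument order of PetscLogStageRegister() shown is the one in current releases -- older releases take the stage pointer as the first argument, so check the manual page for the installed version):

    #include "petscksp.h"

    /* Give KSPSolve() its own logging stage so that -log_summary
       reports it separately from assembly and everything else. */
    PetscErrorCode SolveInOwnStage(KSP ksp, Vec b, Vec x)
    {
      static PetscLogStage stage = -1;
      PetscErrorCode       ierr;

      if (stage == -1) {  /* register only once; reuse the stage for every solve */
        ierr = PetscLogStageRegister("KSPSolve only", &stage);CHKERRQ(ierr);
      }
      ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
      ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);  /* only this call is charged to the stage */
      ierr = PetscLogStagePop();CHKERRQ(ierr);
      return 0;
    }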
In this way, I still get list of 16 functions in that stage, although I only included one (KSPSolve). They are VecDot VecNorm VecCopy VecSet VecAXPY VecAYPX VecScatterBegin VecScatterEnd MatMult MatSolve MatLUFactorNum KSPSetup KSPSolve PCSetUp PCSetUpOnBlocks PCApply Are these functions called by the KSPSolve() (in this case, I used -ksp_type cg). I suppose the only network communications are done in the function calls VecScatterBegin VecScatterEnd If I am to compute the percentage of communication specifically for KSPSolve(), shall I just use the times of VecScatterBegin & VecScatterEnd devided by the time of KSPSolve? Or shall I use MatMult, like Satish did in his previous emails? I am a bit confused. Please advise. Thank you very much. Shi --- Satish Balay wrote: > > Just looking at 8 proc run [diffusion stage] we > have: > > MatMult : 79 sec > MatMultAdd : 2 sec > VecScatterBegin: 17 sec > VecScatterEnd : 51 sec > > So basically the communication in MatMult/Add is > represented by > VecScatters. Here out of 81 sec total - 68 seconds > are used for > communication [with a load imbalance of 11 for > vecscaterend] > > So - I think MPI performance is reducing scalability > here.. > > Things to try: > > * -vecstatter_rr etc options I sugested earlier > > * install mpich with '--with-device=ch3:ssm' and see > if it makes a difference > > Satish > > --- Event Stage 4: Diffusion > > [x]rhsLtP 297 1.0 1.1017e+02 1.5 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 7 0 0 0 0 39 0 0 > 0 0 0 > [x]rhsGravity 99 1.0 4.2582e+0083.5 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 > 0 0 0 > VecDot 4657 1.0 2.5748e+01 3.2 7.60e+07 > 3.2 0.0e+00 0.0e+00 4.7e+03 1 1 0 0 6 5 3 0 > 0 65 191 > VecNorm 2477 1.0 2.2109e+01 2.2 3.22e+07 > 2.2 0.0e+00 0.0e+00 2.5e+03 1 0 0 0 3 5 2 0 > 0 35 118 > VecScale 594 1.0 2.9330e-02 1.5 2.61e+08 > 1.5 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > 0 0 1361 > VecCopy 594 1.0 2.7552e-01 1.3 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > 0 0 0 > VecSet 3665 1.0 6.0793e-01 1.4 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > 0 0 0 > VecAXPY 5251 1.0 2.5892e+00 1.2 3.31e+08 > 1.2 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 1 4 0 > 0 0 2137 > VecAYPX 1883 1.0 8.6419e-01 1.3 3.62e+08 > 1.3 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 1 0 > 0 0 2296 > VecScatterBegin 2873 1.0 1.7569e+01 3.0 0.00e+00 > 0.0 3.8e+04 1.6e+05 0.0e+00 1 0 10 20 0 5 > 0100100 0 0 > VecScatterEnd 2774 1.0 5.1519e+0110.9 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 7 0 0 > 0 0 0 > MatMult 2477 1.0 7.9186e+01 2.4 2.34e+08 > 2.4 3.5e+04 1.7e+05 0.0e+00 3 11 9 20 0 20 48 91 > 98 0 850 > MatMultAdd 297 1.0 2.8161e+00 5.4 4.46e+07 > 2.2 3.6e+03 3.4e+04 0.0e+00 0 0 1 0 0 0 0 9 > 2 0 125 > MatSolve 2477 1.0 6.2245e+01 1.2 1.41e+08 > 1.2 0.0e+00 0.0e+00 0.0e+00 4 10 0 0 0 22 41 0 > 0 0 926 > MatLUFactorNum 3 1.0 2.7686e-01 1.1 2.79e+08 > 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > 0 0 2016 > MatGetRow 19560420 1.0 5.5195e+01 1.6 > 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 > 20 0 0 0 0 0 > KSPSetup 6 1.0 3.0756e-05 2.8 0.00e+00 > 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > 0 0 0 > KSPSolve 297 1.0 1.3142e+02 1.0 1.31e+08 > 1.1 3.1e+04 1.7e+05 7.1e+03 8 22 8 18 9 50 93 80 > 86100 1001 > PCSetUp 6 1.0 2.7700e-01 1.1 2.78e+08 > 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > 0 0 2015 > PCSetUpOnBlocks 297 1.0 2.7794e-01 1.1 2.78e+08 > 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > 0 0 2008 > PCApply 2477 1.0 6.2772e+01 1.2 1.39e+08 > 1.2 0.0e+00 0.0e+00 0.0e+00 4 10 0 0 0 23 41 0 > 0 0 918 > > 
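As a rough illustration with the figures quoted above (all of them max-over-processor times, so the arithmetic is only approximate): the scatters account for about 17.6 s + 51.5 s = 69 s, which is roughly 85% of the 81 s spent in MatMult/MatMultAdd and about half of the 131 s KSPSolve. The MPI_Allreduce-based VecDot (about 26 s) and VecNorm (about 22 s) would raise the communication share further, although part of those timings is local arithmetic and load imbalance rather than message passing.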
____________________________________________________________________________________ It's here! Your new message! Get new email alerts with the free Yahoo! Toolbar. http://tools.search.yahoo.com/toolbar/features/mail/ From bsmith at mcs.anl.gov Sat Feb 10 18:43:33 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Sat, 10 Feb 2007 18:43:33 -0600 (CST) Subject: A 3D example of KSPSolve? In-Reply-To: <497379.41444.qm@web36203.mail.mud.yahoo.com> References: <497379.41444.qm@web36203.mail.mud.yahoo.com> Message-ID: On Sat, 10 Feb 2007, Shi Jin wrote: > Hi, I am a bit confused at how to interpret the > log_summary results. In my previous log files, I > logged everything in that solving staging, including > constructing the matrix and vector and the KSPSolve. > I then specifically change the code so that each > KSPSolve() function is tightly included within the > PetscLogStagePush() and PetscLogStagePop() pair so > that we exclude the other timings and concentrate on > the linear solver. > In this way, I still get list of 16 functions in that > stage, although I only included one (KSPSolve). They > are > VecDot > VecNorm > VecCopy > VecSet > VecAXPY > VecAYPX > VecScatterBegin > VecScatterEnd > MatMult > MatSolve > MatLUFactorNum > KSPSetup > KSPSolve > PCSetUp > PCSetUpOnBlocks > PCApply > Are these functions called by the KSPSolve() (in this > case, I used -ksp_type cg). YES > I suppose the only network communications are done in > the function calls > VecScatterBegin > VecScatterEnd The message passing. VecDot, VecNorm have MPI_Allreduce()s > If I am to compute the percentage of communication > specifically for KSPSolve(), shall I just use the > times of VecScatterBegin & VecScatterEnd devided by > the time of KSPSolve? Or shall I use MatMult, like > Satish did in his previous emails? I am a bit > confused. Please advise. You can do either; using Mult tells you how well the mult is doing in terms of message passing communication. Using ksp tells how in the entire solve. You can add the option -log_sync and it will try to seperate the amount of time in the dot, norm and scatters that is actually spent on communication and how much is spent on synchronization (due to load inbalance). Barry > > Thank you very much. > > Shi > --- Satish Balay wrote: > > > > > Just looking at 8 proc run [diffusion stage] we > > have: > > > > MatMult : 79 sec > > MatMultAdd : 2 sec > > VecScatterBegin: 17 sec > > VecScatterEnd : 51 sec > > > > So basically the communication in MatMult/Add is > > represented by > > VecScatters. Here out of 81 sec total - 68 seconds > > are used for > > communication [with a load imbalance of 11 for > > vecscaterend] > > > > So - I think MPI performance is reducing scalability > > here.. 
> > > > Things to try: > > > > * -vecstatter_rr etc options I sugested earlier > > > > * install mpich with '--with-device=ch3:ssm' and see > > if it makes a difference > > > > Satish > > > > --- Event Stage 4: Diffusion > > > > [x]rhsLtP 297 1.0 1.1017e+02 1.5 0.00e+00 > > 0.0 0.0e+00 0.0e+00 0.0e+00 7 0 0 0 0 39 0 0 > > 0 0 0 > > [x]rhsGravity 99 1.0 4.2582e+0083.5 0.00e+00 > > 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 > > 0 0 0 > > VecDot 4657 1.0 2.5748e+01 3.2 7.60e+07 > > 3.2 0.0e+00 0.0e+00 4.7e+03 1 1 0 0 6 5 3 0 > > 0 65 191 > > VecNorm 2477 1.0 2.2109e+01 2.2 3.22e+07 > > 2.2 0.0e+00 0.0e+00 2.5e+03 1 0 0 0 3 5 2 0 > > 0 35 118 > > VecScale 594 1.0 2.9330e-02 1.5 2.61e+08 > > 1.5 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > > 0 0 1361 > > VecCopy 594 1.0 2.7552e-01 1.3 0.00e+00 > > 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > > 0 0 0 > > VecSet 3665 1.0 6.0793e-01 1.4 0.00e+00 > > 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > > 0 0 0 > > VecAXPY 5251 1.0 2.5892e+00 1.2 3.31e+08 > > 1.2 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 1 4 0 > > 0 0 2137 > > VecAYPX 1883 1.0 8.6419e-01 1.3 3.62e+08 > > 1.3 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 1 0 > > 0 0 2296 > > VecScatterBegin 2873 1.0 1.7569e+01 3.0 0.00e+00 > > 0.0 3.8e+04 1.6e+05 0.0e+00 1 0 10 20 0 5 > > 0100100 0 0 > > VecScatterEnd 2774 1.0 5.1519e+0110.9 0.00e+00 > > 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 7 0 0 > > 0 0 0 > > MatMult 2477 1.0 7.9186e+01 2.4 2.34e+08 > > 2.4 3.5e+04 1.7e+05 0.0e+00 3 11 9 20 0 20 48 91 > > 98 0 850 > > MatMultAdd 297 1.0 2.8161e+00 5.4 4.46e+07 > > 2.2 3.6e+03 3.4e+04 0.0e+00 0 0 1 0 0 0 0 9 > > 2 0 125 > > MatSolve 2477 1.0 6.2245e+01 1.2 1.41e+08 > > 1.2 0.0e+00 0.0e+00 0.0e+00 4 10 0 0 0 22 41 0 > > 0 0 926 > > MatLUFactorNum 3 1.0 2.7686e-01 1.1 2.79e+08 > > 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > > 0 0 2016 > > MatGetRow 19560420 1.0 5.5195e+01 1.6 > > 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 > > 20 0 0 0 0 0 > > KSPSetup 6 1.0 3.0756e-05 2.8 0.00e+00 > > 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > > 0 0 0 > > KSPSolve 297 1.0 1.3142e+02 1.0 1.31e+08 > > 1.1 3.1e+04 1.7e+05 7.1e+03 8 22 8 18 9 50 93 80 > > 86100 1001 > > PCSetUp 6 1.0 2.7700e-01 1.1 2.78e+08 > > 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > > 0 0 2015 > > PCSetUpOnBlocks 297 1.0 2.7794e-01 1.1 2.78e+08 > > 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 > > 0 0 2008 > > PCApply 2477 1.0 6.2772e+01 1.2 1.39e+08 > > 1.2 0.0e+00 0.0e+00 0.0e+00 4 10 0 0 0 23 41 0 > > 0 0 918 > > > > > > > > > ____________________________________________________________________________________ > It's here! Your new message! > Get new email alerts with the free Yahoo! Toolbar. > http://tools.search.yahoo.com/toolbar/features/mail/ > > From zonexo at gmail.com Sat Feb 10 19:02:54 2007 From: zonexo at gmail.com (Ben Tay) Date: Sun, 11 Feb 2007 09:02:54 +0800 Subject: understanding the output from -info In-Reply-To: References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090816qb6d1325g1d311a0eb53eec26@mail.gmail.com> <804ab5d40702091651h6265a510jf5d4ca46cd526876@mail.gmail.com> <804ab5d40702100028sf595a2apae8aba2fda9251f3@mail.gmail.com> <804ab5d40702100117i5977f5bh9b161c026f16a32a@mail.gmail.com> Message-ID: <804ab5d40702101702s71c974d7u39a97d6ab8058cf4@mail.gmail.com> Hi, In other words, for my CFD code, it is not possible to parallelize it effectively because the problem is too small? Is these true for all parallel solver, or just PETSc? 
I was hoping to reduce the runtime since mine is an unsteady problem which requires many steps to reach a periodic state and it takes many hours to reach it. Lastly, if I'm running on 2 processors, will there be improvement likely? Thank you. On 2/11/07, Barry Smith wrote: > > > > On Sat, 10 Feb 2007, Ben Tay wrote: > > > Hi, > > > > I've repeated the test with n,m = 800. Now serial takes around 11mins > while > > parallel with 4 processors took 6mins. Does it mean that the problem > must be > > pretty large before it is more superior to use parallel? Moreover > 800x800 > > means there's 640000 unknowns. My problem is a 2D CFD code which > typically > > has 200x80=16000 unknowns. Does it mean that I won't be able to benefit > from > ^^^^^^^^^^^ > You'll never get much performance past 2 processors; its not even worth > all the work of having a parallel code in this case. I'd just optimize the > heck out of the serial code. > > Barry > > > > > running in parallel? > > > > Btw, this is the parallel's log_summary: > > > > > > Event Count Time (sec) > > Flops/sec --- Global --- --- Stage --- Total > > Max Ratio Max Ratio Max Ratio Mess Avg len > > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > > ------------------------------------------------------------------------------------------------------------------------ > > > > --- Event Stage 0: Main Stage > > > > MatMult 1265 1.0 7.0615e+01 1.2 3.22e+07 1.2 7.6e+03 6.4e+03 > > 0.0e+00 16 11100100 0 16 11100100 0 103 > > MatSolve 1265 1.0 4.7820e+01 1.2 4.60e+07 1.2 0.0e+00 0.0e+00 > > 0.0e+00 11 11 0 0 0 11 11 0 0 0 152 > > MatLUFactorNum 1 1.0 2.5703e-01 2.3 1.27e+07 2.3 0.0e+00 0.0e+00 > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 22 > > MatILUFactorSym 1 1.0 1.8933e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatAssemblyBegin 1 1.0 4.2153e-01 3.5 0.00e+00 0.0 0.0e+00 0.0e+00 > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatAssemblyEnd 1 1.0 3.6475e-01 1.5 0.00e+00 0.0 6.0e+00 3.2e+03 > > 1.3e+01 0 0 0 0 0 0 0 0 0 0 0 > > MatGetOrdering 1 1.0 1.2088e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > VecMDot 1224 1.0 1.5314e+02 1.2 4.63e+07 1.2 0.0e+00 0.0e+00 > > 1.2e+03 36 36 0 0 31 36 36 0 0 31 158 > > VecNorm 1266 1.0 1.0215e+02 1.1 4.31e+06 1.1 0.0e+00 0.0e+00 > > 1.3e+03 24 2 0 0 33 24 2 0 0 33 16 > > VecScale 1265 1.0 3.7467e+00 1.5 8.34e+07 1.5 0.0e+00 0.0e+00 > > 0.0e+00 1 1 0 0 0 1 1 0 0 0 216 > > VecCopy 41 1.0 2.5530e-01 2.8 0.00e+00 0.0 0.0e+00 0.0e+00 > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > VecSet 1308 1.0 3.2717e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 > > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > VecAXPY 82 1.0 5.3338e-01 2.8 1.40e+08 2.8 0.0e+00 0.0e+00 > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 197 > > VecMAXPY 1265 1.0 4.6234e+01 1.2 1.74e+08 1.2 0.0e+00 0.0e+00 > > 0.0e+00 10 38 0 0 0 10 38 0 0 0 557 > > VecScatterBegin 1265 1.0 1.5684e-01 1.6 0.00e+00 0.0 7.6e+03 6.4e+03 > > 0.0e+00 0 0100100 0 0 0100100 0 0 > > VecScatterEnd 1265 1.0 4.3167e+01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > > 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 > > VecNormalize 1265 1.0 1.0459e+02 1.1 6.21e+06 1.1 0.0e+00 0.0e+00 > > 1.3e+03 25 4 0 0 32 25 4 0 0 32 23 > > KSPGMRESOrthog 1224 1.0 1.9035e+02 1.1 7.00e+07 1.1 0.0e+00 0.0e+00 > > 1.2e+03 45 72 0 0 31 45 72 0 0 31 254 > > KSPSetup 2 1.0 5.1674e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > > 1.0e+01 0 0 0 0 0 0 0 0 0 0 0 > > KSPSolve 1 1.0 4.0269e+02 1.0 4.16e+07 1.0 7.6e+03 6.4e+03 > > 3.9e+03 99100100100 99 99100100100 99 166 > > PCSetUp 2 1.0 4.5924e-01 2.6 8.23e+06 2.6 0.0e+00 0.0e+00 > > 6.0e+00 0 0 0 0 0 0 
0 0 0 0 12 > > PCSetUpOnBlocks 1 1.0 4.5847e-01 2.6 8.26e+06 2.6 0.0e+00 0.0e+00 > > 4.0e+00 0 0 0 0 0 0 0 0 0 0 13 > > PCApply 1265 1.0 5.0990e+01 1.2 4.33e+07 1.2 0.0e+00 0.0e+00 > > 1.3e+03 12 11 0 0 32 12 11 0 0 32 143 > > > ------------------------------------------------------------------------------------------------------------------------ > > > > Memory usage is given in bytes: > > > > Object Type Creations Destructions Memory Descendants' > Mem. > > > > --- Event Stage 0: Main Stage > > > > Matrix 4 4 643208 0 > > Index Set 5 5 1924296 0 > > Vec 41 41 47379984 0 > > Vec Scatter 1 1 0 0 > > Krylov Solver 2 2 16880 0 > > Preconditioner 2 2 196 0 > > > ======================================================================================================================== > > Average time to get PetscTime(): 1.00136e-06 > > Average time for MPI_Barrier(): 4.00066e-05 > > Average time for zero size MPI_Send(): 1.70469e-05 > > OptionTable: -log_summary > > Compiled without FORTRAN kernels > > Compiled with full precision matrices (default) > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 > > sizeof(PetscScalar) 8 > > Configure run at: Thu Jan 18 12:23:31 2007 > > Configure options: --with-vendor-compilers=intel --with-x=0 > --with-shared > > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 > > --with-mpi-dir=/opt/mpich/myrinet/intel/ > > ----------------------------------------- > > > > > > > > > > > > > > > > On 2/10/07, Ben Tay wrote: > > > > > > Hi, > > > > > > I tried to use ex2f.F as a test code. I've changed the number n,m from > 3 > > > to 500 each. I ran the code using 1 processor and then with 4 > processor. I > > > then repeat the same with the following modification: > > > > > > > > > do i=1,10 > > > > > > call KSPSolve(ksp,b,x,ierr) > > > > > > end do > > > I've added to do loop to make the solving repeat 10 times. > > > > > > In both cases, the serial code is faster, e.g. 1 taking 2.4 min while > the > > > other 3.3 min. 
> > > > > > Here's the log_summary: > > > > > > > > > ---------------------------------------------- PETSc Performance > Summary: > > > ---------------------------------------------- > > > > > > ./ex2f on a linux-mpi named atlas12.nus.edu.sg with 4 processors, by > > > g0306332 Sat Feb 10 16:21:36 2007 > > > Using Petsc Release Version 2.3.2, Patch 8, Tue Jan 2 14:33:59 PST > 2007 > > > HG revision: ebeddcedcc065e32fc252af32cf1d01ed4fc7a80 > > > > > > Max Max/Min Avg Total > > > Time (sec): 2.213e+02 1.00051 2.212e+02 > > > Objects: 5.500e+01 1.00000 5.500e+01 > > > Flops: 4.718e+09 1.00019 4.718e+09 1.887e+10 > > > Flops/sec: 2.134e+07 1.00070 2.133e+07 8.531e+07 > > > > > > Memory: 3.186e+07 1.00069 1.274e+08 > > > MPI Messages: 1.832e+03 2.00000 1.374e+03 5.496e+03 > > > MPI Message Lengths: 7.324e+06 2.00000 3.998e+03 2.197e+07 > > > MPI Reductions: 7.112e+02 1.00000 > > > > > > Flop counting convention: 1 flop = 1 real number operation of type > > > (multiply/divide/add/subtract) > > > e.g., VecAXPY() for real vectors of length > N > > > --> 2N flops > > > and VecAXPY() for complex vectors of > length N > > > --> 8N flops > > > > > > Summary of Stages: ----- Time ------ ----- Flops ----- --- > Messages > > > --- -- Message Lengths -- -- Reductions -- > > > Avg %Total Avg %Total counts > > > %Total Avg %Total counts %Total > > > 0: Main Stage: 2.2120e+02 100.0% 1.8871e+10 100.0% 5.496e+03 > > > 100.0% 3.998e+03 100.0% 2.845e+03 100.0% > > > > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > See the 'Profiling' chapter of the users' manual for details on > > > interpreting output. > > > Phase summary info: > > > Count: number of times phase was executed > > > Time and Flops/sec: Max - maximum over all processors > > > Ratio - ratio of maximum to minimum over all > > > processors > > > Mess: number of messages sent > > > Avg. len: average message length > > > Reduct: number of global reductions > > > Global: entire computation > > > Stage: stages of a computation. Set stages with PetscLogStagePush() > and > > > PetscLogStagePop(). > > > %T - percent time in this phase %F - percent flops in > this > > > phase > > > %M - percent messages in this phase %L - percent message > lengths > > > in this phase > > > %R - percent reductions in this phase > > > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time > > > over all processors) > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > ########################################################## > > > # # > > > # WARNING!!! # > > > # # > > > # This code was compiled with a debugging option, # > > > # To get timing results run config/configure.py # > > > # using --with-debugging=no, the performance will # > > > # be generally two or three times faster. # > > > # # > > > ########################################################## > > > > > > > > > > > > > > > ########################################################## > > > # # > > > # WARNING!!! # > > > # # > > > # This code was run without the PreLoadBegin() # > > > # macros. To get timing results we always recommend # > > > # preloading. otherwise timing numbers may be # > > > # meaningless. 
# > > > ########################################################## > > > > > > > > > Event Count Time (sec) > > > Flops/sec --- Global --- --- Stage --- > Total > > > Max Ratio Max Ratio Max Ratio Mess Avg > len > > > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > --- Event Stage 0: Main Stage > > > > > > MatMult 915 1.0 4.4291e+01 1.3 1.50e+07 1.3 5.5e+03 > 4.0e+03 > > > 0.0e+00 18 11100100 0 18 11100100 0 46 > > > MatSolve 915 1.0 1.5684e+01 1.1 3.56e+07 1.1 0.0e+00 > 0.0e+00 > > > 0.0e+00 7 11 0 0 0 7 11 0 0 0 131 > > > MatLUFactorNum 1 1.0 5.1654e-02 1.4 1.48e+07 1.4 0.0e+00 > 0.0e+00 > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 43 > > > MatILUFactorSym 1 1.0 1.6838e-02 1.1 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatAssemblyBegin 1 1.0 3.2428e-01 1.6 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatAssemblyEnd 1 1.0 1.3120e+00 1.1 0.00e+00 0.0 6.0e+00 > 2.0e+03 > > > 1.3e+01 1 0 0 0 0 1 0 0 0 0 0 > > > MatGetOrdering 1 1.0 4.1590e-03 1.2 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecMDot 885 1.0 8.5091e+01 1.1 2.27e+07 1.1 0.0e+00 > 0.0e+00 > > > 8.8e+02 36 36 0 0 31 36 36 0 0 31 80 > > > VecNorm 916 1.0 6.6747e+01 1.1 1.81e+06 1.1 0.0e+00 > 0.0e+00 > > > 9.2e+02 29 2 0 0 32 29 2 0 0 32 7 > > > VecScale 915 1.0 1.1430e+00 2.2 1.12e+08 2.2 0.0e+00 > 0.0e+00 > > > 0.0e+00 0 1 0 0 0 0 1 0 0 0 200 > > > VecCopy 30 1.0 1.2816e-01 5.7 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecSet 947 1.0 7.8979e-01 1.3 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecAXPY 60 1.0 5.5332e-02 1.1 1.51e+08 1.1 0.0e+00 > 0.0e+00 > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 542 > > > VecMAXPY 915 1.0 1.5004e+01 1.3 1.54e+08 1.3 0.0e+00 > 0.0e+00 > > > 0.0e+00 6 38 0 0 0 6 38 0 0 0 483 > > > VecScatterBegin 915 1.0 9.0358e-02 1.4 0.00e+00 0.0 5.5e+03 > 4.0e+03 > > > 0.0e+00 0 0100100 0 0 0100100 0 0 > > > VecScatterEnd 915 1.0 3.5136e+01 1.4 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > 0.0e+00 14 0 0 0 0 14 0 0 0 0 0 > > > VecNormalize 915 1.0 6.7272e+01 1.0 2.68e+06 1.0 0.0e+00 > 0.0e+00 > > > 9.2e+02 30 4 0 0 32 30 4 0 0 32 10 > > > KSPGMRESOrthog 885 1.0 9.8478e+01 1.1 3.87e+07 1.1 0.0e+00 > 0.0e+00 > > > 8.8e+02 42 72 0 0 31 42 72 0 0 31 138 > > > KSPSetup 2 1.0 6.1918e-01 1.2 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > 1.0e+01 0 0 0 0 0 0 0 0 0 0 0 > > > KSPSolve 1 1.0 2.1892e+02 1.0 2.15e+07 1.0 5.5e+03 > 4.0e+03 > > > 2.8e+03 99100100100 99 99100100100 99 86 > > > PCSetUp 2 1.0 7.3292e-02 1.3 9.84e+06 1.3 0.0e+00 > 0.0e+00 > > > 6.0e+00 0 0 0 0 0 0 0 0 0 0 30 > > > PCSetUpOnBlocks 1 1.0 7.2706e-02 1.3 9.97e+06 1.3 0.0e+00 > 0.0e+00 > > > 4.0e+00 0 0 0 0 0 0 0 0 0 0 31 > > > PCApply 915 1.0 1.6508e+01 1.1 3.27e+07 1.1 0.0e+00 > 0.0e+00 > > > 9.2e+02 7 11 0 0 32 7 11 0 0 32 124 > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > > > > Memory usage is given in bytes: > > > > > > Object Type Creations Destructions Memory Descendants' > Mem. 
> > > > > > --- Event Stage 0: Main Stage > > > > > > Matrix 4 4 252008 0 > > > Index Set 5 5 753096 0 > > > Vec 41 41 18519984 0 > > > Vec Scatter 1 1 0 0 > > > Krylov Solver 2 2 16880 0 > > > Preconditioner 2 2 196 0 > > > > ======================================================================================================================== > > > > > > Average time to get PetscTime(): 1.09673e-06 > > > Average time for MPI_Barrier(): 4.18186e-05 > > > Average time for zero size MPI_Send(): 2.62856e-05 > > > OptionTable: -log_summary > > > Compiled without FORTRAN kernels > > > Compiled with full precision matrices (default) > > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 > > > sizeof(PetscScalar) 8 > > > Configure run at: Thu Jan 18 12:23:31 2007 > > > Configure options: --with-vendor-compilers=intel --with-x=0 > --with-shared > > > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 > > > --with-mpi-dir=/opt/mpich/myrinet/intel/ > > > ----------------------------------------- > > > Libraries compiled on Thu Jan 18 12:24:41 SGT 2007 on > atlas1.nus.edu.sg > > > Machine characteristics: Linux atlas1.nus.edu.sg 2.4.21-20.ELsmp #1 > SMP > > > Wed Sep 8 17:29:34 GMT 2004 i686 i686 i386 GNU/Linux > > > Using PETSc directory: /nas/lsftmp/g0306332/petsc-2.3.2-p8 > > > Using PETSc arch: linux-mpif90 > > > ----------------------------------------- > > > Using C compiler: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g > > > Using Fortran compiler: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC > -g > > > -w90 -w > > > ----------------------------------------- > > > Using include paths: -I/nas/lsftmp/g0306332/petsc- > > > 2.3.2-p8-I/nas/lsftmp/g0306332/petsc- > > > 2.3.2-p8/bmake/linux-mpif90 -I/nas/lsftmp/g0306332/petsc-2.3.2-p8 > /include > > > -I/opt/mpich/myrinet/intel/include > > > ------------------------------------------ > > > Using C linker: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g > > > Using Fortran linker: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g > > > -w90 -w > > > Using libraries: > > > -Wl,-rpath,/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 > > > -L/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 -lpetscts > > > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > > > -Wl,-rpath,/lsftmp/g0306332/inter/mkl/lib/32 > > > -L/lsftmp/g0306332/inter/mkl/lib/32 -lmkl_lapack -lmkl_ia32 -lguide > > > -lPEPCF90 -Wl,-rpath,/opt/intel/compiler70/ia32/lib > > > -Wl,-rpath,/opt/mpich/myrinet/intel/lib -L/opt/mpich/myrinet/intel/lib > > > -Wl,-rpath,-rpath -Wl,-rpath,-ldl -L-ldl -lmpich -Wl,-rpath,-L -lgm > > > -lpthread -Wl,-rpath,/opt/intel/compiler70/ia32/lib > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib > -L/opt/intel/compiler70/ia32/lib > > > -Wl,-rpath,/usr/lib -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts > -lcxa > > > -lunwind -ldl -lmpichf90 -Wl,-rpath,/opt/gm/lib -L/opt/gm/lib > -lPEPCF90 > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib > -L/opt/intel/compiler70/ia32/lib > > > -Wl,-rpath,/usr/lib -L/usr/lib -lintrins -lIEPCF90 -lF90 > -lm -Wl,-rpath,\ > > > -Wl,-rpath,\ -L\ -ldl -lmpich -Wl,-rpath,\ -L\ -lgm -lpthread > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib > -L/opt/intel/compiler70/ia32/lib > > > -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa -lunwind -ldl > > > ------------------------------------------ > > > > > > So is there something wrong with the server's mpi implementation? > > > > > > Thank you. 
> > > > > > > > > > > > On 2/10/07, Satish Balay wrote: > > > > > > > > Looks like MatMult = 24sec Out of this the scatter time is: 22sec. > > > > Either something is wrong with your run - or MPI is really broken.. > > > > > > > > Satish > > > > > > > > > > > MatMult 3927 1.0 2.4071e+01 1.3 6.14e+06 1.4 > 2.4e+04 > > > > 1.3e+03 > > > > > > > VecScatterBegin 3927 1.0 2.8672e-01 3.9 0.00e+00 0.0 > 2.4e+04 > > > > 1.3e+03 > > > > > > > VecScatterEnd 3927 1.0 2.2135e+01 1.5 0.00e+00 0.0 > 0.0e+00 > > > > 0.0e+00 > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at mcs.anl.gov Sat Feb 10 21:26:07 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Sat, 10 Feb 2007 21:26:07 -0600 (CST) Subject: understanding the output from -info In-Reply-To: <804ab5d40702101702s71c974d7u39a97d6ab8058cf4@mail.gmail.com> References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702090816qb6d1325g1d311a0eb53eec26@mail.gmail.com> <804ab5d40702091651h6265a510jf5d4ca46cd526876@mail.gmail.com> <804ab5d40702100028sf595a2apae8aba2fda9251f3@mail.gmail.com> <804ab5d40702100117i5977f5bh9b161c026f16a32a@mail.gmail.com> <804ab5d40702101702s71c974d7u39a97d6ab8058cf4@mail.gmail.com> Message-ID: My recommendation is just to try to optimize sequential runs by using the most appropriate solver algorithms, the best sequential processor with the fastest memory and slickest code. Parallel computing is to solve big problems, not to solve little problems fast. (anything less then 100k unknowns or even more is in my opinion is small). Barry On Sun, 11 Feb 2007, Ben Tay wrote: > Hi, > > In other words, for my CFD code, it is not possible to parallelize it > effectively because the problem is too small? > > Is these true for all parallel solver, or just PETSc? I was hoping to reduce > the runtime since mine is an unsteady problem which requires many steps to > reach a periodic state and it takes many hours to reach it. > > Lastly, if I'm running on 2 processors, will there be improvement likely? > > Thank you. > > > On 2/11/07, Barry Smith wrote: > > > > > > > > On Sat, 10 Feb 2007, Ben Tay wrote: > > > > > Hi, > > > > > > I've repeated the test with n,m = 800. Now serial takes around 11mins > > while > > > parallel with 4 processors took 6mins. Does it mean that the problem > > must be > > > pretty large before it is more superior to use parallel? Moreover > > 800x800 > > > means there's 640000 unknowns. My problem is a 2D CFD code which > > typically > > > has 200x80=16000 unknowns. Does it mean that I won't be able to benefit > > from > > ^^^^^^^^^^^ > > You'll never get much performance past 2 processors; its not even worth > > all the work of having a parallel code in this case. I'd just optimize the > > heck out of the serial code. > > > > Barry > > > > > > > > > running in parallel? 
> > > > > > Btw, this is the parallel's log_summary: > > > > > > > > > Event Count Time (sec) > > > Flops/sec --- Global --- --- Stage --- Total > > > Max Ratio Max Ratio Max Ratio Mess Avg len > > > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > --- Event Stage 0: Main Stage > > > > > > MatMult 1265 1.0 7.0615e+01 1.2 3.22e+07 1.2 7.6e+03 6.4e+03 > > > 0.0e+00 16 11100100 0 16 11100100 0 103 > > > MatSolve 1265 1.0 4.7820e+01 1.2 4.60e+07 1.2 0.0e+00 0.0e+00 > > > 0.0e+00 11 11 0 0 0 11 11 0 0 0 152 > > > MatLUFactorNum 1 1.0 2.5703e-01 2.3 1.27e+07 2.3 0.0e+00 0.0e+00 > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 22 > > > MatILUFactorSym 1 1.0 1.8933e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatAssemblyBegin 1 1.0 4.2153e-01 3.5 0.00e+00 0.0 0.0e+00 0.0e+00 > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatAssemblyEnd 1 1.0 3.6475e-01 1.5 0.00e+00 0.0 6.0e+00 3.2e+03 > > > 1.3e+01 0 0 0 0 0 0 0 0 0 0 0 > > > MatGetOrdering 1 1.0 1.2088e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecMDot 1224 1.0 1.5314e+02 1.2 4.63e+07 1.2 0.0e+00 0.0e+00 > > > 1.2e+03 36 36 0 0 31 36 36 0 0 31 158 > > > VecNorm 1266 1.0 1.0215e+02 1.1 4.31e+06 1.1 0.0e+00 0.0e+00 > > > 1.3e+03 24 2 0 0 33 24 2 0 0 33 16 > > > VecScale 1265 1.0 3.7467e+00 1.5 8.34e+07 1.5 0.0e+00 0.0e+00 > > > 0.0e+00 1 1 0 0 0 1 1 0 0 0 216 > > > VecCopy 41 1.0 2.5530e-01 2.8 0.00e+00 0.0 0.0e+00 0.0e+00 > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecSet 1308 1.0 3.2717e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 > > > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > > VecAXPY 82 1.0 5.3338e-01 2.8 1.40e+08 2.8 0.0e+00 0.0e+00 > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 197 > > > VecMAXPY 1265 1.0 4.6234e+01 1.2 1.74e+08 1.2 0.0e+00 0.0e+00 > > > 0.0e+00 10 38 0 0 0 10 38 0 0 0 557 > > > VecScatterBegin 1265 1.0 1.5684e-01 1.6 0.00e+00 0.0 7.6e+03 6.4e+03 > > > 0.0e+00 0 0100100 0 0 0100100 0 0 > > > VecScatterEnd 1265 1.0 4.3167e+01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > > > 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 > > > VecNormalize 1265 1.0 1.0459e+02 1.1 6.21e+06 1.1 0.0e+00 0.0e+00 > > > 1.3e+03 25 4 0 0 32 25 4 0 0 32 23 > > > KSPGMRESOrthog 1224 1.0 1.9035e+02 1.1 7.00e+07 1.1 0.0e+00 0.0e+00 > > > 1.2e+03 45 72 0 0 31 45 72 0 0 31 254 > > > KSPSetup 2 1.0 5.1674e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > > > 1.0e+01 0 0 0 0 0 0 0 0 0 0 0 > > > KSPSolve 1 1.0 4.0269e+02 1.0 4.16e+07 1.0 7.6e+03 6.4e+03 > > > 3.9e+03 99100100100 99 99100100100 99 166 > > > PCSetUp 2 1.0 4.5924e-01 2.6 8.23e+06 2.6 0.0e+00 0.0e+00 > > > 6.0e+00 0 0 0 0 0 0 0 0 0 0 12 > > > PCSetUpOnBlocks 1 1.0 4.5847e-01 2.6 8.26e+06 2.6 0.0e+00 0.0e+00 > > > 4.0e+00 0 0 0 0 0 0 0 0 0 0 13 > > > PCApply 1265 1.0 5.0990e+01 1.2 4.33e+07 1.2 0.0e+00 0.0e+00 > > > 1.3e+03 12 11 0 0 32 12 11 0 0 32 143 > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > Memory usage is given in bytes: > > > > > > Object Type Creations Destructions Memory Descendants' > > Mem. 
> > > > > > --- Event Stage 0: Main Stage > > > > > > Matrix 4 4 643208 0 > > > Index Set 5 5 1924296 0 > > > Vec 41 41 47379984 0 > > > Vec Scatter 1 1 0 0 > > > Krylov Solver 2 2 16880 0 > > > Preconditioner 2 2 196 0 > > > > > ======================================================================================================================== > > > Average time to get PetscTime(): 1.00136e-06 > > > Average time for MPI_Barrier(): 4.00066e-05 > > > Average time for zero size MPI_Send(): 1.70469e-05 > > > OptionTable: -log_summary > > > Compiled without FORTRAN kernels > > > Compiled with full precision matrices (default) > > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 > > > sizeof(PetscScalar) 8 > > > Configure run at: Thu Jan 18 12:23:31 2007 > > > Configure options: --with-vendor-compilers=intel --with-x=0 > > --with-shared > > > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 > > > --with-mpi-dir=/opt/mpich/myrinet/intel/ > > > ----------------------------------------- > > > > > > > > > > > > > > > > > > > > > > > > On 2/10/07, Ben Tay wrote: > > > > > > > > Hi, > > > > > > > > I tried to use ex2f.F as a test code. I've changed the number n,m from > > 3 > > > > to 500 each. I ran the code using 1 processor and then with 4 > > processor. I > > > > then repeat the same with the following modification: > > > > > > > > > > > > do i=1,10 > > > > > > > > call KSPSolve(ksp,b,x,ierr) > > > > > > > > end do > > > > I've added to do loop to make the solving repeat 10 times. > > > > > > > > In both cases, the serial code is faster, e.g. 1 taking 2.4 min while > > the > > > > other 3.3 min. > > > > > > > > Here's the log_summary: > > > > > > > > > > > > ---------------------------------------------- PETSc Performance > > Summary: > > > > ---------------------------------------------- > > > > > > > > ./ex2f on a linux-mpi named atlas12.nus.edu.sg with 4 processors, by > > > > g0306332 Sat Feb 10 16:21:36 2007 > > > > Using Petsc Release Version 2.3.2, Patch 8, Tue Jan 2 14:33:59 PST > > 2007 > > > > HG revision: ebeddcedcc065e32fc252af32cf1d01ed4fc7a80 > > > > > > > > Max Max/Min Avg Total > > > > Time (sec): 2.213e+02 1.00051 2.212e+02 > > > > Objects: 5.500e+01 1.00000 5.500e+01 > > > > Flops: 4.718e+09 1.00019 4.718e+09 1.887e+10 > > > > Flops/sec: 2.134e+07 1.00070 2.133e+07 8.531e+07 > > > > > > > > Memory: 3.186e+07 1.00069 1.274e+08 > > > > MPI Messages: 1.832e+03 2.00000 1.374e+03 5.496e+03 > > > > MPI Message Lengths: 7.324e+06 2.00000 3.998e+03 2.197e+07 > > > > MPI Reductions: 7.112e+02 1.00000 > > > > > > > > Flop counting convention: 1 flop = 1 real number operation of type > > > > (multiply/divide/add/subtract) > > > > e.g., VecAXPY() for real vectors of length > > N > > > > --> 2N flops > > > > and VecAXPY() for complex vectors of > > length N > > > > --> 8N flops > > > > > > > > Summary of Stages: ----- Time ------ ----- Flops ----- --- > > Messages > > > > --- -- Message Lengths -- -- Reductions -- > > > > Avg %Total Avg %Total counts > > > > %Total Avg %Total counts %Total > > > > 0: Main Stage: 2.2120e+02 100.0% 1.8871e+10 100.0% 5.496e+03 > > > > 100.0% 3.998e+03 100.0% 2.845e+03 100.0% > > > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > See the 'Profiling' chapter of the users' manual for details on > > > > interpreting output. 
> > > > Phase summary info: > > > > Count: number of times phase was executed > > > > Time and Flops/sec: Max - maximum over all processors > > > > Ratio - ratio of maximum to minimum over all > > > > processors > > > > Mess: number of messages sent > > > > Avg. len: average message length > > > > Reduct: number of global reductions > > > > Global: entire computation > > > > Stage: stages of a computation. Set stages with PetscLogStagePush() > > and > > > > PetscLogStagePop(). > > > > %T - percent time in this phase %F - percent flops in > > this > > > > phase > > > > %M - percent messages in this phase %L - percent message > > lengths > > > > in this phase > > > > %R - percent reductions in this phase > > > > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time > > > > over all processors) > > > > > > > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > > > ########################################################## > > > > # # > > > > # WARNING!!! # > > > > # # > > > > # This code was compiled with a debugging option, # > > > > # To get timing results run config/configure.py # > > > > # using --with-debugging=no, the performance will # > > > > # be generally two or three times faster. # > > > > # # > > > > ########################################################## > > > > > > > > > > > > > > > > > > > > ########################################################## > > > > # # > > > > # WARNING!!! # > > > > # # > > > > # This code was run without the PreLoadBegin() # > > > > # macros. To get timing results we always recommend # > > > > # preloading. otherwise timing numbers may be # > > > > # meaningless. # > > > > ########################################################## > > > > > > > > > > > > Event Count Time (sec) > > > > Flops/sec --- Global --- --- Stage --- > > Total > > > > Max Ratio Max Ratio Max Ratio Mess Avg > > len > > > > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > > > > > > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > > > --- Event Stage 0: Main Stage > > > > > > > > MatMult 915 1.0 4.4291e+01 1.3 1.50e+07 1.3 5.5e+03 > > 4.0e+03 > > > > 0.0e+00 18 11100100 0 18 11100100 0 46 > > > > MatSolve 915 1.0 1.5684e+01 1.1 3.56e+07 1.1 0.0e+00 > > 0.0e+00 > > > > 0.0e+00 7 11 0 0 0 7 11 0 0 0 131 > > > > MatLUFactorNum 1 1.0 5.1654e-02 1.4 1.48e+07 1.4 0.0e+00 > > 0.0e+00 > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 43 > > > > MatILUFactorSym 1 1.0 1.6838e-02 1.1 0.00e+00 0.0 0.0e+00 > > 0.0e+00 > > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > MatAssemblyBegin 1 1.0 3.2428e-01 1.6 0.00e+00 0.0 0.0e+00 > > 0.0e+00 > > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > MatAssemblyEnd 1 1.0 1.3120e+00 1.1 0.00e+00 0.0 6.0e+00 > > 2.0e+03 > > > > 1.3e+01 1 0 0 0 0 1 0 0 0 0 0 > > > > MatGetOrdering 1 1.0 4.1590e-03 1.2 0.00e+00 0.0 0.0e+00 > > 0.0e+00 > > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > VecMDot 885 1.0 8.5091e+01 1.1 2.27e+07 1.1 0.0e+00 > > 0.0e+00 > > > > 8.8e+02 36 36 0 0 31 36 36 0 0 31 80 > > > > VecNorm 916 1.0 6.6747e+01 1.1 1.81e+06 1.1 0.0e+00 > > 0.0e+00 > > > > 9.2e+02 29 2 0 0 32 29 2 0 0 32 7 > > > > VecScale 915 1.0 1.1430e+00 2.2 1.12e+08 2.2 0.0e+00 > > 0.0e+00 > > > > 0.0e+00 0 1 0 0 0 0 1 0 0 0 200 > > > > VecCopy 30 1.0 1.2816e-01 5.7 0.00e+00 0.0 0.0e+00 > > 0.0e+00 > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > VecSet 947 1.0 7.8979e-01 1.3 0.00e+00 0.0 
0.0e+00 > > 0.0e+00 > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > VecAXPY 60 1.0 5.5332e-02 1.1 1.51e+08 1.1 0.0e+00 > > 0.0e+00 > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 542 > > > > VecMAXPY 915 1.0 1.5004e+01 1.3 1.54e+08 1.3 0.0e+00 > > 0.0e+00 > > > > 0.0e+00 6 38 0 0 0 6 38 0 0 0 483 > > > > VecScatterBegin 915 1.0 9.0358e-02 1.4 0.00e+00 0.0 5.5e+03 > > 4.0e+03 > > > > 0.0e+00 0 0100100 0 0 0100100 0 0 > > > > VecScatterEnd 915 1.0 3.5136e+01 1.4 0.00e+00 0.0 0.0e+00 > > 0.0e+00 > > > > 0.0e+00 14 0 0 0 0 14 0 0 0 0 0 > > > > VecNormalize 915 1.0 6.7272e+01 1.0 2.68e+06 1.0 0.0e+00 > > 0.0e+00 > > > > 9.2e+02 30 4 0 0 32 30 4 0 0 32 10 > > > > KSPGMRESOrthog 885 1.0 9.8478e+01 1.1 3.87e+07 1.1 0.0e+00 > > 0.0e+00 > > > > 8.8e+02 42 72 0 0 31 42 72 0 0 31 138 > > > > KSPSetup 2 1.0 6.1918e-01 1.2 0.00e+00 0.0 0.0e+00 > > 0.0e+00 > > > > 1.0e+01 0 0 0 0 0 0 0 0 0 0 0 > > > > KSPSolve 1 1.0 2.1892e+02 1.0 2.15e+07 1.0 5.5e+03 > > 4.0e+03 > > > > 2.8e+03 99100100100 99 99100100100 99 86 > > > > PCSetUp 2 1.0 7.3292e-02 1.3 9.84e+06 1.3 0.0e+00 > > 0.0e+00 > > > > 6.0e+00 0 0 0 0 0 0 0 0 0 0 30 > > > > PCSetUpOnBlocks 1 1.0 7.2706e-02 1.3 9.97e+06 1.3 0.0e+00 > > 0.0e+00 > > > > 4.0e+00 0 0 0 0 0 0 0 0 0 0 31 > > > > PCApply 915 1.0 1.6508e+01 1.1 3.27e+07 1.1 0.0e+00 > > 0.0e+00 > > > > 9.2e+02 7 11 0 0 32 7 11 0 0 32 124 > > > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > > > > > > > Memory usage is given in bytes: > > > > > > > > Object Type Creations Destructions Memory Descendants' > > Mem. > > > > > > > > --- Event Stage 0: Main Stage > > > > > > > > Matrix 4 4 252008 0 > > > > Index Set 5 5 753096 0 > > > > Vec 41 41 18519984 0 > > > > Vec Scatter 1 1 0 0 > > > > Krylov Solver 2 2 16880 0 > > > > Preconditioner 2 2 196 0 > > > > > > ======================================================================================================================== > > > > > > > > Average time to get PetscTime(): 1.09673e-06 > > > > Average time for MPI_Barrier(): 4.18186e-05 > > > > Average time for zero size MPI_Send(): 2.62856e-05 > > > > OptionTable: -log_summary > > > > Compiled without FORTRAN kernels > > > > Compiled with full precision matrices (default) > > > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 > > > > sizeof(PetscScalar) 8 > > > > Configure run at: Thu Jan 18 12:23:31 2007 > > > > Configure options: --with-vendor-compilers=intel --with-x=0 > > --with-shared > > > > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 > > > > --with-mpi-dir=/opt/mpich/myrinet/intel/ > > > > ----------------------------------------- > > > > Libraries compiled on Thu Jan 18 12:24:41 SGT 2007 on > > atlas1.nus.edu.sg > > > > Machine characteristics: Linux atlas1.nus.edu.sg 2.4.21-20.ELsmp #1 > > SMP > > > > Wed Sep 8 17:29:34 GMT 2004 i686 i686 i386 GNU/Linux > > > > Using PETSc directory: /nas/lsftmp/g0306332/petsc-2.3.2-p8 > > > > Using PETSc arch: linux-mpif90 > > > > ----------------------------------------- > > > > Using C compiler: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g > > > > Using Fortran compiler: /opt/mpich/myrinet/intel/bin/mpif90 -I. 
-fPIC > > -g > > > > -w90 -w > > > > ----------------------------------------- > > > > Using include paths: -I/nas/lsftmp/g0306332/petsc- > > > > 2.3.2-p8-I/nas/lsftmp/g0306332/petsc- > > > > 2.3.2-p8/bmake/linux-mpif90 -I/nas/lsftmp/g0306332/petsc-2.3.2-p8 > > /include > > > > -I/opt/mpich/myrinet/intel/include > > > > ------------------------------------------ > > > > Using C linker: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g > > > > Using Fortran linker: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g > > > > -w90 -w > > > > Using libraries: > > > > -Wl,-rpath,/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 > > > > -L/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 -lpetscts > > > > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > > > > -Wl,-rpath,/lsftmp/g0306332/inter/mkl/lib/32 > > > > -L/lsftmp/g0306332/inter/mkl/lib/32 -lmkl_lapack -lmkl_ia32 -lguide > > > > -lPEPCF90 -Wl,-rpath,/opt/intel/compiler70/ia32/lib > > > > -Wl,-rpath,/opt/mpich/myrinet/intel/lib -L/opt/mpich/myrinet/intel/lib > > > > -Wl,-rpath,-rpath -Wl,-rpath,-ldl -L-ldl -lmpich -Wl,-rpath,-L -lgm > > > > -lpthread -Wl,-rpath,/opt/intel/compiler70/ia32/lib > > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib > > -L/opt/intel/compiler70/ia32/lib > > > > -Wl,-rpath,/usr/lib -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts > > -lcxa > > > > -lunwind -ldl -lmpichf90 -Wl,-rpath,/opt/gm/lib -L/opt/gm/lib > > -lPEPCF90 > > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib > > -L/opt/intel/compiler70/ia32/lib > > > > -Wl,-rpath,/usr/lib -L/usr/lib -lintrins -lIEPCF90 -lF90 > > -lm -Wl,-rpath,\ > > > > -Wl,-rpath,\ -L\ -ldl -lmpich -Wl,-rpath,\ -L\ -lgm -lpthread > > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib > > -L/opt/intel/compiler70/ia32/lib > > > > -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa -lunwind -ldl > > > > ------------------------------------------ > > > > > > > > So is there something wrong with the server's mpi implementation? > > > > > > > > Thank you. > > > > > > > > > > > > > > > > On 2/10/07, Satish Balay wrote: > > > > > > > > > > Looks like MatMult = 24sec Out of this the scatter time is: 22sec. > > > > > Either something is wrong with your run - or MPI is really broken.. > > > > > > > > > > Satish > > > > > > > > > > > > > MatMult 3927 1.0 2.4071e+01 1.3 6.14e+06 1.4 > > 2.4e+04 > > > > > 1.3e+03 > > > > > > > > VecScatterBegin 3927 1.0 2.8672e-01 3.9 0.00e+00 0.0 > > 2.4e+04 > > > > > 1.3e+03 > > > > > > > > VecScatterEnd 3927 1.0 2.2135e+01 1.5 0.00e+00 0.0 > > 0.0e+00 > > > > > 0.0e+00 > > > > > > > > > > > > > > > > > > > > > > From dalcinl at gmail.com Sat Feb 10 22:03:20 2007 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Sun, 11 Feb 2007 01:03:20 -0300 Subject: understanding the output from -info In-Reply-To: <804ab5d40702101702s71c974d7u39a97d6ab8058cf4@mail.gmail.com> References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702091651h6265a510jf5d4ca46cd526876@mail.gmail.com> <804ab5d40702100028sf595a2apae8aba2fda9251f3@mail.gmail.com> <804ab5d40702100117i5977f5bh9b161c026f16a32a@mail.gmail.com> <804ab5d40702101702s71c974d7u39a97d6ab8058cf4@mail.gmail.com> Message-ID: On 2/10/07, Ben Tay wrote: > In other words, for my CFD code, it is not possible to parallelize it > effectively because the problem is too small? > > Is these true for all parallel solver, or just PETSc? I was hoping to reduce > the runtime since mine is an unsteady problem which requires many steps to > reach a periodic state and it takes many hours to reach it. 
Can you describe your specific application and how are you solving it? As Barry said, your need-for-speed is not likely to be solved by running in parallel. -- Lisandro Dalc?n --------------- Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) PTLC - G?emes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594 From zonexo at gmail.com Sat Feb 10 23:41:31 2007 From: zonexo at gmail.com (Ben Tay) Date: Sun, 11 Feb 2007 13:41:31 +0800 Subject: understanding the output from -info In-Reply-To: References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702091651h6265a510jf5d4ca46cd526876@mail.gmail.com> <804ab5d40702100028sf595a2apae8aba2fda9251f3@mail.gmail.com> <804ab5d40702100117i5977f5bh9b161c026f16a32a@mail.gmail.com> <804ab5d40702101702s71c974d7u39a97d6ab8058cf4@mail.gmail.com> Message-ID: <804ab5d40702102141s258ef22due0093263f83dc7bb@mail.gmail.com> Well, I am simulating unsteady flow past a moving airfoil at Re~10^4. I'm using fractional step FVM, which means that I need to solve a momentum and poisson equation. To reach a periodic state takes quite a few hours and so I'm trying to find ways to speed up the process. I thought parallelizing the code would help but it seems like it's not the case. I'm now trying out different types of solver/preconditioner available on PETSc to assess their performance. Is there other external solvers, which PETSc interfaces, which are recommended? I'm thinking of using multigrid to solve the poisson eqn... wonder if hypre/BoomerAMG etc would help... On 2/11/07, Lisandro Dalcin wrote: > > On 2/10/07, Ben Tay wrote: > > In other words, for my CFD code, it is not possible to parallelize it > > effectively because the problem is too small? > > > > Is these true for all parallel solver, or just PETSc? I was hoping to > reduce > > the runtime since mine is an unsteady problem which requires many steps > to > > reach a periodic state and it takes many hours to reach it. > > Can you describe your specific application and how are you solving it? > As Barry said, your need-for-speed is not likely to be solved by > running in parallel. > > > -- > Lisandro Dalc?n > --------------- > Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) > Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > Tel/Fax: +54-(0)342-451.1594 > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at mcs.anl.gov Sun Feb 11 10:42:11 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Sun, 11 Feb 2007 10:42:11 -0600 (CST) Subject: understanding the output from -info In-Reply-To: <804ab5d40702102141s258ef22due0093263f83dc7bb@mail.gmail.com> References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702091651h6265a510jf5d4ca46cd526876@mail.gmail.com> <804ab5d40702100028sf595a2apae8aba2fda9251f3@mail.gmail.com> <804ab5d40702100117i5977f5bh9b161c026f16a32a@mail.gmail.com> <804ab5d40702101702s71c974d7u39a97d6ab8058cf4@mail.gmail.com> <804ab5d40702102141s258ef22due0093263f83dc7bb@mail.gmail.com> Message-ID: hypre/boomeramg may be the way to go, especially for the Poisson problem. -pc_type hypre -pc_hypre_type boomeramg (-help for lots of tuning options.). 
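For example, assuming the code already calls KSPSetFromOptions() (the executable name and process count below are only placeholders), a run along these lines selects BoomerAMG at runtime and prints the usual performance summary:

    petscmpirun -n 4 ./a.out -pc_type hypre -pc_hypre_type boomeramg -log_summary
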
Barry On Sun, 11 Feb 2007, Ben Tay wrote: > Well, > > I am simulating unsteady flow past a moving airfoil at Re~10^4. I'm using > fractional step FVM, which means that I need to solve a momentum and poisson > equation. > > To reach a periodic state takes quite a few hours and so I'm trying to find > ways to speed up the process. I thought parallelizing the code would help > but it seems like it's not the case. > > I'm now trying out different types of solver/preconditioner available on > PETSc to assess their performance. Is there other external solvers, which > PETSc interfaces, which are recommended? I'm thinking of using multigrid to > solve the poisson eqn... wonder if hypre/BoomerAMG etc would help... > > > On 2/11/07, Lisandro Dalcin wrote: > > > > On 2/10/07, Ben Tay wrote: > > > In other words, for my CFD code, it is not possible to parallelize it > > > effectively because the problem is too small? > > > > > > Is these true for all parallel solver, or just PETSc? I was hoping to > > reduce > > > the runtime since mine is an unsteady problem which requires many steps > > to > > > reach a periodic state and it takes many hours to reach it. > > > > Can you describe your specific application and how are you solving it? > > As Barry said, your need-for-speed is not likely to be solved by > > running in parallel. > > > > > > -- > > Lisandro Dalc??n > > --------------- > > Centro Internacional de M??todos Computacionales en Ingenier??a (CIMEC) > > Instituto de Desarrollo Tecnol??gico para la Industria Qu??mica (INTEC) > > Consejo Nacional de Investigaciones Cient??ficas y T??cnicas (CONICET) > > PTLC - G??emes 3450, (3000) Santa Fe, Argentina > > Tel/Fax: +54-(0)342-451.1594 > > > > > From zonexo at gmail.com Sun Feb 11 18:26:26 2007 From: zonexo at gmail.com (Ben Tay) Date: Mon, 12 Feb 2007 08:26:26 +0800 Subject: understanding the output from -info In-Reply-To: References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702100028sf595a2apae8aba2fda9251f3@mail.gmail.com> <804ab5d40702100117i5977f5bh9b161c026f16a32a@mail.gmail.com> <804ab5d40702101702s71c974d7u39a97d6ab8058cf4@mail.gmail.com> <804ab5d40702102141s258ef22due0093263f83dc7bb@mail.gmail.com> Message-ID: <804ab5d40702111626p2cbbf495ma954bcda1e3b75e8@mail.gmail.com> Hi, I have some questions regarding the use of hypre/boomeramg: 1. Is there anything I need to change in the assembly of matrix etc besides adding -pc_type hypre -pc_hypre_type boomeramg ? 2. Can it work in a sequential code? 3. I have 2 eqns to solve - momentum and poisson. if I used the options, will both equations be solved using hypre? Can I select which solver to solve with which equation? Thank you. On 2/12/07, Barry Smith wrote: > > > hypre/boomeramg may be the way to go, especially for the Poisson > problem. -pc_type hypre -pc_hypre_type boomeramg (-help for lots of > tuning options.). > > Barry > > > On Sun, 11 Feb 2007, Ben Tay wrote: > > > Well, > > > > I am simulating unsteady flow past a moving airfoil at Re~10^4. I'm > using > > fractional step FVM, which means that I need to solve a momentum and > poisson > > equation. > > > > To reach a periodic state takes quite a few hours and so I'm trying to > find > > ways to speed up the process. I thought parallelizing the code would > help > > but it seems like it's not the case. > > > > I'm now trying out different types of solver/preconditioner available on > > PETSc to assess their performance. Is there other external solvers, > which > > PETSc interfaces, which are recommended? 
I'm thinking of using multigrid > to > > solve the poisson eqn... wonder if hypre/BoomerAMG etc would help... > > > > > > On 2/11/07, Lisandro Dalcin wrote: > > > > > > On 2/10/07, Ben Tay wrote: > > > > In other words, for my CFD code, it is not possible to parallelize > it > > > > effectively because the problem is too small? > > > > > > > > Is these true for all parallel solver, or just PETSc? I was hoping > to > > > reduce > > > > the runtime since mine is an unsteady problem which requires many > steps > > > to > > > > reach a periodic state and it takes many hours to reach it. > > > > > > Can you describe your specific application and how are you solving it? > > > As Barry said, your need-for-speed is not likely to be solved by > > > running in parallel. > > > > > > > > > -- > > > Lisandro Dalc?n > > > --------------- > > > Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) > > > Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) > > > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) > > > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > > > Tel/Fax: +54-(0)342-451.1594 > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at mcs.anl.gov Sun Feb 11 18:50:16 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Sun, 11 Feb 2007 18:50:16 -0600 (CST) Subject: understanding the output from -info In-Reply-To: <804ab5d40702111626p2cbbf495ma954bcda1e3b75e8@mail.gmail.com> References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702100028sf595a2apae8aba2fda9251f3@mail.gmail.com> <804ab5d40702100117i5977f5bh9b161c026f16a32a@mail.gmail.com> <804ab5d40702101702s71c974d7u39a97d6ab8058cf4@mail.gmail.com> <804ab5d40702102141s258ef22due0093263f83dc7bb@mail.gmail.com> <804ab5d40702111626p2cbbf495ma954bcda1e3b75e8@mail.gmail.com> Message-ID: On Mon, 12 Feb 2007, Ben Tay wrote: > Hi, > > I have some questions regarding the use of hypre/boomeramg: > > 1. Is there anything I need to change in the assembly of matrix etc besides > adding -pc_type hypre -pc_hypre_type boomeramg ? No > > 2. Can it work in a sequential code? yes > > 3. I have 2 eqns to solve - momentum and poisson. if I used the options, > will both equations be solved using hypre? yes > Can I select which solver to > solve with which equation? yes. For each KSP call KSPSetOptionsPrefix() for example KSPSetOptionsPrefix(kspmo,"momentum); KSPSetOptionsPrefix(ksppo,"poisson"); then from the command line use -momentum_ksp_type gmres -poisson_ksp_type cg -momentum_pc_type lusomething etc. For any solver option. Barry > > Thank you. > > > On 2/12/07, Barry Smith wrote: > > > > > > hypre/boomeramg may be the way to go, especially for the Poisson > > problem. -pc_type hypre -pc_hypre_type boomeramg (-help for lots of > > tuning options.). > > > > Barry > > > > > > On Sun, 11 Feb 2007, Ben Tay wrote: > > > > > Well, > > > > > > I am simulating unsteady flow past a moving airfoil at Re~10^4. I'm > > using > > > fractional step FVM, which means that I need to solve a momentum and > > poisson > > > equation. > > > > > > To reach a periodic state takes quite a few hours and so I'm trying to > > find > > > ways to speed up the process. I thought parallelizing the code would > > help > > > but it seems like it's not the case. > > > > > > I'm now trying out different types of solver/preconditioner available on > > > PETSc to assess their performance. 
Is there other external solvers, > > which > > > PETSc interfaces, which are recommended? I'm thinking of using multigrid > > to > > > solve the poisson eqn... wonder if hypre/BoomerAMG etc would help... > > > > > > > > > On 2/11/07, Lisandro Dalcin wrote: > > > > > > > > On 2/10/07, Ben Tay wrote: > > > > > In other words, for my CFD code, it is not possible to parallelize > > it > > > > > effectively because the problem is too small? > > > > > > > > > > Is these true for all parallel solver, or just PETSc? I was hoping > > to > > > > reduce > > > > > the runtime since mine is an unsteady problem which requires many > > steps > > > > to > > > > > reach a periodic state and it takes many hours to reach it. > > > > > > > > Can you describe your specific application and how are you solving it? > > > > As Barry said, your need-for-speed is not likely to be solved by > > > > running in parallel. > > > > > > > > > > > > -- > > > > Lisandro Dalc??n > > > > --------------- > > > > Centro Internacional de M??todos Computacionales en Ingenier??a (CIMEC) > > > > Instituto de Desarrollo Tecnol??gico para la Industria Qu??mica (INTEC) > > > > Consejo Nacional de Investigaciones Cient??ficas y T??cnicas (CONICET) > > > > PTLC - G??emes 3450, (3000) Santa Fe, Argentina > > > > Tel/Fax: +54-(0)342-451.1594 > > > > > > > > > > > > From zonexo at gmail.com Sun Feb 11 21:21:48 2007 From: zonexo at gmail.com (Ben Tay) Date: Mon, 12 Feb 2007 11:21:48 +0800 Subject: understanding the output from -info In-Reply-To: References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702100028sf595a2apae8aba2fda9251f3@mail.gmail.com> <804ab5d40702100117i5977f5bh9b161c026f16a32a@mail.gmail.com> <804ab5d40702101702s71c974d7u39a97d6ab8058cf4@mail.gmail.com> <804ab5d40702102141s258ef22due0093263f83dc7bb@mail.gmail.com> <804ab5d40702111626p2cbbf495ma954bcda1e3b75e8@mail.gmail.com> Message-ID: <804ab5d40702111921q2767248dte540b04e38a71236@mail.gmail.com> Hi, I tried to compile PETSc again and using --download-hypre=1. My command given is ./config/configure.py --with-vendor-compilers=intel --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/ --wit h-x=0 --with-shared --with-mpi-dir=/opt/mpich/myrinet/intel/ --with-debugging=0 --download-hypre=1 I tried twice and the same error msg appears: Downloaded hypre could not be used. Please check install in /nas/lsftmp/g0306332/petsc-2.3.2-p8/externalpackages/hypre-1.11.1b/linux-hypre. I've attached the configure.log for your reference. Thank you. On 2/12/07, Barry Smith wrote: > > > > On Mon, 12 Feb 2007, Ben Tay wrote: > > > Hi, > > > > I have some questions regarding the use of hypre/boomeramg: > > > > 1. Is there anything I need to change in the assembly of matrix etc > besides > > adding -pc_type hypre -pc_hypre_type boomeramg ? > > No > > > > 2. Can it work in a sequential code? > > yes > > > > 3. I have 2 eqns to solve - momentum and poisson. if I used the options, > > will both equations be solved using hypre? > > yes > > > Can I select which solver to > > solve with which equation? > > yes. For each KSP call KSPSetOptionsPrefix() for example > KSPSetOptionsPrefix(kspmo,"momentum); > KSPSetOptionsPrefix(ksppo,"poisson"); > then from the command line use -momentum_ksp_type gmres -poisson_ksp_type > cg > -momentum_pc_type lusomething etc. For any solver option. > > Barry > > > > > Thank you. > > > > > > On 2/12/07, Barry Smith wrote: > > > > > > > > > hypre/boomeramg may be the way to go, especially for the Poisson > > > problem. 
-pc_type hypre -pc_hypre_type boomeramg (-help for lots of > > > tuning options.). > > > > > > Barry > > > > > > > > > On Sun, 11 Feb 2007, Ben Tay wrote: > > > > > > > Well, > > > > > > > > I am simulating unsteady flow past a moving airfoil at Re~10^4. I'm > > > using > > > > fractional step FVM, which means that I need to solve a momentum and > > > poisson > > > > equation. > > > > > > > > To reach a periodic state takes quite a few hours and so I'm trying > to > > > find > > > > ways to speed up the process. I thought parallelizing the code would > > > help > > > > but it seems like it's not the case. > > > > > > > > I'm now trying out different types of solver/preconditioner > available on > > > > PETSc to assess their performance. Is there other external solvers, > > > which > > > > PETSc interfaces, which are recommended? I'm thinking of using > multigrid > > > to > > > > solve the poisson eqn... wonder if hypre/BoomerAMG etc would help... > > > > > > > > > > > > On 2/11/07, Lisandro Dalcin wrote: > > > > > > > > > > On 2/10/07, Ben Tay wrote: > > > > > > In other words, for my CFD code, it is not possible to > parallelize > > > it > > > > > > effectively because the problem is too small? > > > > > > > > > > > > Is these true for all parallel solver, or just PETSc? I was > hoping > > > to > > > > > reduce > > > > > > the runtime since mine is an unsteady problem which requires > many > > > steps > > > > > to > > > > > > reach a periodic state and it takes many hours to reach it. > > > > > > > > > > Can you describe your specific application and how are you solving > it? > > > > > As Barry said, your need-for-speed is not likely to be solved by > > > > > running in parallel. > > > > > > > > > > > > > > > -- > > > > > Lisandro Dalc?n > > > > > --------------- > > > > > Centro Internacional de M?todos Computacionales en Ingenier?a > (CIMEC) > > > > > Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica > (INTEC) > > > > > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas > (CONICET) > > > > > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > > > > > Tel/Fax: +54-(0)342-451.1594 > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: configure.log Type: application/octet-stream Size: 2817836 bytes Desc: not available URL: From balay at mcs.anl.gov Sun Feb 11 21:41:29 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Sun, 11 Feb 2007 21:41:29 -0600 (CST) Subject: understanding the output from -info In-Reply-To: <804ab5d40702111921q2767248dte540b04e38a71236@mail.gmail.com> References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702100028sf595a2apae8aba2fda9251f3@mail.gmail.com> <804ab5d40702100117i5977f5bh9b161c026f16a32a@mail.gmail.com> <804ab5d40702101702s71c974d7u39a97d6ab8058cf4@mail.gmail.com> <804ab5d40702102141s258ef22due0093263f83dc7bb@mail.gmail.com> <804ab5d40702111626p2cbbf495ma954bcda1e3b75e8@mail.gmail.com> <804ab5d40702111921q2767248dte540b04e38a71236@mail.gmail.com> Message-ID: - If you have build isses [involing sending configure.log] please use petsc-maint at mcs.anl.gov address [not the mailing list] - Looks like you were using the following configure options: --with-cc=/scratch/g0306332/intel/cc/bin/icc --with-fc=/lsftmp/g0306332/inter/fc/bin/ifort --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 --with-mpi=0 --with-x=0 --with-shared But now - you are not specifing the compilers. The default compiler in your path must be Intel compilers version 7. Configure breaks with it. So sugest using the compilers that worked for you before. i.e --with-cc=/scratch/g0306332/intel/cc/bin/icc --with-fc=/lsftmp/g0306332/inter/fc/bin/ifort If you still have problem with hypre - remove externalpackages/hypre-1.11.1b and retry. Satish On Mon, 12 Feb 2007, Ben Tay wrote: > Hi, > > I tried to compile PETSc again and using --download-hypre=1. My command > given is > > ./config/configure.py --with-vendor-compilers=intel > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/ --wit > h-x=0 --with-shared --with-mpi-dir=/opt/mpich/myrinet/intel/ > --with-debugging=0 --download-hypre=1 > > I tried twice and the same error msg appears: > > Downloaded hypre could not be used. Please check install in > /nas/lsftmp/g0306332/petsc-2.3.2-p8/externalpackages/hypre-1.11.1b/linux-hypre. > I've attached the configure.log for your reference. > > Thank you. > From dimitri.lecas at c-s.fr Mon Feb 12 04:57:52 2007 From: dimitri.lecas at c-s.fr (LECAS Dimitri) Date: Mon, 12 Feb 2007 11:57:52 +0100 Subject: Partitioning on a mpiaij matrix Message-ID: <6590361b.361b6590@c-s.fr> ----- Original Message ----- From: Barry Smith Date: Friday, February 9, 2007 8:09 pm Subject: Re: Partitioning on a mpiaij matrix > > MatConvert() checks for a variety of converts; from the code > > /* 3) See if a good general converter is registered for the > desired class */ > conv = B->ops->convertfrom; > ierr = MatDestroy(B);CHKERRQ(ierr); > if (conv) goto foundconv; > > now MATMPIADJ has a MatConvertFrom that SHOULD be listed in the > function table > so it should not fall into the default MatConvert_Basic(). > > What version of PETSc are you using? Maybe an older one that does > not have > this converter? If you are using 2.3.2 or petsc-dev you can put a > breakpoint in MatConvert() and try to see why it is not picking up > the > convertfrom function? It is possible some bug that we are not aware of > but I have difficulty seeing what could be going wrong. 
> > Good luck, > > Barry > I add a line in matrix.c : /* 3) See if a good general converter is registered for the desired class */ fprintf(stderr, "Breakpoint : %p %p %p\n", B, B->ops, B->ops->convert); if (!conv) conv = B->ops->convert; ierr = MatDestroy(B);CHKERRQ(ierr); if (conv) goto foundconv; The output is : Breakpoint : 0x11c6670 0x11c6e30 (nil) [0]PETSC ERROR: --------------------- Error Message ---------------------------- -------- [0]PETSC ERROR: No support for this operation for this object type! [0]PETSC ERROR: Mat type mpiadj! [0]PETSC ERROR: ---------------------------------------------------------------- -------- [0]PETSC ERROR: Petsc Release Version 2.3.2, Patch 8, Tue Jan 2 14:33:59 PST 20 07 HG revision: ebeddcedcc065e32fc252af32cf1d01ed4fc7a80 Where is the function that can convert a mpiaij into a mpiadj matrix ? -- Dimitri Lecas From bsmith at mcs.anl.gov Mon Feb 12 07:42:04 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Mon, 12 Feb 2007 07:42:04 -0600 (CST) Subject: Partitioning on a mpiaij matrix In-Reply-To: <6590361b.361b6590@c-s.fr> References: <6590361b.361b6590@c-s.fr> Message-ID: It is convertfrom, not convert you need to check. In src/mat/impls/adj/mpi/mpiadj.c MatCreate_MPIAdj there is the line ierr = PetscMemcpy(B->ops,&MatOps_Values,sizeof(struct _MatOps));CHKERRQ(ierr); in the MatOps_Values above it there is /*60*/ 0, MatDestroy_MPIAdj, MatView_MPIAdj, MatConvertFrom_MPIAdj, 0, Therefor the conversion function convertfrom MUST be in the matrix ops table when the convert is called. But it is not for you, how is this possible? Barry On Mon, 12 Feb 2007, LECAS Dimitri wrote: > > > ----- Original Message ----- > From: Barry Smith > Date: Friday, February 9, 2007 8:09 pm > Subject: Re: Partitioning on a mpiaij matrix > > > > > MatConvert() checks for a variety of converts; from the code > > > > /* 3) See if a good general converter is registered for the > > desired class */ > > conv = B->ops->convertfrom; > > ierr = MatDestroy(B);CHKERRQ(ierr); > > if (conv) goto foundconv; > > > > now MATMPIADJ has a MatConvertFrom that SHOULD be listed in the > > function table > > so it should not fall into the default MatConvert_Basic(). > > > > What version of PETSc are you using? Maybe an older one that does > > not have > > this converter? If you are using 2.3.2 or petsc-dev you can put a > > breakpoint in MatConvert() and try to see why it is not picking up > > the > > convertfrom function? It is possible some bug that we are not aware of > > but I have difficulty seeing what could be going wrong. > > > > Good luck, > > > > Barry > > > > I add a line in matrix.c : > /* 3) See if a good general converter is registered for the desired class */ > fprintf(stderr, "Breakpoint : %p %p %p\n", B, B->ops, B->ops->convert); > if (!conv) conv = B->ops->convert; > ierr = MatDestroy(B);CHKERRQ(ierr); > if (conv) goto foundconv; > > The output is : > Breakpoint : 0x11c6670 0x11c6e30 (nil) > [0]PETSC ERROR: --------------------- Error Message > ---------------------------- > -------- > [0]PETSC ERROR: No support for this operation for this object type! > [0]PETSC ERROR: Mat type mpiadj! > [0]PETSC ERROR: > ---------------------------------------------------------------- > -------- > [0]PETSC ERROR: Petsc Release Version 2.3.2, Patch 8, Tue Jan 2 > 14:33:59 PST 20 > 07 HG revision: ebeddcedcc065e32fc252af32cf1d01ed4fc7a80 > > Where is the function that can convert a mpiaij into a mpiadj matrix ? 
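(For reference, a hypothetical C sketch of the call sequence being attempted in this thread: convert an assembled MPIAIJ matrix to MPIADJ and hand it to a MatPartitioning object. The function name is made up, and -mat_partitioning_type parmetis is just one possible choice; with MatConvertFrom_MPIAdj correctly registered, the MatConvert() call below is what should succeed:)

    #include "petscmat.h"

    /* A is an assembled MPIAIJ matrix */
    PetscErrorCode PartitionAIJ(Mat A,IS *is)
    {
      Mat             adj;
      MatPartitioning part;
      PetscErrorCode  ierr;

      PetscFunctionBegin;
      ierr = MatConvert(A,MATMPIADJ,MAT_INITIAL_MATRIX,&adj);CHKERRQ(ierr);
      ierr = MatPartitioningCreate(PETSC_COMM_WORLD,&part);CHKERRQ(ierr);
      ierr = MatPartitioningSetAdjacency(part,adj);CHKERRQ(ierr);
      ierr = MatPartitioningSetFromOptions(part);CHKERRQ(ierr); /* e.g. -mat_partitioning_type parmetis */
      ierr = MatPartitioningApply(part,is);CHKERRQ(ierr);
      ierr = MatPartitioningDestroy(part);CHKERRQ(ierr);        /* 2.3.2-style destroys take the object itself */
      ierr = MatDestroy(adj);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }
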
> > From zonexo at gmail.com Mon Feb 12 09:19:32 2007 From: zonexo at gmail.com (Ben Tay) Date: Mon, 12 Feb 2007 23:19:32 +0800 Subject: External software help Message-ID: <804ab5d40702120719ueaad01cy655372f7dbcd5d73@mail.gmail.com> Hi, I'm trying to experiment with using external solvers. I have some questions: 1. Is there any difference in speed with calling the external software from PETSc or directly using them? 2. I tried to install MUMPS using --download-mumps. I was prompted to include --with-scalapack. After changing, I was again prompted to include --with-blacs. I changed and again I was told the need for --with-blacs-dir=. I thought I'm supposed to specify where to install blacs and entered a directory. But it seems that I need to specify the location of where blacs is. But I do not have it. So how do I solve that? 3. I installed some other external packages. I wanted to test their speed at solving equations. In the manual, I was told to use the runtime option -mat_type -ksp_type preonly -pc_type and also -help to get help msg. However when I tried to issue ./a.out -mat_type superlu -ksp_type preonly -pc_type lu, nothing happened. How should the command be issued? I tried to get help by running ./a.out -h what appears isn't what I want. Thank you very much. Regards. -------------- next part -------------- An HTML attachment was scrubbed... URL: From balay at mcs.anl.gov Mon Feb 12 09:27:27 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 12 Feb 2007 09:27:27 -0600 (CST) Subject: External software help In-Reply-To: <804ab5d40702120719ueaad01cy655372f7dbcd5d73@mail.gmail.com> References: <804ab5d40702120719ueaad01cy655372f7dbcd5d73@mail.gmail.com> Message-ID: On Mon, 12 Feb 2007, Ben Tay wrote: > Hi, > > I'm trying to experiment with using external solvers. I have some questions: > > 1. Is there any difference in speed with calling the external software from > PETSc or directly using them? There is minor conversion overhead when you use them from PETSc. > > 2. I tried to install MUMPS using --download-mumps. I was prompted to > include --with-scalapack. After changing, I was again prompted to include > --with-blacs. I changed and again I was told the need for > --with-blacs-dir=. I thought I'm supposed to specify where to > install blacs and entered a directory. But it seems that I need to specify > the location of where blacs is. But I do not have it. So how do I solve > that? Mumps requires blacs & scalapack. So use: --download-blacs=1 --download-scalapack=1 --download-mumps=1 > > 3. I installed some other external packages. I wanted to test their speed at > solving equations. In the manual, I was told to use the runtime option > -mat_type -ksp_type preonly -pc_type and also -help to > get help msg. However when I tried to issue ./a.out -mat_type superlu > -ksp_type preonly -pc_type lu, nothing happened. How should the command be > issued? I tried to get help by running ./a.out -h what appears isn't what I > want. Did you install PETSc with superlu_dist? If so use '-mat_type superlu_dist' [Note: superlu & superlu_dist are different packages - the first one is sequential - the second one is parallel] Satish > > Thank you very much. Regards. 
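As a concrete illustration (the paths and process count are placeholders, and the download flags are the standard configure options), the whole chain might look like:

    ./config/configure.py --with-mpi-dir=/opt/mpich/myrinet/intel/ \
        --download-blacs=1 --download-scalapack=1 --download-mumps=1 \
        --download-superlu_dist=1

and then, at runtime, a parallel direct solve could be selected with:

    petscmpirun -n 4 ./a.out -ksp_type preonly -pc_type lu -mat_type superlu_dist
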
> From zonexo at gmail.com Mon Feb 12 09:53:35 2007 From: zonexo at gmail.com (Ben Tay) Date: Mon, 12 Feb 2007 23:53:35 +0800 Subject: External software help In-Reply-To: References: <804ab5d40702120719ueaad01cy655372f7dbcd5d73@mail.gmail.com> Message-ID: <804ab5d40702120753j5f227637m70aae8e71a929e68@mail.gmail.com> Hi Satish, I've installed superlu. I issued the command ./a.out -mat_type superlu -ksp_type preonly -pc_type lu and it just hanged there. Is it because I had install it with mpich? I also wanted to try umfpack and plapack. Is it similar? Btw, plapack 's option isn't in pg 82 of the manual. Thank you. On 2/12/07, Satish Balay wrote: > > On Mon, 12 Feb 2007, Ben Tay wrote: > > > Hi, > > > > I'm trying to experiment with using external solvers. I have some > questions: > > > > 1. Is there any difference in speed with calling the external software > from > > PETSc or directly using them? > > There is minor conversion overhead when you use them from PETSc. > > > > > 2. I tried to install MUMPS using --download-mumps. I was prompted to > > include --with-scalapack. After changing, I was again prompted to > include > > --with-blacs. I changed and again I was told the need for > > --with-blacs-dir=. I thought I'm supposed to specify where to > > install blacs and entered a directory. But it seems that I need to > specify > > the location of where blacs is. But I do not have it. So how do I solve > > that? > > Mumps requires blacs & scalapack. So use: > --download-blacs=1 --download-scalapack=1 --download-mumps=1 > > > > > 3. I installed some other external packages. I wanted to test their > speed at > > solving equations. In the manual, I was told to use the runtime option > > -mat_type -ksp_type preonly -pc_type and also -help > to > > get help msg. However when I tried to issue ./a.out -mat_type superlu > > -ksp_type preonly -pc_type lu, nothing happened. How should the command > be > > issued? I tried to get help by running ./a.out -h what appears isn't > what I > > want. > > Did you install PETSc with superlu_dist? If so use '-mat_type > superlu_dist' > > [Note: superlu & superlu_dist are different packages - the first one > is sequential - the second one is parallel] > > Satish > > > > > Thank you very much. Regards. > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zonexo at gmail.com Mon Feb 12 10:03:19 2007 From: zonexo at gmail.com (Ben Tay) Date: Tue, 13 Feb 2007 00:03:19 +0800 Subject: External software help In-Reply-To: <804ab5d40702120753j5f227637m70aae8e71a929e68@mail.gmail.com> References: <804ab5d40702120719ueaad01cy655372f7dbcd5d73@mail.gmail.com> <804ab5d40702120753j5f227637m70aae8e71a929e68@mail.gmail.com> Message-ID: <804ab5d40702120803n66ece10cs927a178a20ec366d@mail.gmail.com> Btw, how is Trilinos/ML used and installed? Is the command to download also --download*-*Trilinos/ML ? waht about the command to use it? Thank you. On 2/12/07, Ben Tay wrote: > > Hi Satish, > > I've installed superlu. I issued the command ./a.out -mat_type > superlu -ksp_type preonly -pc_type lu and it just hanged there. Is it > because I had install it with mpich? I also wanted to try umfpack and > plapack. Is it similar? > > Btw, plapack 's option isn't in pg 82 of the manual. > > > Thank you. > > > On 2/12/07, Satish Balay wrote: > > > > On Mon, 12 Feb 2007, Ben Tay wrote: > > > > > Hi, > > > > > > I'm trying to experiment with using external solvers. I have some > > questions: > > > > > > 1. 
Is there any difference in speed with calling the external software > > from > > > PETSc or directly using them? > > > > There is minor conversion overhead when you use them from PETSc. > > > > > > > > 2. I tried to install MUMPS using --download-mumps. I was prompted to > > > > > include --with-scalapack. After changing, I was again prompted to > > include > > > --with-blacs. I changed and again I was told the need for > > > --with-blacs-dir=. I thought I'm supposed to specify where > > to > > > install blacs and entered a directory. But it seems that I need to > > specify > > > the location of where blacs is. But I do not have it. So how do I > > solve > > > that? > > > > Mumps requires blacs & scalapack. So use: > > --download-blacs=1 --download-scalapack=1 --download-mumps=1 > > > > > > > > 3. I installed some other external packages. I wanted to test their > > speed at > > > solving equations. In the manual, I was told to use the runtime option > > > > > -mat_type -ksp_type preonly -pc_type and also -help > > to > > > get help msg. However when I tried to issue ./a.out -mat_type superlu > > > -ksp_type preonly -pc_type lu, nothing happened. How should the > > command be > > > issued? I tried to get help by running ./a.out -h what appears isn't > > what I > > > want. > > > > Did you install PETSc with superlu_dist? If so use '-mat_type > > superlu_dist' > > > > [Note: superlu & superlu_dist are different packages - the first one > > is sequential - the second one is parallel] > > > > Satish > > > > > > > > Thank you very much. Regards. > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From balay at mcs.anl.gov Mon Feb 12 10:08:59 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 12 Feb 2007 10:08:59 -0600 (CST) Subject: A 3D example of KSPSolve? In-Reply-To: <918742.70603.qm@web36202.mail.mud.yahoo.com> References: <918742.70603.qm@web36202.mail.mud.yahoo.com> Message-ID: Well some how the inbalance comes up in your application run - but not in the test example. It is possible that the application stresses your machine/memory-subsytem a lot more than the test code. Your machine has a NUMA [Non-unimform memory access] - so some messages are local [if the memory is local - and others can take atleast 3 hops trhough the AMD memory/hypertransport network. I was assuming the delays due to multiple hops might show up in this test runs I requested. [but it does not]. So perhaps these multiple hops cause delays only when the memort network gets stressed - as with your application? http://www.thg.ru/cpu/20040929/images/opteron_8way.gif I guess we'll just have to use your app to benchmark. Earlier I sugested using latest mpich with '--device=ch3:sshm'. Another option to try is '--with-device=ch3:nemesis' To do these experiments - you can build different versions of PETSc [so that you can switch between them all]. i.e use a different value for PETSC_ARCH for each build: It is possible that some of the load imbalance happens before the communication stages - but its visible only in the scatter state [in log_summary]. So to get a better idea on this - we'll need a Barrier in VecScatterBegin(). Not sure how to do this. Barry: does -log_sync add a barrier in vecscatter? Also - can you confirm that no-one-else/no-other-application is using this machine when you perform these measurement runs? Satish On Sat, 10 Feb 2007, Shi Jin wrote: > Furthermore, I did a multi-process test on the SMP. 
> petscmpirun -n 3 taskset -c 0,2,4 ./ex2 -ksp_type cg > -log_summary | egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 4.19617e-06 > Average time for zero size MPI_Send(): 3.65575e-06 > > petscmpirun -n 4 taskset -c 0,2,4,6 ./ex2 -ksp_type > cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 1.75953e-05 > Average time for zero size MPI_Send(): 2.44975e-05 > > petscmpirun -n 5 taskset -c 0,2,4,6,8 ./ex2 -ksp_type > cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 4.22001e-05 > Average time for zero size MPI_Send(): 2.54154e-05 > > petscmpirun -n 6 taskset -c 0,2,4,6,8,10 ./ex2 > -ksp_type cg -log_summary | egrep > \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 4.87804e-05 > Average time for zero size MPI_Send(): 1.83185e-05 > > petscmpirun -n 7 taskset -c 0,2,4,6,8,10,12 ./ex2 > -ksp_type cg -log_summary | egrep > \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 2.37942e-05 > Average time for zero size MPI_Send(): 5.00679e-06 > > petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 ./ex2 > -ksp_type cg -log_summary | egrep > \(MPI_Send\|MPI_Barrier\) > Average time for MPI_Barrier(): 1.35899e-05 > Average time for zero size MPI_Send(): 6.73532e-06 > > They all seem quite fast. > Shi > > --- Shi Jin wrote: > > > Yes. The results follow. > > --- Satish Balay wrote: > > > > > Can you send the optupt from the following runs. > > You > > > can do this with > > > src/ksp/ksp/examples/tutorials/ex2.c - to keep > > > things simple. > > > > > > petscmpirun -n 2 taskset -c 0,2 ./ex2 -log_summary > > | > > > egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 1.81198e-06 > > Average time for zero size MPI_Send(): 5.00679e-06 > > > petscmpirun -n 2 taskset -c 0,4 ./ex2 -log_summary > > | > > > egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 2.00272e-06 > > Average time for zero size MPI_Send(): 4.05312e-06 > > > petscmpirun -n 2 taskset -c 0,6 ./ex2 -log_summary > > | > > > egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 1.7643e-06 > > Average time for zero size MPI_Send(): 4.05312e-06 > > > petscmpirun -n 2 taskset -c 0,8 ./ex2 -log_summary > > | > > > egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 2.00272e-06 > > Average time for zero size MPI_Send(): 4.05312e-06 > > > petscmpirun -n 2 taskset -c 0,12 ./ex2 > > -log_summary > > > | egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 1.57356e-06 > > Average time for zero size MPI_Send(): 5.48363e-06 > > > petscmpirun -n 2 taskset -c 0,14 ./ex2 > > -log_summary > > > | egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 2.00272e-06 > > Average time for zero size MPI_Send(): 4.52995e-06 > > I also did > > petscmpirun -n 2 taskset -c 0,10 ./ex2 -log_summary > > | > > egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 5.00679e-06 > > Average time for zero size MPI_Send(): 3.93391e-06 > > > > > > The results are not so different from each other. > > Also > > please note, the timing is not exact, some times I > > got > > O(1e-5) timings for all cases. > > I assume these numbers are pretty good, right? Does > > it > > indicate that the MPI communication on a SMP machine > > is very fast? > > I will do a similar test on a cluster and report it > > back to the list. > > > > Shi > > > > > > > > > > > > > ____________________________________________________________________________________ > > Need Mail bonding? > > Go to the Yahoo! 
Mail Q&A for great tips from Yahoo! > > Answers users. > > > http://answers.yahoo.com/dir/?link=list&sid=396546091 > > > > > > > > > ____________________________________________________________________________________ > Yahoo! Music Unlimited > Access over 1 million songs. > http://music.yahoo.com/unlimited > > From balay at mcs.anl.gov Mon Feb 12 10:14:52 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 12 Feb 2007 10:14:52 -0600 (CST) Subject: External software help In-Reply-To: <804ab5d40702120753j5f227637m70aae8e71a929e68@mail.gmail.com> References: <804ab5d40702120719ueaad01cy655372f7dbcd5d73@mail.gmail.com> <804ab5d40702120753j5f227637m70aae8e71a929e68@mail.gmail.com> Message-ID: On Mon, 12 Feb 2007, Ben Tay wrote: > Hi Satish, > > I've installed superlu. I issued the command ./a.out -mat_type > superlu -ksp_type preonly -pc_type lu and it just hanged there. Did you install superlu separately? Sugest installing with PETSc configure option '--download-superlu=1. > Is it because I had install it with mpich? No - its because superlu includes some blas code - that will hang if compiled 'with -O' - esp with intel compilers. PETSc configure handles this correctly. > I also wanted to try umfpack and plapack. Is it similar? > Btw, plapack 's option isn't in pg 82 of the manual. I believe plapack is for parallel dense usage - so perhaps its not appropriate for your usage.. Satish From hzhang at mcs.anl.gov Mon Feb 12 10:16:28 2007 From: hzhang at mcs.anl.gov (Hong Zhang) Date: Mon, 12 Feb 2007 10:16:28 -0600 (CST) Subject: External software help In-Reply-To: <804ab5d40702120753j5f227637m70aae8e71a929e68@mail.gmail.com> References: <804ab5d40702120719ueaad01cy655372f7dbcd5d73@mail.gmail.com> <804ab5d40702120753j5f227637m70aae8e71a929e68@mail.gmail.com> Message-ID: You may test the installation of superlu using petsc example src/ksp/ksp/examples/tutorials/ex5.c: e.g., ./ex5 -ksp_type preonly -pc_type lu -mat_type superlu -ksp_view | more KSP Object: type: preonly maximum iterations=10000, initial guess is zero tolerances: relative=1e-05, absolute=1e-50, divergence=10000 left preconditioning PC Object: type: lu LU: out-of-place factorization matrix ordering: nd LU: tolerance for zero pivot 1e-12 LU: factor fill ratio needed 0 Factored matrix follows Matrix Object: type=superlu, rows=6, cols=6 total: nonzeros=0, allocated nonzeros=6 not using I-node routines SuperLU run parameters: Equil: NO ColPerm: 3 IterRefine: 0 SymmetricMode: NO DiagPivotThresh: 1 PivotGrowth: NO ConditionNumber: NO RowPerm: 0 ReplaceTinyPivot: NO PrintStat: NO lwork: 0 linear system matrix = precond matrix: Matrix Object: type=superlu, rows=6, cols=6 total: nonzeros=20, allocated nonzeros=30 not using I-node routines Norm of error < 1.e-12, Iterations 1 KSP Object: ... > I've installed superlu. I issued the command ./a.out -mat_type > superlu -ksp_type preonly -pc_type lu and it just hanged there. Is it > because I had install it with mpich? I also wanted to try umfpack and > plapack. Is it similar? > > Btw, plapack 's option isn't in pg 82 of the manual. I'll add it. Thanks, Hong > > > Thank you. > > > On 2/12/07, Satish Balay wrote: > > > > On Mon, 12 Feb 2007, Ben Tay wrote: > > > > > Hi, > > > > > > I'm trying to experiment with using external solvers. I have some > > questions: > > > > > > 1. Is there any difference in speed with calling the external software > > from > > > PETSc or directly using them? > > > > There is minor conversion overhead when you use them from PETSc. > > > > > > > > 2. 
I tried to install MUMPS using --download-mumps. I was prompted to > > > include --with-scalapack. After changing, I was again prompted to > > include > > > --with-blacs. I changed and again I was told the need for > > > --with-blacs-dir=. I thought I'm supposed to specify where to > > > install blacs and entered a directory. But it seems that I need to > > specify > > > the location of where blacs is. But I do not have it. So how do I solve > > > that? > > > > Mumps requires blacs & scalapack. So use: > > --download-blacs=1 --download-scalapack=1 --download-mumps=1 > > > > > > > > 3. I installed some other external packages. I wanted to test their > > speed at > > > solving equations. In the manual, I was told to use the runtime option > > > -mat_type -ksp_type preonly -pc_type and also -help > > to > > > get help msg. However when I tried to issue ./a.out -mat_type superlu > > > -ksp_type preonly -pc_type lu, nothing happened. How should the command > > be > > > issued? I tried to get help by running ./a.out -h what appears isn't > > what I > > > want. > > > > Did you install PETSc with superlu_dist? If so use '-mat_type > > superlu_dist' > > > > [Note: superlu & superlu_dist are different packages - the first one > > is sequential - the second one is parallel] > > > > Satish > > > > > > > > Thank you very much. Regards. > > > > > > > > From balay at mcs.anl.gov Mon Feb 12 10:22:16 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 12 Feb 2007 10:22:16 -0600 (CST) Subject: External software help In-Reply-To: <804ab5d40702120803n66ece10cs927a178a20ec366d@mail.gmail.com> References: <804ab5d40702120719ueaad01cy655372f7dbcd5d73@mail.gmail.com> <804ab5d40702120753j5f227637m70aae8e71a929e68@mail.gmail.com> <804ab5d40702120803n66ece10cs927a178a20ec366d@mail.gmail.com> Message-ID: On Tue, 13 Feb 2007, Ben Tay wrote: > Btw, how is Trilinos/ML used > and installed? Is the command to download also > --download*-*Trilinos/ML ? > waht about the command to use it? To install ML - use: --dowload-ml=1 Usage is: '-pc_type ml' Satish From dimitri.lecas at c-s.fr Mon Feb 12 10:44:03 2007 From: dimitri.lecas at c-s.fr (LECAS Dimitri) Date: Mon, 12 Feb 2007 17:44:03 +0100 Subject: Partitioning on a mpiaij matrix Message-ID: ----- Original Message ----- From: Barry Smith Date: Monday, February 12, 2007 2:42 pm Subject: Re: Partitioning on a mpiaij matrix > > It is convertfrom, not convert you need to check. > > In src/mat/impls/adj/mpi/mpiadj.c MatCreate_MPIAdj > there is the line > ierr = PetscMemcpy(B- > >ops,&MatOps_Values,sizeof(struct _MatOps));CHKERRQ(ierr); > in the MatOps_Values above it there is > /*60*/ 0, > MatDestroy_MPIAdj, > MatView_MPIAdj, > MatConvertFrom_MPIAdj, > 0, > > Therefor the conversion function convertfrom MUST be in the matrix > ops table > when the convert is called. But it is not for you, how is this > possible? > > Barry > I made some progress but it's not very comprehensible It's seems there is a problem with parmetis. When i compile petsc without parmetis, this line don't give an error. CALL Matconvert(mat,MATMPIADJ,MAT_INITIAL_MATRIX,mat2,ierr) CHKERRQ(ierr) With mat created with CALL MatCreateMPIAIJ (PETSC_COMM_WORLD, partTab(rk+1), partTab(rk+1), N, N, 30 , PETSC_NULL_INTEGER, 30, PETSC_NULL_INTEGER, mat, ierr) But, when petsc is compiled with parmetis, the call to MatConvert give the error [0]PETSC ERROR: No support for this operation for this object type! [0]PETSC ERROR: Mat type mpiadj! 
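For reference, a minimal C sketch of the convert-then-partition sequence being attempted above. It is not from the original thread, the helper name PartitionRows is made up for illustration, the calls follow PETSc 2.3.x-era signatures (the Destroy routines in particular changed in later releases), and it does not address the parmetis-related error itself; it only shows the intended usage.

#include "petscmat.h"

/* Hypothetical helper: compute a new row-to-process assignment for an
   assembled MPIAIJ matrix by converting it to MPIADJ and running a
   partitioner (e.g. parmetis) on the adjacency structure. */
PetscErrorCode PartitionRows(Mat A, IS *newproc)
{
  Mat             adj;
  MatPartitioning part;
  PetscErrorCode  ierr;

  PetscFunctionBegin;
  /* Only the nonzero structure is kept in the MPIADJ matrix */
  ierr = MatConvert(A, MATMPIADJ, MAT_INITIAL_MATRIX, &adj);CHKERRQ(ierr);

  ierr = MatPartitioningCreate(PETSC_COMM_WORLD, &part);CHKERRQ(ierr);
  ierr = MatPartitioningSetAdjacency(part, adj);CHKERRQ(ierr);
  ierr = MatPartitioningSetFromOptions(part);CHKERRQ(ierr); /* e.g. -mat_partitioning_type parmetis */
  ierr = MatPartitioningApply(part, newproc);CHKERRQ(ierr); /* target process for each local row */

  ierr = MatPartitioningDestroy(part);CHKERRQ(ierr); /* 2.3.x-style Destroy calls */
  ierr = MatDestroy(adj);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}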
-- Dimitri Lecas From bhatiamanav at gmail.com Mon Feb 12 17:26:33 2007 From: bhatiamanav at gmail.com (Manav Bhatia) Date: Mon, 12 Feb 2007 15:26:33 -0800 Subject: nonlinear solvers Message-ID: Hi, I am using the nonlinear solvers in Petsc. My application requires the jacobian at the final nonlinear solution, since after the nonlinear solution I solve a linear system of equations with the jacobian as the system matrix. I am curious to know if it is safe to assume that for all nonlinear solvers in Petsc, the last jacobian used before convergence is same as the jacobian evaluated at the final solution. If this is the case, then I will not need to evaluate the jacobian again, otherwise, I will need to compute it again after the final solution. Kindly help me with your comments. Thanks, Manav From knepley at gmail.com Mon Feb 12 21:23:49 2007 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 12 Feb 2007 21:23:49 -0600 Subject: nonlinear solvers In-Reply-To: References: Message-ID: On 2/12/07, Manav Bhatia wrote: > Hi, > > I am using the nonlinear solvers in Petsc. My application requires > the jacobian at the final nonlinear solution, since after the > nonlinear solution I solve a linear system of equations with the > jacobian as the system matrix. > I am curious to know if it is safe to assume that for all > nonlinear solvers in Petsc, the last jacobian used before convergence > is same as the jacobian evaluated at the final solution. If this is > the case, then I will not need to evaluate the jacobian again, > otherwise, I will need to compute it again after the final solution. Never. We solve the Newton Equation, update the solution, and THEN check for convergence. The Jacobian would not be updated. Matt > Kindly help me with your comments. > > Thanks, > Manav > > -- One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. -- Drummond Rennie From jinzishuai at yahoo.com Mon Feb 12 22:22:14 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Mon, 12 Feb 2007 20:22:14 -0800 (PST) Subject: A 3D example of KSPSolve? In-Reply-To: Message-ID: <20070213042214.67470.qmail@web36215.mail.mud.yahoo.com> Thank you Satish. I cannot say that no one is using that machine when I ran. But I made sure that when I use taskset, the processors are exclusively mine. The total number of running jobs is always smaller than the available runs. I think I will stop benchmarking the SMP machine for the time being and focus my concentration on the code to run on a distributed memory cluster. I think it is very likely I can make some improvement to the existing code by tuning the linear solver and preconditioner. I am starting another thread on how to use the incomplete cholesky docomposition (ICC) as a preconditioner for my congugate gradient method. When I am satisfied with the code and its performance on a cluster, I will revisit the SMP issue so that we might achieve better performance when the number of processes is not too large (<-8). Thank you very much. 
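As a concrete starting point for that tuning, here is a minimal C sketch (not from the thread; the helper name SetupCG is made up) of wiring up CG while leaving the preconditioner switchable from the command line:

#include "petscksp.h"

/* Hypothetical helper: select CG for an SPD system, default to block
   Jacobi in parallel, and let runtime options override everything. */
PetscErrorCode SetupCG(KSP ksp)
{
  PC             pc;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = KSPSetType(ksp, KSPCG);CHKERRQ(ierr);
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCBJACOBI);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr); /* e.g. -sub_pc_type icc -ksp_monitor */
  PetscFunctionReturn(0);
}

With this in place, the ICC-on-each-block variant discussed further down in this thread can be selected at runtime with -pc_type bjacobi -sub_pc_type icc, without recompiling.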
Shi T Shi --- Satish Balay wrote: > Well some how the inbalance comes up in your > application run - but not > in the test example. It is possible that the > application stresses your > machine/memory-subsytem a lot more than the test > code. > > Your machine has a NUMA [Non-unimform memory access] > - so some > messages are local [if the memory is local - and > others can take > atleast 3 hops trhough the AMD memory/hypertransport > network. I was > assuming the delays due to multiple hops might show > up in this test > runs I requested. [but it does not]. > > So perhaps these multiple hops cause delays only > when the memort > network gets stressed - as with your application? > > http://www.thg.ru/cpu/20040929/images/opteron_8way.gif > > I guess we'll just have to use your app to > benchmark. Earlier I > sugested using latest mpich with > '--device=ch3:sshm'. Another option > to try is '--with-device=ch3:nemesis' > > To do these experiments - you can build different > versions of PETSc > [so that you can switch between them all]. i.e use a > different value > for PETSC_ARCH for each build: > > It is possible that some of the load imbalance > happens before the > communication stages - but its visible only in the > scatter state [in > log_summary]. So to get a better idea on this - > we'll need a Barrier > in VecScatterBegin(). Not sure how to do this. > > Barry: does -log_sync add a barrier in vecscatter? > > Also - can you confirm that > no-one-else/no-other-application is using > this machine when you perform these measurement > runs? > > Satish > > On Sat, 10 Feb 2007, Shi Jin wrote: > > > Furthermore, I did a multi-process test on the > SMP. > > petscmpirun -n 3 taskset -c 0,2,4 ./ex2 -ksp_type > cg > > -log_summary | egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 4.19617e-06 > > Average time for zero size MPI_Send(): 3.65575e-06 > > > > petscmpirun -n 4 taskset -c 0,2,4,6 ./ex2 > -ksp_type > > cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 1.75953e-05 > > Average time for zero size MPI_Send(): 2.44975e-05 > > > > petscmpirun -n 5 taskset -c 0,2,4,6,8 ./ex2 > -ksp_type > > cg -log_summary | egrep \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 4.22001e-05 > > Average time for zero size MPI_Send(): 2.54154e-05 > > > > petscmpirun -n 6 taskset -c 0,2,4,6,8,10 ./ex2 > > -ksp_type cg -log_summary | egrep > > \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 4.87804e-05 > > Average time for zero size MPI_Send(): 1.83185e-05 > > > > petscmpirun -n 7 taskset -c 0,2,4,6,8,10,12 ./ex2 > > -ksp_type cg -log_summary | egrep > > \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 2.37942e-05 > > Average time for zero size MPI_Send(): 5.00679e-06 > > > > petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 > ./ex2 > > -ksp_type cg -log_summary | egrep > > \(MPI_Send\|MPI_Barrier\) > > Average time for MPI_Barrier(): 1.35899e-05 > > Average time for zero size MPI_Send(): 6.73532e-06 > > > > They all seem quite fast. > > Shi > > > > --- Shi Jin wrote: > > > > > Yes. The results follow. > > > --- Satish Balay wrote: > > > > > > > Can you send the optupt from the following > runs. > > > You > > > > can do this with > > > > src/ksp/ksp/examples/tutorials/ex2.c - to keep > > > > things simple. 
> > > > > > > > petscmpirun -n 2 taskset -c 0,2 ./ex2 > -log_summary > > > | > > > > egrep \(MPI_Send\|MPI_Barrier\) > > > Average time for MPI_Barrier(): 1.81198e-06 > > > Average time for zero size MPI_Send(): > 5.00679e-06 > > > > petscmpirun -n 2 taskset -c 0,4 ./ex2 > -log_summary > > > | > > > > egrep \(MPI_Send\|MPI_Barrier\) > > > Average time for MPI_Barrier(): 2.00272e-06 > > > Average time for zero size MPI_Send(): > 4.05312e-06 > > > > petscmpirun -n 2 taskset -c 0,6 ./ex2 > -log_summary > > > | > > > > egrep \(MPI_Send\|MPI_Barrier\) > > > Average time for MPI_Barrier(): 1.7643e-06 > > > Average time for zero size MPI_Send(): > 4.05312e-06 > > > > petscmpirun -n 2 taskset -c 0,8 ./ex2 > -log_summary > > > | > > > > egrep \(MPI_Send\|MPI_Barrier\) > > > Average time for MPI_Barrier(): 2.00272e-06 > > > Average time for zero size MPI_Send(): > 4.05312e-06 > > > > petscmpirun -n 2 taskset -c 0,12 ./ex2 > > > -log_summary > > > > | egrep \(MPI_Send\|MPI_Barrier\) > > > Average time for MPI_Barrier(): 1.57356e-06 > > > Average time for zero size MPI_Send(): > 5.48363e-06 > > > > petscmpirun -n 2 taskset -c 0,14 ./ex2 > > > -log_summary > > > > | egrep \(MPI_Send\|MPI_Barrier\) > > > Average time for MPI_Barrier(): 2.00272e-06 > > > Average time for zero size MPI_Send(): > 4.52995e-06 > > > I also did > > > petscmpirun -n 2 taskset -c 0,10 ./ex2 > -log_summary > > > | > > > egrep \(MPI_Send\|MPI_Barrier\) > > > Average time for MPI_Barrier(): 5.00679e-06 > > > Average time for zero size MPI_Send(): > 3.93391e-06 > > > > > > > > > The results are not so different from each > other. > > > Also > > > please note, the timing is not exact, some times > I > > > got > > > O(1e-5) timings for all cases. > > > I assume these numbers are pretty good, right? > Does > > > it > > > indicate that the MPI communication on a SMP > machine > > > is very fast? > > > I will do a similar test on a cluster and report > it > > > back to the list. > > > > > > Shi > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Need Mail bonding? > > > Go to the Yahoo! Mail Q&A for great tips from > Yahoo! > > > Answers users. > > > > > > http://answers.yahoo.com/dir/?link=list&sid=396546091 > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > Yahoo! Music Unlimited > > Access over 1 million songs. > === message truncated === ____________________________________________________________________________________ Have a burning question? Go to www.Answers.yahoo.com and get answers from real people who know. From jinzishuai at yahoo.com Mon Feb 12 22:43:52 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Mon, 12 Feb 2007 20:43:52 -0800 (PST) Subject: Using ICC for MPISBAIJ? Message-ID: <205544.61365.qm@web36213.mail.mud.yahoo.com> Hi All, Thank you very much for the help you gave me in tuning my code. I now think it is important for us to take advantage of the symmetric positive definiteness property of our Matrix, i.e., we should use the conjugate gradient (CG) method with incomplete Cholesky decomposition (ICC) as the pre-conditioner (I assume this is commonly accepted at least for serial computation, right?). However, I am surprised and disappointed to realize that the -pc_type icc option only exists for seqsbaij Matrices. In order to parallelize the linear solver, I have to use the external package BlockSolve95. 
I took a look at this package at http://www-unix.mcs.anl.gov/sumaa3d/BlockSolve/ I am very disappointed to see it hasn't been in development ever since 1997. I am worried it does not provide a state-of-art performance. Nevertheless, I gave it a try. The package is not as easy to build as common linux software (even much worse than Petsc), especially according their REAME, it is unknown to work with linux. However, by hand-editing the bmake/linux/linux.site file, I seemed to be able to build the library. However, the examples doesn't build and the PETSC built with BlockSolve95 gives me errors in linking like: undefined referece to "dgemv_" and "dgetrf_". In another place of the PETSC mannul, I found there is another external package "Spooles" that can also be used with mpisbaij and Cholesky PC. But it is also dated in 1999. Could anyone give me some advice what is the best way to go to solve a large sparse symmetric positive definite linux system efficiently using MPI on a cluster? Thank you very much. Shi ____________________________________________________________________________________ Don't get soaked. Take a quick peak at the forecast with the Yahoo! Search weather shortcut. http://tools.search.yahoo.com/shortcuts/#loc_weather From hzhang at mcs.anl.gov Mon Feb 12 23:17:33 2007 From: hzhang at mcs.anl.gov (Hong Zhang) Date: Mon, 12 Feb 2007 23:17:33 -0600 (CST) Subject: Using ICC for MPISBAIJ? In-Reply-To: <205544.61365.qm@web36213.mail.mud.yahoo.com> References: <205544.61365.qm@web36213.mail.mud.yahoo.com> Message-ID: > Thank you very much for the help you gave me in tuning > my code. I now think it is important for us to take > advantage of the symmetric positive definiteness > property of our Matrix, i.e., we should use the > conjugate gradient (CG) method with incomplete > Cholesky decomposition (ICC) as the pre-conditioner (I > assume this is commonly accepted at least for serial > computation, right?). Yes. > However, I am surprised and disappointed to realize > that the -pc_type icc option only exists for seqsbaij > Matrices. In order to parallelize the linear solver, I icc also works for seqaij type, which enables more efficient data accessing than seqsbaij. > have to use the external package BlockSolve95. > I took a look at this package at > http://www-unix.mcs.anl.gov/sumaa3d/BlockSolve/ > I am very disappointed to see it hasn't been in > development ever since 1997. I am worried it does not > provide a state-of-art performance. > > Nevertheless, I gave it a try. The package is not as > easy to build as common linux software (even much > worse than Petsc), especially according their REAME, > it is unknown to work with linux. However, by > hand-editing the bmake/linux/linux.site file, I seemed > to be able to build the library. However, the examples > doesn't build and the PETSC built with BlockSolve95 > gives me errors in linking like: > undefined referece to "dgemv_" and "dgetrf_". This seems relates to linking lapack. Satish might knows about it. > > In another place of the PETSC mannul, I found there is > another external package "Spooles" that can also be > used with mpisbaij and Cholesky PC. But it is also > dated in 1999. Spooles is sparse direct solver. Although it has been out of support since 99, we find it is still in good quality, especially it has good robustness and portability. Petsc also interfaces with other well-maintained sparse direct solvers, e.g., mumps and superlu_dist. 
When matrices are in the order of 100k or less and ill-conditioned, the direct solvers are good choices. > > Could anyone give me some advice what is the best way > to go to solve a large sparse symmetric positive > definite linux system efficiently using MPI on a > cluster? The performance is application dependant. Petsc allows you testing various algorithms at runtime. Use '-help' to see all possible options. Run your application with '-log_summary' to collect and compare performance data. Good luck, Hong > > > > ____________________________________________________________________________________ > Don't get soaked. Take a quick peak at the forecast > with the Yahoo! Search weather shortcut. > http://tools.search.yahoo.com/shortcuts/#loc_weather > > From hzhang at mcs.anl.gov Mon Feb 12 23:22:48 2007 From: hzhang at mcs.anl.gov (Hong Zhang) Date: Mon, 12 Feb 2007 23:22:48 -0600 (CST) Subject: Using ICC for MPISBAIJ? In-Reply-To: <205544.61365.qm@web36213.mail.mud.yahoo.com> References: <205544.61365.qm@web36213.mail.mud.yahoo.com> Message-ID: I forget to tell you that you can use parallel CG with block-jacobi, and sequential icc within the diagonal blocks. Example, run src/ksp/ksp/examples/tutorials/ex5 with mpirun -np 2 ./ex5 -ksp_type cg -pc_type bjacobi -sub_pc_type icc -ksp_view Use '-help' to get many options on icc. Hong On Mon, 12 Feb 2007, Shi Jin wrote: > Hi All, > > Thank you very much for the help you gave me in tuning > my code. I now think it is important for us to take > advantage of the symmetric positive definiteness > property of our Matrix, i.e., we should use the > conjugate gradient (CG) method with incomplete > Cholesky decomposition (ICC) as the pre-conditioner (I > assume this is commonly accepted at least for serial > computation, right?). > However, I am surprised and disappointed to realize > that the -pc_type icc option only exists for seqsbaij > Matrices. In order to parallelize the linear solver, I > have to use the external package BlockSolve95. > I took a look at this package at > http://www-unix.mcs.anl.gov/sumaa3d/BlockSolve/ > I am very disappointed to see it hasn't been in > development ever since 1997. I am worried it does not > provide a state-of-art performance. > > Nevertheless, I gave it a try. The package is not as > easy to build as common linux software (even much > worse than Petsc), especially according their REAME, > it is unknown to work with linux. However, by > hand-editing the bmake/linux/linux.site file, I seemed > to be able to build the library. However, the examples > doesn't build and the PETSC built with BlockSolve95 > gives me errors in linking like: > undefined referece to "dgemv_" and "dgetrf_". > > In another place of the PETSC mannul, I found there is > another external package "Spooles" that can also be > used with mpisbaij and Cholesky PC. But it is also > dated in 1999. > > Could anyone give me some advice what is the best way > to go to solve a large sparse symmetric positive > definite linux system efficiently using MPI on a > cluster? > > Thank you very much. > Shi > > > > > ____________________________________________________________________________________ > Don't get soaked. Take a quick peak at the forecast > with the Yahoo! Search weather shortcut. 
> http://tools.search.yahoo.com/shortcuts/#loc_weather > > From zonexo at gmail.com Tue Feb 13 00:28:41 2007 From: zonexo at gmail.com (Ben Tay) Date: Tue, 13 Feb 2007 14:28:41 +0800 Subject: understanding the output from -info In-Reply-To: References: <804ab5d40702080747g69ee8b44h54cce509177ee0a8@mail.gmail.com> <804ab5d40702101702s71c974d7u39a97d6ab8058cf4@mail.gmail.com> <804ab5d40702102141s258ef22due0093263f83dc7bb@mail.gmail.com> <804ab5d40702111626p2cbbf495ma954bcda1e3b75e8@mail.gmail.com> <804ab5d40702111921q2767248dte540b04e38a71236@mail.gmail.com> Message-ID: <804ab5d40702122228k66de0664t62fe2e12db3be1a9@mail.gmail.com> Ya thanks for the suggestion. strangely it worked. However, if I had not included hypre, the original command also worked. Tks anyway. On 2/12/07, Satish Balay wrote: > > - If you have build isses [involing sending configure.log] please use > petsc-maint at mcs.anl.gov address [not the mailing list] > > - Looks like you were using the following configure options: > > --with-cc=/scratch/g0306332/intel/cc/bin/icc > --with-fc=/lsftmp/g0306332/inter/fc/bin/ifort > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 > --with-mpi=0 --with-x=0 --with-shared > > But now - you are not specifing the compilers. The default compiler in > your path must be Intel compilers version 7. Configure breaks with it. > So sugest using the compilers that worked for you before. i.e > > --with-cc=/scratch/g0306332/intel/cc/bin/icc > --with-fc=/lsftmp/g0306332/inter/fc/bin/ifort > > If you still have problem with hypre - remove > externalpackages/hypre-1.11.1b and retry. > > Satish > > On Mon, 12 Feb 2007, Ben Tay wrote: > > > Hi, > > > > I tried to compile PETSc again and using --download-hypre=1. My command > > given is > > > > ./config/configure.py --with-vendor-compilers=intel > > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/ --wit > > h-x=0 --with-shared --with-mpi-dir=/opt/mpich/myrinet/intel/ > > --with-debugging=0 --download-hypre=1 > > > > I tried twice and the same error msg appears: > > > > Downloaded hypre could not be used. Please check install in > > /nas/lsftmp/g0306332/petsc-2.3.2-p8/externalpackages/hypre-1.11.1b > /linux-hypre. > > I've attached the configure.log for your reference. > > > > Thank you. > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at mcs.anl.gov Tue Feb 13 09:07:20 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 13 Feb 2007 09:07:20 -0600 (CST) Subject: Using ICC for MPISBAIJ? In-Reply-To: <205544.61365.qm@web36213.mail.mud.yahoo.com> References: <205544.61365.qm@web36213.mail.mud.yahoo.com> Message-ID: For a moderate number of processes -pc_type bjacobi -sub_pc_type icc -sub_ksp_type preonly or -pc_type asm -sub_pc_type icc -sub_ksp_type preonly Barry On Mon, 12 Feb 2007, Shi Jin wrote: > Hi All, > > Thank you very much for the help you gave me in tuning > my code. I now think it is important for us to take > advantage of the symmetric positive definiteness > property of our Matrix, i.e., we should use the > conjugate gradient (CG) method with incomplete > Cholesky decomposition (ICC) as the pre-conditioner (I > assume this is commonly accepted at least for serial > computation, right?). > However, I am surprised and disappointed to realize > that the -pc_type icc option only exists for seqsbaij > Matrices. In order to parallelize the linear solver, I > have to use the external package BlockSolve95. 
> I took a look at this package at > http://www-unix.mcs.anl.gov/sumaa3d/BlockSolve/ > I am very disappointed to see it hasn't been in > development ever since 1997. I am worried it does not > provide a state-of-art performance. > > Nevertheless, I gave it a try. The package is not as > easy to build as common linux software (even much > worse than Petsc), especially according their REAME, > it is unknown to work with linux. However, by > hand-editing the bmake/linux/linux.site file, I seemed > to be able to build the library. However, the examples > doesn't build and the PETSC built with BlockSolve95 > gives me errors in linking like: > undefined referece to "dgemv_" and "dgetrf_". > > In another place of the PETSC mannul, I found there is > another external package "Spooles" that can also be > used with mpisbaij and Cholesky PC. But it is also > dated in 1999. > > Could anyone give me some advice what is the best way > to go to solve a large sparse symmetric positive > definite linux system efficiently using MPI on a > cluster? > > Thank you very much. > Shi > > > > > ____________________________________________________________________________________ > Don't get soaked. Take a quick peak at the forecast > with the Yahoo! Search weather shortcut. > http://tools.search.yahoo.com/shortcuts/#loc_weather > > From dimitri.lecas at c-s.fr Wed Feb 14 14:07:39 2007 From: dimitri.lecas at c-s.fr (LECAS Dimitri) Date: Wed, 14 Feb 2007 21:07:39 +0100 Subject: Differents number of iterations with the same problem Message-ID: <1fd0920e75.20e751fd09@c-s.fr> Hello I'm surprised to not have the same numbers of iterations when i run several instance of my program with the same number of processors and the same matrix/right hand side. Is there some random or it's asynchronous message passing fault ? PS: My code used KSPSolve with bicg and jacobi for pc) From knepley at gmail.com Wed Feb 14 14:14:51 2007 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 14 Feb 2007 14:14:51 -0600 Subject: Differents number of iterations with the same problem In-Reply-To: <1fd0920e75.20e751fd09@c-s.fr> References: <1fd0920e75.20e751fd09@c-s.fr> Message-ID: No there is no randomness. I suspect the matrix/rhs is not the same. Matt On 2/14/07, LECAS Dimitri wrote: > > Hello > > I'm surprised to not have the same numbers of iterations when i run > several instance of my program with the same number of processors and > the same matrix/right hand side. Is there some random or it's > asynchronous message passing fault ? > > PS: My code used KSPSolve with bicg and jacobi for pc) > > -- One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. -- Drummond Rennie -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bsmith at mcs.anl.gov Wed Feb 14 16:47:03 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 14 Feb 2007 16:47:03 -0600 (CST) Subject: Differents number of iterations with the same problem In-Reply-To: <1fd0920e75.20e751fd09@c-s.fr> References: <1fd0920e75.20e751fd09@c-s.fr> Message-ID: With jacobi it probably requires a lot of iterations? Then it would not supprise me to have a slight difference in iteration count. If it is taking 1000's of iterations then I would expect large differences in iteration count. Barry On Wed, 14 Feb 2007, LECAS Dimitri wrote: > Hello > > I'm surprised to not have the same numbers of iterations when i run > several instance of my program with the same number of processors and > the same matrix/right hand side. Is there some random or it's > asynchronous message passing fault ? > > PS: My code used KSPSolve with bicg and jacobi for pc) > > From manav at u.washington.edu Thu Feb 15 02:56:53 2007 From: manav at u.washington.edu (Manav Bhatia) Date: Thu, 15 Feb 2007 00:56:53 -0800 Subject: matrix addition Message-ID: Hi, I need to add two matrices, but I did not find a direct function to do so. Is there a specific reason not having a matrix addition function? In the absence of such a function, I am thinking of extracting each row of the two matrices and adding them. Would there be a more efficient way to do the same? Kindly help me with your advice. Thanks, Manav From DOMI0002 at ntu.edu.sg Thu Feb 15 03:24:08 2007 From: DOMI0002 at ntu.edu.sg (#DOMINIC DENVER JOHN CHANDAR#) Date: Thu, 15 Feb 2007 17:24:08 +0800 Subject: matrix addition In-Reply-To: Message-ID: How about MatAXPY() ? http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/ manualpages/Mat/MatAXPY.html Computes Y = aX + Y , where X, Y are matrices. Set a=1. -Dominic -----Original Message----- From: owner-petsc-users at mcs.anl.gov [mailto:owner-petsc-users at mcs.anl.gov] On Behalf Of Manav Bhatia Sent: Thursday, February 15, 2007 4:57 PM To: petsc-users at mcs.anl.gov Subject: matrix addition Hi, I need to add two matrices, but I did not find a direct function to do so. Is there a specific reason not having a matrix addition function? In the absence of such a function, I am thinking of extracting each row of the two matrices and adding them. Would there be a more efficient way to do the same? Kindly help me with your advice. Thanks, Manav From manav at u.washington.edu Thu Feb 15 03:29:00 2007 From: manav at u.washington.edu (Manav Bhatia) Date: Thu, 15 Feb 2007 01:29:00 -0800 Subject: matrix addition In-Reply-To: References: Message-ID: ooppss.... totally missed it... Thanks, :-) Manav On Feb 15, 2007, at 1:24 AM, #DOMINIC DENVER JOHN CHANDAR# wrote: > How about MatAXPY() ? > > > http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/ > docs/ > manualpages/Mat/MatAXPY.html > > Computes Y = aX + Y , where X, Y are matrices. Set a=1. > > > -Dominic > > > > -----Original Message----- > From: owner-petsc-users at mcs.anl.gov > [mailto:owner-petsc-users at mcs.anl.gov] On Behalf Of Manav Bhatia > Sent: Thursday, February 15, 2007 4:57 PM > To: petsc-users at mcs.anl.gov > Subject: matrix addition > > Hi, > I need to add two matrices, but I did not find a direct function to > do so. Is there a specific reason not having a matrix addition > function? > In the absence of such a function, I am thinking of extracting each > row of the two matrices and adding them. Would there be a more > efficient > way to do the same? > > Kindly help me with your advice. 
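For completeness, a short C sketch of the MatAXPY call Dominic points to (the helper name AddMatrices is made up; the argument order shown follows the 2.3.x manual page linked above and was different in older releases):

#include "petscmat.h"

/* Hypothetical helper: B <- B + A without touching individual rows. */
PetscErrorCode AddMatrices(Mat B, Mat A)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  /* DIFFERENT_NONZERO_PATTERN is the safe choice; SAME_NONZERO_PATTERN
     is faster but only valid when A and B share a nonzero pattern. */
  ierr = MatAXPY(B, 1.0, A, DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}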
> > Thanks, > Manav > From jianings at gmail.com Thu Feb 15 13:18:27 2007 From: jianings at gmail.com (Jianing Shi) Date: Thu, 15 Feb 2007 11:18:27 -0800 Subject: code design Message-ID: <63516a2e0702151118r7aeed903y40ebeab1be9a9170@mail.gmail.com> Hi Petsc masters, I have a question which is more about code design using Petsc. Suppose I need to implement a C++ library to provide an interface for users to set up the ODE and/or PDE systems, which I will solve on parallel computers using Petsc. Since Petsc has defined its own data type (in fact, a lot), PetscInt, PetscScalar, etc. I would like to link my own C++ library to the Petsc library. I imagine there are two solutions: 1) write an interface between my library and Petsc, i.e., between my own data structure (object-oriented) with the DA structure of Petsc. This requires translation between all the data type, for instance, int and PetscInt.... 2) use templated programming in my own library, so that when I link to the Petsc library, I can easily reuse my own code to set up the Right hand side, Jacobian and so on. Just wondering what is a good solution for an efficient and neat design? Thanks, Jianing From knepley at gmail.com Thu Feb 15 13:23:37 2007 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 15 Feb 2007 13:23:37 -0600 Subject: code design In-Reply-To: <63516a2e0702151118r7aeed903y40ebeab1be9a9170@mail.gmail.com> References: <63516a2e0702151118r7aeed903y40ebeab1be9a9170@mail.gmail.com> Message-ID: On 2/15/07, Jianing Shi wrote: > Hi Petsc masters, > > I have a question which is more about code design using Petsc. > > Suppose I need to implement a C++ library to provide an interface for > users to set up the ODE and/or PDE systems, which I will solve on > parallel computers using Petsc. Since Petsc has defined its own data > type (in fact, a lot), PetscInt, PetscScalar, etc. I would like to > link my own C++ library to the Petsc library. I imagine there are two > solutions: > > 1) write an interface between my library and Petsc, i.e., between my > own data structure (object-oriented) with the DA structure of Petsc. > This requires translation between all the data type, for instance, int > and PetscInt.... OO is really orthogonal to the introduction of new types. > 2) use templated programming in my own library, so that when I link to > the Petsc library, I can easily reuse my own code to set up the Right > hand side, Jacobian and so on. Yes, this is the correct way to handle .it. Matt > Just wondering what is a good solution for an efficient and neat design? > > Thanks, > > Jianing > > -- One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. 
-- Drummond Rennie From jianings at gmail.com Thu Feb 15 13:46:15 2007 From: jianings at gmail.com (Jianing Shi) Date: Thu, 15 Feb 2007 11:46:15 -0800 Subject: code design In-Reply-To: References: <63516a2e0702151118r7aeed903y40ebeab1be9a9170@mail.gmail.com> Message-ID: <63516a2e0702151146x5ac51efdw3547a42d46b497f6@mail.gmail.com> So follow up my puzzles for code design, do I have the same problem if I want to use SUNDIALS control script, and link it to Petsc library? I am trying to building up my own C++ library, handle it over to PVODE control script, and solve the underlying system using Petsc. Jianing > > I have a question which is more about code design using Petsc. > > > > Suppose I need to implement a C++ library to provide an interface for > > users to set up the ODE and/or PDE systems, which I will solve on > > parallel computers using Petsc. Since Petsc has defined its own data > > type (in fact, a lot), PetscInt, PetscScalar, etc. I would like to > > link my own C++ library to the Petsc library. I imagine there are two > > solutions: > > > > 1) write an interface between my library and Petsc, i.e., between my > > own data structure (object-oriented) with the DA structure of Petsc. > > This requires translation between all the data type, for instance, int > > and PetscInt.... > > OO is really orthogonal to the introduction of new types. > > > 2) use templated programming in my own library, so that when I link to > > the Petsc library, I can easily reuse my own code to set up the Right > > hand side, Jacobian and so on. > > Yes, this is the correct way to handle .it. > > Matt > > > Just wondering what is a good solution for an efficient and neat design? > > > > Thanks, > > > > Jianing From knepley at gmail.com Fri Feb 16 12:45:34 2007 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 16 Feb 2007 12:45:34 -0600 Subject: code design In-Reply-To: <63516a2e0702151146x5ac51efdw3547a42d46b497f6@mail.gmail.com> References: <63516a2e0702151118r7aeed903y40ebeab1be9a9170@mail.gmail.com> <63516a2e0702151146x5ac51efdw3547a42d46b497f6@mail.gmail.com> Message-ID: I don't really understand the question, I guess. Matt On 2/15/07, Jianing Shi wrote: > So follow up my puzzles for code design, do I have the same problem if > I want to use SUNDIALS control script, and link it to Petsc library? > > I am trying to building up my own C++ library, handle it over to PVODE > control script, and solve the underlying system using Petsc. > > Jianing > > > > I have a question which is more about code design using Petsc. > > > > > > Suppose I need to implement a C++ library to provide an interface for > > > users to set up the ODE and/or PDE systems, which I will solve on > > > parallel computers using Petsc. Since Petsc has defined its own data > > > type (in fact, a lot), PetscInt, PetscScalar, etc. I would like to > > > link my own C++ library to the Petsc library. I imagine there are two > > > solutions: > > > > > > 1) write an interface between my library and Petsc, i.e., between my > > > own data structure (object-oriented) with the DA structure of Petsc. > > > This requires translation between all the data type, for instance, int > > > and PetscInt.... > > > > OO is really orthogonal to the introduction of new types. > > > > > 2) use templated programming in my own library, so that when I link to > > > the Petsc library, I can easily reuse my own code to set up the Right > > > hand side, Jacobian and so on. > > > > Yes, this is the correct way to handle .it. 
> > > > Matt > > > > > Just wondering what is a good solution for an efficient and neat design? > > > > > > Thanks, > > > > > > Jianing > > -- One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. -- Drummond Rennie From hzhang at mcs.anl.gov Fri Feb 16 13:30:20 2007 From: hzhang at mcs.anl.gov (Hong Zhang) Date: Fri, 16 Feb 2007 13:30:20 -0600 (CST) Subject: code design In-Reply-To: References: <63516a2e0702151118r7aeed903y40ebeab1be9a9170@mail.gmail.com> <63516a2e0702151146x5ac51efdw3547a42d46b497f6@mail.gmail.com> Message-ID: Jianing, > > On 2/15/07, Jianing Shi wrote: > > So follow up my puzzles for code design, do I have the same problem if > > I want to use SUNDIALS control script, and link it to Petsc library? I don't know what is SUNDIALS control script. We do support CVODE through petsc-sundials interface. > > > > I am trying to building up my own C++ library, handle it over to PVODE > > control script, and solve the underlying system using Petsc. We have intention to use CVODE's multi-time-step control, and use Petsc solving linear and non-linear systems at each time step. Is this what you need? Current interface uses SUNDAILS solvers. The interface is implemented in ~petsc/src/ts/impls/implicit/sundials/sundials.c Hong > > > > Jianing > > > > > > I have a question which is more about code design using Petsc. > > > > > > > > Suppose I need to implement a C++ library to provide an interface for > > > > users to set up the ODE and/or PDE systems, which I will solve on > > > > parallel computers using Petsc. Since Petsc has defined its own data > > > > type (in fact, a lot), PetscInt, PetscScalar, etc. I would like to > > > > link my own C++ library to the Petsc library. I imagine there are two > > > > solutions: > > > > > > > > 1) write an interface between my library and Petsc, i.e., between my > > > > own data structure (object-oriented) with the DA structure of Petsc. > > > > This requires translation between all the data type, for instance, int > > > > and PetscInt.... > > > > > > OO is really orthogonal to the introduction of new types. > > > > > > > 2) use templated programming in my own library, so that when I link to > > > > the Petsc library, I can easily reuse my own code to set up the Right > > > > hand side, Jacobian and so on. > > > > > > Yes, this is the correct way to handle .it. > > > > > > Matt > > > > > > > Just wondering what is a good solution for an efficient and neat design? > > > > > > > > Thanks, > > > > > > > > Jianing > > > > > > > -- > One trouble is that despite this system, anyone who reads journals widely > and critically is forced to realize that there are scarcely any bars to eventual > publication. 
There seems to be no study too fragmented, no hypothesis too > trivial, no literature citation too biased or too egotistical, no design too > warped, no methodology too bungled, no presentation of results too > inaccurate, too obscure, and too contradictory, no analysis too self-serving, > no argument too circular, no conclusions too trifling or too unjustified, and > no grammar and syntax too offensive for a paper to end up in print. -- > Drummond Rennie > > From jinzishuai at yahoo.com Fri Feb 16 14:18:22 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 16 Feb 2007 12:18:22 -0800 (PST) Subject: Problem creating a non-square MPIAIJ Matrix Message-ID: <581824.83974.qm@web36201.mail.mud.yahoo.com> Hi there, I am found a very mysterious problem of creating a MxN matrix when M>N. Please take a look at the attached short test code I wrote to demonstrate the problem. Basically, I want to create a 8x6 matrix. A={1 2 3 4 0 0; 0 0 0 0 0 0; 0 0 0 0 0 0; 0 0 0 0 0 0; 0 0 0 0 0 0; 0 0 0 0 0 0; 0 0 0 0 0 0; 0 0 0 0 0 0 If I do it with 2 processes, I suppose the local submatrices should look like p0: 1 2 3 4 0 0; 0 0 0 0 0 0; 0 0 0 0 0 0; 0 0 0 0 0 0; p1: 0 0 0 0 0 0; 0 0 0 0 0 0; 0 0 0 0 0 0; 0 0 0 0 0 0 The problem is that on the first process, the diagonal portion of the local submatrix is 1 2 3 4 | 0 0; 0 0 0 0 | 0 0; 0 0 0 0 | 0 0; 0 0 0 0 | 0 0; So I need to set d_nnz[0]=4 on p0 since the first row has 4 diagonal nonzero entries. However, when I run the code by mpiexec -n 2 ./mpiaij I got error saying that [0]PETSC ERROR: --------------------- Error Message ------------------------------------ [0]PETSC ERROR: Argument out of range! [0]PETSC ERROR: nnz cannot be greater than row length: local row 0 value 4 rowlength 3! It seems that it is checking against the n parameter which is set by petsc to be n=N/2=3. But why should we do that? According the manual, the local submatrix is of dimension m by N. Could you please help me understand the problem? Thank you very much. Shi ____________________________________________________________________________________ TV dinner still cooling? Check out "Tonight's Picks" on Yahoo! TV. http://tv.yahoo.com/ -------------- next part -------------- A non-text attachment was scrubbed... Name: mpiaij.c Type: text/x-csrc Size: 977 bytes Desc: 3194231359-mpiaij.c URL: From balay at mcs.anl.gov Fri Feb 16 14:33:11 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Fri, 16 Feb 2007 14:33:11 -0600 (CST) Subject: Problem creating a non-square MPIAIJ Matrix In-Reply-To: <581824.83974.qm@web36201.mail.mud.yahoo.com> References: <581824.83974.qm@web36201.mail.mud.yahoo.com> Message-ID: > MatCreateMPIAIJ(PETSC_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE, > 8,6,PETSC_DEFAULT,d_nnz ,PETSC_DEFAULT, o_nnz, &A); Here you are asking PETSc to decide tine local 'm,n' partition sizes. So you should use MatGetLocalSize() to get these values [and not assume they are sequare blocks] If you need to have the diagonal blocks square - then specify them appropriately at the matrix creation time [instead of using PETSC_DECIDE - for m,n parameters]. However - note that you'll have to create Vectors with matching parallel layout [when using in MatVec()] i.e in y = Ax x should match column layout of A y should match row layout of A Satish On Fri, 16 Feb 2007, Shi Jin wrote: > Hi there, > > I am found a very mysterious problem of creating a MxN > matrix when M>N. > Please take a look at the attached short test code I > wrote to demonstrate the problem. > > Basically, I want to create a 8x6 matrix. 
> A={1 2 3 4 0 0; > 0 0 0 0 0 0; > 0 0 0 0 0 0; > 0 0 0 0 0 0; > 0 0 0 0 0 0; > 0 0 0 0 0 0; > 0 0 0 0 0 0; > 0 0 0 0 0 0 > > If I do it with 2 processes, I suppose the local > submatrices should look like > p0: > 1 2 3 4 0 0; > 0 0 0 0 0 0; > 0 0 0 0 0 0; > 0 0 0 0 0 0; > p1: > 0 0 0 0 0 0; > 0 0 0 0 0 0; > 0 0 0 0 0 0; > 0 0 0 0 0 0 > The problem is that on the first process, the diagonal > portion of the local submatrix is > 1 2 3 4 | 0 0; > 0 0 0 0 | 0 0; > 0 0 0 0 | 0 0; > 0 0 0 0 | 0 0; > So I need to set d_nnz[0]=4 on p0 since the first row > has 4 diagonal nonzero entries. However, when I run > the code by > mpiexec -n 2 ./mpiaij > I got error saying that > [0]PETSC ERROR: --------------------- Error Message > ------------------------------------ > [0]PETSC ERROR: Argument out of range! > [0]PETSC ERROR: nnz cannot be greater than row length: > local row 0 value 4 rowlength 3! > > It seems that it is checking against the n parameter > which is set by petsc to be n=N/2=3. > But why should we do that? According the manual, the > local submatrix is of dimension m by N. > > Could you please help me understand the problem? > Thank you very much. > Shi > > > > ____________________________________________________________________________________ > TV dinner still cooling? > Check out "Tonight's Picks" on Yahoo! TV. > http://tv.yahoo.com/ From jinzishuai at yahoo.com Fri Feb 16 14:47:00 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 16 Feb 2007 12:47:00 -0800 (PST) Subject: Problem creating a non-square MPIAIJ Matrix In-Reply-To: Message-ID: <20070216204700.28042.qmail@web36211.mail.mud.yahoo.com> I actually used MatGetLocalSize(A,&m,&n) in the code. They give me m=4,n=3, as expected. I can also specify m=4,n=3 in MatCreateMPIAIJ() which is exactly identical to the previous code. If I specify anything else, I get error saying that they don't agree with the global sizes. But I don't understand why we need to make n=N/2? Are we storing the whole rows of the matrix? Just like the mannual says, the local submatrix is of size m*N. Shi --- Satish Balay wrote: > > > > MatCreateMPIAIJ(PETSC_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE, > > 8,6,PETSC_DEFAULT,d_nnz ,PETSC_DEFAULT, > o_nnz, &A); > > Here you are asking PETSc to decide tine local 'm,n' > partition sizes. > > So you should use MatGetLocalSize() to get these > values [and not > assume they are sequare blocks] > > If you need to have the diagonal blocks square - > then specify them > appropriately at the matrix creation time [instead > of using > PETSC_DECIDE - for m,n parameters]. > > However - note that you'll have to create Vectors > with matching > parallel layout [when using in MatVec()] > > i.e in y = Ax > > x should match column layout of A > y should match row layout of A > > Satish > > On Fri, 16 Feb 2007, Shi Jin wrote: > > > Hi there, > > > > I am found a very mysterious problem of creating a > MxN > > matrix when M>N. > > Please take a look at the attached short test code > I > > wrote to demonstrate the problem. > > > > Basically, I want to create a 8x6 matrix. 
> > A={1 2 3 4 0 0; > > 0 0 0 0 0 0; > > 0 0 0 0 0 0; > > 0 0 0 0 0 0; > > 0 0 0 0 0 0; > > 0 0 0 0 0 0; > > 0 0 0 0 0 0; > > 0 0 0 0 0 0 > > > > If I do it with 2 processes, I suppose the local > > submatrices should look like > > p0: > > 1 2 3 4 0 0; > > 0 0 0 0 0 0; > > 0 0 0 0 0 0; > > 0 0 0 0 0 0; > > p1: > > 0 0 0 0 0 0; > > 0 0 0 0 0 0; > > 0 0 0 0 0 0; > > 0 0 0 0 0 0 > > The problem is that on the first process, the > diagonal > > portion of the local submatrix is > > 1 2 3 4 | 0 0; > > 0 0 0 0 | 0 0; > > 0 0 0 0 | 0 0; > > 0 0 0 0 | 0 0; > > So I need to set d_nnz[0]=4 on p0 since the first > row > > has 4 diagonal nonzero entries. However, when I > run > > the code by > > mpiexec -n 2 ./mpiaij > > I got error saying that > > [0]PETSC ERROR: --------------------- Error > Message > > ------------------------------------ > > [0]PETSC ERROR: Argument out of range! > > [0]PETSC ERROR: nnz cannot be greater than row > length: > > local row 0 value 4 rowlength 3! > > > > It seems that it is checking against the n > parameter > > which is set by petsc to be n=N/2=3. > > But why should we do that? According the manual, > the > > local submatrix is of dimension m by N. > > > > Could you please help me understand the problem? > > Thank you very much. > > Shi > > > > > > > > > ____________________________________________________________________________________ > > TV dinner still cooling? > > Check out "Tonight's Picks" on Yahoo! TV. > > http://tv.yahoo.com/ > > ____________________________________________________________________________________ Have a burning question? Go to www.Answers.yahoo.com and get answers from real people who know. From balay at mcs.anl.gov Fri Feb 16 15:03:56 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Fri, 16 Feb 2007 15:03:56 -0600 (CST) Subject: Problem creating a non-square MPIAIJ Matrix In-Reply-To: <20070216204700.28042.qmail@web36211.mail.mud.yahoo.com> References: <20070216204700.28042.qmail@web36211.mail.mud.yahoo.com> Message-ID: On Fri, 16 Feb 2007, Shi Jin wrote: > I actually used MatGetLocalSize(A,&m,&n) in the code. They give me > m=4,n=3, as expected. I can also specify m=4,n=3 in > MatCreateMPIAIJ() which is exactly identical to the previous code. > If I specify anything else, I get error saying that they don't agree > with the global sizes. What did you specify? Notice that in your partition scheme (m,n) have different values on each proc) (M,N = 8,6) (m0,n0 = 4,4) (m1,n1 = 4,2) 4 2 1 2 3 4 | 0 0 0 0 0 0 | 0 0 4 0 0 0 0 | 0 0 0 0 0 0 | 0 0 ------------- 0 0 0 0 | 0 0 4 0 0 0 0 | 0 0 0 0 0 0 | 0 0 0 0 0 0 | 0 0 Howeve note that you get a square diagonal block on proc-0 [4x4] but not on proc-1 [4x2] . Its probably best to use the default PETSc partitioning scheme then this alternative one. > But I don't understand why we need to make n=N/2? This is the default partitioning scheme - when you specify PETSC_DECIDE for m,n. Here we choose to divide things as evenly as possible. > Are we storing the whole rows of the matrix? Just like > the mannual says, the local submatrix is of size m*N. We are stoing the diagonal block and offdiagonal block separately. However both blocks are on the same processor. i.e each processor stores m*N values - in 2 submatrices m*n, m*(N-n). To understand this better - check manpage for MatCreateMPIAIJ(). 
Satish From jinzishuai at yahoo.com Fri Feb 16 15:34:22 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 16 Feb 2007 13:34:22 -0800 (PST) Subject: Problem creating a non-square MPIAIJ Matrix In-Reply-To: Message-ID: <152812.69083.qm@web36203.mail.mud.yahoo.com> > We are stoing the diagonal block and offdiagonal > block > separately. However both blocks are on the same > processor. i.e each > processor stores m*N values - in 2 submatrices m*n, > m*(N-n). To > understand this better - check manpage for > MatCreateMPIAIJ(). Thanks. But this is completely different from what I read from the PETSC mannual. http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatCreateMPIAIJ.html Here is says: "The DIAGONAL portion of the local submatrix of a processor can be defined as the submatrix which is obtained by extraction the part corresponding to the rows r1-r2 and columns r1-r2 of the global matrix, where r1 is the first row that belongs to the processor, and r2 is the last row belonging to the this processor. This is a square mxm matrix. The remaining portion of the local submatrix (mxN) constitute the OFF-DIAGONAL portion." So the two matrices are mxm and mx(N-m) instead of what you said: mxn and mx(N-n) However, the code seems to act like what you described. They are equivalent for square matrices but not the same for non-square matrices like the one I showed. Could you please clarify whether the manual is accurate or not? Thank you very much. Shi ____________________________________________________________________________________ Sucker-punch spam with award-winning protection. Try the free Yahoo! Mail Beta. http://advision.webevents.yahoo.com/mailbeta/features_spam.html From balay at mcs.anl.gov Fri Feb 16 15:55:24 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Fri, 16 Feb 2007 15:55:24 -0600 (CST) Subject: Problem creating a non-square MPIAIJ Matrix In-Reply-To: <152812.69083.qm@web36203.mail.mud.yahoo.com> References: <152812.69083.qm@web36203.mail.mud.yahoo.com> Message-ID: On Fri, 16 Feb 2007, Shi Jin wrote: > > We are stoing the diagonal block and offdiagonal block > > separately. However both blocks are on the same processor. i.e > > each processor stores m*N values - in 2 submatrices m*n, > > m*(N-n). To understand this better - check manpage for > > MatCreateMPIAIJ(). > Thanks. But this is completely different from what I > read from the PETSC mannual. > http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatCreateMPIAIJ.html > Here is says: > "The DIAGONAL portion of the local submatrix of a > processor can be defined as the submatrix which is > obtained by extraction the part corresponding to the > rows r1-r2 and columns r1-r2 of the global matrix, > where r1 is the first row that belongs to the > processor, and r2 is the last row belonging to the > this processor. This is a square mxm matrix. The > remaining portion of the local submatrix (mxN) > constitute the OFF-DIAGONAL portion." This text was proably writen assuming that the initial matrix was a square MxM matrix [in which case the diagonal blocks are also square]. This text should be corrected to reflect the 'rectangular' matrix case as well. > So the two matrices are mxm and mx(N-m) instead of what you said: > mxn and mx(N-n) However, the code seems to act like what you > described. This interpreatation of the partitionling is not possible. You are assuming the following partitioning - which PETSc doesn't support. 
1 2 3 4 | 0 0 0 0 0 0 | 0 0 0 0 0 0 | 0 0 0 0 0 0 | 0 0 ------------- 0 0 | 0 0 0 0 0 0 | 0 0 0 0 0 0 | 0 0 0 0 0 0 | 0 0 0 0 Note: the primary purpose of storing diagonal & offdiagonal blocks is to separate comutation that requires messages from compuattion that does not - in a MatVec. i.e We the current petsc partitioning - the diagonal block can be processed without any communication. [with a matching vec partitioning - as mentioned in an earlier e-mail] The above scheme - with different column partitioning on each node - doesn't help with this [and removes the primary purpose for storing the matrix blocks separately] Satish > They are equivalent for square matrices but not the same for > non-square matrices like the one I showed. Could you please clarify > whether the manual is accurate or not? From jianings at gmail.com Fri Feb 16 15:57:18 2007 From: jianings at gmail.com (Jianing Shi) Date: Fri, 16 Feb 2007 15:57:18 -0600 Subject: code design In-Reply-To: References: <63516a2e0702151118r7aeed903y40ebeab1be9a9170@mail.gmail.com> <63516a2e0702151146x5ac51efdw3547a42d46b497f6@mail.gmail.com> Message-ID: <63516a2e0702161357n5e982c0cu2d1ea91619c589df@mail.gmail.com> > We have intention to use CVODE's multi-time-step control, and use Petsc > solving linear and non-linear systems at each time step. > Is this what you need? Yes, that is exactly what I mean. > Current interface uses SUNDAILS solvers. > > The interface is implemented in > ~petsc/src/ts/impls/implicit/sundials/sundials.c Thanks, I will look into that. Jianing From jinzishuai at yahoo.com Fri Feb 16 16:07:34 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 16 Feb 2007 14:07:34 -0800 (PST) Subject: Problem creating a non-square MPIAIJ Matrix In-Reply-To: Message-ID: <155636.80641.qm@web36203.mail.mud.yahoo.com> Thanks. Now I know I followed the documentation that is mistaken. I will change the code according to your description. Thank you very much. Shi --- Satish Balay wrote: > On Fri, 16 Feb 2007, Shi Jin wrote: > > > > We are stoing the diagonal block and offdiagonal > block > > > separately. However both blocks are on the same > processor. i.e > > > each processor stores m*N values - in 2 > submatrices m*n, > > > m*(N-n). To understand this better - check > manpage for > > > MatCreateMPIAIJ(). > > > Thanks. But this is completely different from what > I > > read from the PETSC mannual. > > > http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatCreateMPIAIJ.html > > Here is says: > > "The DIAGONAL portion of the local submatrix of a > > processor can be defined as the submatrix which is > > obtained by extraction the part corresponding to > the > > rows r1-r2 and columns r1-r2 of the global matrix, > > where r1 is the first row that belongs to the > > processor, and r2 is the last row belonging to the > > this processor. This is a square mxm matrix. The > > remaining portion of the local submatrix (mxN) > > constitute the OFF-DIAGONAL portion." > > This text was proably writen assuming that the > initial matrix was a > square MxM matrix [in which case the diagonal blocks > are also > square]. This text should be corrected to reflect > the 'rectangular' > matrix case as well. > > > So the two matrices are mxm and mx(N-m) instead > of what you said: > > mxn and mx(N-n) However, the code seems to act > like what you > > described. > > This interpreatation of the partitionling is not > possible. You are > assuming the following partitioning - which PETSc > doesn't support. 
> > > 1 2 3 4 | 0 0 > 0 0 0 0 | 0 0 > 0 0 0 0 | 0 0 > 0 0 0 0 | 0 0 > ------------- > 0 0 | 0 0 0 0 > 0 0 | 0 0 0 0 > 0 0 | 0 0 0 0 > 0 0 | 0 0 0 0 > > Note: the primary purpose of storing diagonal & > offdiagonal blocks is > to separate comutation that requires messages from > compuattion that > does not - in a MatVec. > > i.e We the current petsc partitioning - the diagonal > block can be > processed without any communication. [with a > matching vec partitioning > - as mentioned in an earlier e-mail] > > The above scheme - with different column > partitioning on each node - > doesn't help with this [and removes the primary > purpose for storing > the matrix blocks separately] > > Satish > > > They are equivalent for square matrices but not > the same for > > non-square matrices like the one I showed. Could > you please clarify > > whether the manual is accurate or not? > > ____________________________________________________________________________________ Don't pick lemons. See all the new 2007 cars at Yahoo! Autos. http://autos.yahoo.com/new_cars.html From svm at cfdrc.com Fri Feb 16 17:44:25 2007 From: svm at cfdrc.com (Saikrishna V. Marella) Date: Fri, 16 Feb 2007 17:44:25 -0600 Subject: MatSetValuesBlocked In-Reply-To: <155636.80641.qm@web36203.mail.mud.yahoo.com> References: <155636.80641.qm@web36203.mail.mud.yahoo.com> Message-ID: <000301c75224$642e3a80$10fda8c0@svmwin64> Hey guys, How are the block matrices assembled when using MatSetValuesBlocked(mat,m,idxm[],n,idxn[],v[],addv). Suppose m=n=2 and block size(bs) = 2 The matrix I am trying to assemble is 1 2 | 3 4 5 6 | 7 8 - - - | - - - 9 10 | 11 12 13 14 | 15 16 The manual says, v[] should be row-oriented. For Block matrices what does that mean? Should the array v[] being passed in look like a) v[] = [1,2,5,6,3,4,7,8,9,10,13,14,11,12,15,16] OR b) v[] = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] Thanks. Sai Marella. _____________________________________________________ Saikrishna Marella, (PhD), Project Engineer CFD Research Corp. 215 Wynn Dr. Huntsville AL 35805 Tel: 256-726-4954,(4800), Fax(4806) , svm at cfdrc.com Home Page: http://www.cfdrc.com -------------- next part -------------- A non-text attachment was scrubbed... Name: Saikrishna(Sai) Marella (svm at cfdrc.com).vcf Type: text/x-vcard Size: 420 bytes Desc: not available URL: From bsmith at mcs.anl.gov Fri Feb 16 20:38:13 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 16 Feb 2007 20:38:13 -0600 (CST) Subject: MatSetValuesBlocked In-Reply-To: <000301c75224$642e3a80$10fda8c0@svmwin64> References: <155636.80641.qm@web36203.mail.mud.yahoo.com> <000301c75224$642e3a80$10fda8c0@svmwin64> Message-ID: It is b) with the column oriented option it would be 1 5 9 13 2 6 .... Barry I will add your example to the manual page to make this absolutly clear for future users. On Fri, 16 Feb 2007, Saikrishna V. Marella wrote: > Hey guys, > > How are the block matrices assembled when using > MatSetValuesBlocked(mat,m,idxm[],n,idxn[],v[],addv). > > Suppose m=n=2 and block size(bs) = 2 The matrix I am trying to assemble is > > 1 2 | 3 4 > 5 6 | 7 8 > - - - | - - - > 9 10 | 11 12 > 13 14 | 15 16 > > The manual says, v[] should be row-oriented. For Block matrices what does > that mean? > > Should the array v[] being passed in look like > > a) v[] = [1,2,5,6,3,4,7,8,9,10,13,14,11,12,15,16] > > OR > > b) v[] = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] > > Thanks. > Sai Marella. 
> _____________________________________________________ > Saikrishna Marella, (PhD), Project Engineer > CFD Research Corp. 215 Wynn Dr. Huntsville AL 35805 > Tel: 256-726-4954,(4800), Fax(4806) , svm at cfdrc.com > Home Page: http://www.cfdrc.com > From manav at u.washington.edu Sun Feb 18 07:08:14 2007 From: manav at u.washington.edu (Manav Bhatia) Date: Sun, 18 Feb 2007 05:08:14 -0800 Subject: fill for matrix multiplication Message-ID: Hi, I am performing a matrix multiplication of two dense matrices with both MatMatMult and MatMatMultTranspose. What do I choose a fill factor as? According to the definition in the documentation: fiill = expected fill as ratio of nnz(C)/(nnz(A) + nnz(B)). So, If I have two full matrices A and B, I will get a full matrix as my result. Hence, the fill factor will be 0.5. However, both these methods give an error with a fill factor less than 1.0. Also, if I use PETSC_DFAULT as the fill argument, it agains results in an error since its value is -2. Kindly help me with your advice here. Thanks Manav -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at mcs.anl.gov Sun Feb 18 12:34:47 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Sun, 18 Feb 2007 12:34:47 -0600 (CST) Subject: fill for matrix multiplication In-Reply-To: References: Message-ID: Use 1.0 for dense matrices; it is ignored since dense matrices are always dense. Barry On Sun, 18 Feb 2007, Manav Bhatia wrote: > Hi, > > I am performing a matrix multiplication of two dense matrices with both > MatMatMult and MatMatMultTranspose. > What do I choose a fill factor as? According to the definition in the > documentation: fiill = expected fill as ratio of nnz(C)/(nnz(A) + nnz(B)). > So, If I have two full matrices A and B, I will get a full matrix as my > result. Hence, the fill factor will be 0.5. However, both these methods give > an error with a fill factor less than 1.0. Also, if I use PETSC_DFAULT as the > fill argument, it agains results in an error since its value is -2. > > Kindly help me with your advice here. > > Thanks > Manav From hzhang at mcs.anl.gov Sun Feb 18 14:12:25 2007 From: hzhang at mcs.anl.gov (Hong Zhang) Date: Sun, 18 Feb 2007 14:12:25 -0600 (CST) Subject: fill for matrix multiplication In-Reply-To: References: Message-ID: Manav, I've enabled fill=PETSC_DFAULT for MatMatMultSymbolic(). We recommend using MatMatMult() instead of MatMatMultSymbolic() and MatMatMultNumeric(). Thanks for reporting the problem, Hong On Sun, 18 Feb 2007, Barry Smith wrote: > > Use 1.0 for dense matrices; it is ignored since dense matrices > are always dense. > > Barry > > > On Sun, 18 Feb 2007, Manav Bhatia wrote: > > > Hi, > > > > I am performing a matrix multiplication of two dense matrices with both > > MatMatMult and MatMatMultTranspose. > > What do I choose a fill factor as? According to the definition in the > > documentation: fiill = expected fill as ratio of nnz(C)/(nnz(A) + nnz(B)). > > So, If I have two full matrices A and B, I will get a full matrix as my > > result. Hence, the fill factor will be 0.5. However, both these methods give > > an error with a fill factor less than 1.0. Also, if I use PETSC_DFAULT as the > > fill argument, it agains results in an error since its value is -2. > > > > Kindly help me with your advice here. 
> > > > Thanks > > Manav > > From jens.madsen at risoe.dk Mon Feb 19 07:34:27 2007 From: jens.madsen at risoe.dk (jens.madsen at risoe.dk) Date: Mon, 19 Feb 2007 14:34:27 +0100 Subject: Can I please come off the mailing list for a while References: Message-ID: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/ms-tnef Size: 2820 bytes Desc: not available URL: From manav at u.washington.edu Tue Feb 20 17:38:42 2007 From: manav at u.washington.edu (Manav Bhatia) Date: Tue, 20 Feb 2007 15:38:42 -0800 Subject: TS Message-ID: <7A0AE5E2-4EB8-493E-AF41-14F3CCE3152A@u.washington.edu> Hi I am preparing my code to use the TS capability of Petsc, and I had a few doubts to clear up. These primarily relate to the set up of the problem, and I have stated them below. Please correct me if I am wrong. -- For a linear transient problem, I understand that the following different combinations are possible: 1> A(t) U_t = f (t) where I will have to call the setLHSMatrix() and setRHSFunction() functions during set up. 2> A(t) U_t = B(t) U where I will have to call the setLHSMatrix() and setRHSMatrix() functions during set up. 3> U_t = f(t) where I call only the setRHSfunction() 4> U_t = A(t) U where I call only the setRHSMatrix() -- For a nonlinear transient problem, I understand that the following different combinations are possible: 1> A(t) U_t = f (U, t) where I will have to call the setLHSMatrix() and setRHSFunction(), and setRHSJacobian() functions during set up. 2> U_t = f(U, t) where I will have to call the setRHSfunction() and setRHSJacobian() functions during set up. -- setting KSP and PC types From what I understand, I can set up the KSP and PC type of the transient solver, which will be used only if I specify an A matrix for the problem. In addition, I can independently set the KSP and PC type of the SNES used by TS, which is used by the time solver. Kindly help me with your advice here. Thanks, Manav From knepley at gmail.com Tue Feb 20 17:47:54 2007 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 20 Feb 2007 17:47:54 -0600 Subject: TS In-Reply-To: <7A0AE5E2-4EB8-493E-AF41-14F3CCE3152A@u.washington.edu> References: <7A0AE5E2-4EB8-493E-AF41-14F3CCE3152A@u.washington.edu> Message-ID: On 2/20/07, Manav Bhatia wrote: > Hi > > I am preparing my code to use the TS capability of Petsc, and I > had a few doubts to clear up. These primarily relate to the set up of > the problem, and I have stated them below. Please correct me if I am > wrong. > > -- For a linear transient problem, I understand that the following > different combinations are possible: > 1> A(t) U_t = f (t) > where I will have to call the setLHSMatrix() and setRHSFunction() > functions during set up. > 2> A(t) U_t = B(t) U > where I will have to call the setLHSMatrix() and setRHSMatrix() > functions during set up. > 3> U_t = f(t) > where I call only the setRHSfunction() > 4> U_t = A(t) U > where I call only the setRHSMatrix() > > > -- For a nonlinear transient problem, I understand that the following > different combinations are possible: > 1> A(t) U_t = f (U, t) > where I will have to call the setLHSMatrix() and setRHSFunction(), > and setRHSJacobian() functions during set up. > 2> U_t = f(U, t) > where I will have to call the setRHSfunction() and setRHSJacobian() > functions during set up. This sounds correct. You do not have to specify the Jacobian, as we can automatically give a FD approximation, but it is better to do so. 
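For concreteness, a minimal self-contained sketch of the nonlinear case 2> above, U_t = f(U,t), is below for the toy problem du/dt = -u. It is not taken from this thread: the function names and the toy right-hand side are made up, and the calling sequences (TSSetRHSFunction, TSSetRHSJacobian, TSSetDuration, TSStep) follow the 2.3.x-era TS interface, so check them against the manual pages of the version you actually run.

    #include "petscts.h"

    /* f(U,t) = -U */
    PetscErrorCode RHSFunction(TS ts,PetscReal t,Vec u,Vec f,void *ctx)
    {
      PetscErrorCode ierr;
      ierr = VecCopy(u,f);CHKERRQ(ierr);
      ierr = VecScale(f,-1.0);CHKERRQ(ierr);
      return 0;
    }

    /* Jacobian of f with respect to U: J = -I */
    PetscErrorCode RHSJacobian(TS ts,PetscReal t,Vec u,Mat *J,Mat *B,MatStructure *flag,void *ctx)
    {
      PetscErrorCode ierr;
      PetscInt       i,rstart,rend;
      PetscScalar    v = -1.0;
      ierr = MatGetOwnershipRange(*J,&rstart,&rend);CHKERRQ(ierr);
      for (i=rstart; i<rend; i++) {
        ierr = MatSetValues(*J,1,&i,1,&i,&v,INSERT_VALUES);CHKERRQ(ierr);
      }
      ierr = MatAssemblyBegin(*J,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(*J,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      *flag = SAME_NONZERO_PATTERN;
      return 0;
    }

    int main(int argc,char **argv)
    {
      TS             ts;
      Vec            u;
      Mat            J;
      PetscInt       n = 10,steps;
      PetscReal      ftime;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc,&argv,PETSC_NULL,PETSC_NULL);CHKERRQ(ierr);
      ierr = VecCreate(PETSC_COMM_WORLD,&u);CHKERRQ(ierr);
      ierr = VecSetSizes(u,PETSC_DECIDE,n);CHKERRQ(ierr);
      ierr = VecSetFromOptions(u);CHKERRQ(ierr);
      ierr = VecSet(u,1.0);CHKERRQ(ierr);                 /* initial condition */
      ierr = MatCreate(PETSC_COMM_WORLD,&J);CHKERRQ(ierr);
      ierr = MatSetSizes(J,PETSC_DECIDE,PETSC_DECIDE,n,n);CHKERRQ(ierr);
      ierr = MatSetFromOptions(J);CHKERRQ(ierr);

      ierr = TSCreate(PETSC_COMM_WORLD,&ts);CHKERRQ(ierr);
      ierr = TSSetProblemType(ts,TS_NONLINEAR);CHKERRQ(ierr);
      ierr = TSSetRHSFunction(ts,RHSFunction,PETSC_NULL);CHKERRQ(ierr);
      ierr = TSSetRHSJacobian(ts,J,J,RHSJacobian,PETSC_NULL);CHKERRQ(ierr);
      ierr = TSSetType(ts,TSBEULER);CHKERRQ(ierr);        /* implicit, so SNES/KSP get used */
      ierr = TSSetInitialTimeStep(ts,0.0,0.01);CHKERRQ(ierr);
      ierr = TSSetDuration(ts,100,1.0);CHKERRQ(ierr);
      ierr = TSSetSolution(ts,u);CHKERRQ(ierr);
      ierr = TSSetFromOptions(ts);CHKERRQ(ierr);
      ierr = TSStep(ts,&steps,&ftime);CHKERRQ(ierr);

      ierr = TSDestroy(ts);CHKERRQ(ierr);
      ierr = MatDestroy(J);CHKERRQ(ierr);
      ierr = VecDestroy(u);CHKERRQ(ierr);
      ierr = PetscFinalize();CHKERRQ(ierr);
      return 0;
    }

For the linear cases 3> and 4> the same skeleton applies with TS_LINEAR and TSSetRHSMatrix() in place of the function/Jacobian pair.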
> > -- setting KSP and PC types > From what I understand, I can set up the KSP and PC type of the > transient solver, which will be used only if I specify an A matrix > for the problem. In addition, I can independently set the KSP and PC > type of the SNES used by TS, which is used by the time solver. The solver is only used if you specify an implicit method. The KSP and PC type are used by either a SNES or just the KSP itself depending on whether the problem is nonlinear. Matt > > Kindly help me with your advice here. > > Thanks, > Manav > > -- One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. -- Drummond Rennie From manav at u.washington.edu Tue Feb 20 23:00:31 2007 From: manav at u.washington.edu (Manav Bhatia) Date: Tue, 20 Feb 2007 21:00:31 -0800 Subject: TS In-Reply-To: References: <7A0AE5E2-4EB8-493E-AF41-14F3CCE3152A@u.washington.edu> Message-ID: <29AB5E28-BBDB-47A0-AA8E-52D4946F41E4@u.washington.edu> > >> >> -- setting KSP and PC types >> From what I understand, I can set up the KSP and PC type of the >> transient solver, which will be used only if I specify an A matrix >> for the problem. In addition, I can independently set the KSP and PC >> type of the SNES used by TS, which is used by the time solver. > > The solver is only used if you specify an implicit method. The KSP > and PC > type are used by either a SNES or just the KSP itself depending on > whether > the problem is nonlinear. > So, if I have a problem with a LHS matrix, and I want to use an explicit method, then do I have to invert the matrix before asking the solver to run? i.e. the solver will not do that for me? Thanks, Manav From hzhang at mcs.anl.gov Tue Feb 20 23:16:04 2007 From: hzhang at mcs.anl.gov (Hong Zhang) Date: Tue, 20 Feb 2007 23:16:04 -0600 (CST) Subject: TS In-Reply-To: <29AB5E28-BBDB-47A0-AA8E-52D4946F41E4@u.washington.edu> References: <7A0AE5E2-4EB8-493E-AF41-14F3CCE3152A@u.washington.edu> <29AB5E28-BBDB-47A0-AA8E-52D4946F41E4@u.washington.edu> Message-ID: On Tue, 20 Feb 2007, Manav Bhatia wrote: > > > > So, if I have a problem with a LHS matrix, and I want to use an > explicit method, then do I have to invert the matrix before asking > the solver to run? i.e. the solver will not do that for me? The LHS matrix formulation is not implemented for the explicit method. Do you have such application? Hong From manav at u.washington.edu Wed Feb 21 00:28:48 2007 From: manav at u.washington.edu (Manav Bhatia) Date: Tue, 20 Feb 2007 22:28:48 -0800 Subject: TS In-Reply-To: References: <7A0AE5E2-4EB8-493E-AF41-14F3CCE3152A@u.washington.edu> <29AB5E28-BBDB-47A0-AA8E-52D4946F41E4@u.washington.edu> Message-ID: On Feb 20, 2007, at 9:16 PM, Hong Zhang wrote: > > > On Tue, 20 Feb 2007, Manav Bhatia wrote: > >>> >> >> So, if I have a problem with a LHS matrix, and I want to use an >> explicit method, then do I have to invert the matrix before asking >> the solver to run? i.e. the solver will not do that for me? 
> > The LHS matrix formulation is not implemented for the explicit method. > Do you have such application? Well, I am working with a conduction heat transfer finite element model, which has the following equation set: [C(t,{T})] d{T}/dt = {F(t,{T})} - [K(t,{T})] {T} with initial conditions {T(0)} = {T0} So, I have a problem which has a LHS matrix, which is also dependent on the primary variable (which is temperature {T}). In the simple case, ofcourse, we can neglect this dependence on temperature (for [C]), but that has limited applicability for my problem, since the [C] matrix has non-negligible nonlinearities. So, I am looking for ways to formulate my problem to use the Petsc solvers. The best option that I can think of is to restate the problem as: d{T}/dt = [C(t,{T})]^(-1) ({F(t,{T})} - [K(t,{T})] {T}) where I can now specify the RHS function and its jacobian (I will provide the jacobian, so no need to use finite differencing), and use an explicit / implicit solver. However, if I assume a linear problem, then I am left with a case of [C] d{T}/dt = {F(t)} - [K]{T} Here, I could either restate the problem in the same way as I did above, or I could specify a LHS matrix (in this case [C]), and ask the solver to handle it. But, from our previous email exchanges, it seems like I will have to use an implicit solver for the same, since an explicit solver will not handle a LHS matrix. Kindly correct me if I am wrong. Thanks, Manav From hzhang at mcs.anl.gov Wed Feb 21 09:24:55 2007 From: hzhang at mcs.anl.gov (Hong Zhang) Date: Wed, 21 Feb 2007 09:24:55 -0600 (CST) Subject: TS In-Reply-To: References: <7A0AE5E2-4EB8-493E-AF41-14F3CCE3152A@u.washington.edu> <29AB5E28-BBDB-47A0-AA8E-52D4946F41E4@u.washington.edu> Message-ID: Manav, Since you have to solve equation at each time-step, I would suggest using Crank-Nicholson method, which combines implicit and explicit methods and gives higher order of approximation than Euler and Backward Euler methods that petsc supports. However, Crank-Nicholson method is not supported by the petsc release. You must use petsc-dev (see http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html#Obtaining on how to get it). Additional note: in petsc-dev, the interface functions TSSetRHSMatrix() and TSSetLHSMatrix() are replaced by TSSetMatrices(). An example of using cn method is petsc-dev/src/ts/examples/tests/ex1.c See the targets of "runex1_cn_*" in petsc-dev/src/ts/examples/tests/makefile on how to run this example. Use of LHS matrices in petsc is not well tested yet. I've been looking for examples that involve LHS matrix. Would you like contribute your application, or a simplified version of it to us as a test example? We'll put your name in the contributed example. Thanks, Hong > > [C(t,{T})] d{T}/dt = {F(t,{T})} - [K(t,{T})] {T} > > with initial conditions > {T(0)} = {T0} > > So, I have a problem which has a LHS matrix, which is also dependent > on the primary variable (which is temperature {T}). > In the simple case, ofcourse, we can neglect this dependence on > temperature (for [C]), but that has limited applicability for my > problem, since the [C] matrix has non-negligible nonlinearities. > So, I am looking for ways to formulate my problem to use the Petsc > solvers. 
The best option that I can think of is to restate the > problem as: > > d{T}/dt = [C(t,{T})]^(-1) ({F(t,{T})} - [K(t,{T})] {T}) > > where I can now specify the RHS function and its jacobian (I will > provide the jacobian, so no need to use finite differencing), and use > an explicit / implicit solver. > > However, if I assume a linear problem, then I am left with a case of > > [C] d{T}/dt = {F(t)} - [K]{T} > > Here, I could either restate the problem in the same way as I did > above, or I could specify a LHS matrix (in this case [C]), and ask > the solver to handle it. But, from our previous email exchanges, it > seems like I will have to use an implicit solver for the same, since > an explicit solver will not handle a LHS matrix. > > Kindly correct me if I am wrong. > > Thanks, > Manav > > > From diosady at MIT.EDU Wed Feb 21 11:14:24 2007 From: diosady at MIT.EDU (Laslo Tibor Diosady) Date: Wed, 21 Feb 2007 12:14:24 -0500 Subject: Defining my own reordering method for ILU Message-ID: <1172078064.18714.17.camel@splinter.mit.edu> Hi, I want to be able to define an ordering type for the ILU factorization to be used as a preconditioner to GMRES. I have successfully used PCFactorSetMatOrdering(), to set standard types, however I have my own reordering method which I would like to be able to use. Since my reordering method is based on information not found within the matrix and I may want to solve several different systems with the same ordering, what I really want to be able to do is create an index set in advance and pass it is to be used for the reordering. Is this possible, and how would you suggest I go about doing this? Thanks, Laslo From bsmith at mcs.anl.gov Wed Feb 21 11:27:24 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 21 Feb 2007 11:27:24 -0600 (CST) Subject: Defining my own reordering method for ILU In-Reply-To: <1172078064.18714.17.camel@splinter.mit.edu> References: <1172078064.18714.17.camel@splinter.mit.edu> Message-ID: Laslo, You can do your ordering up front to generate an index set. But you will still need to write a routine MatOrdering_MyOrdering() that returns the isrow and iscol, then register the routine with MatOrderingRegisterDynamic(). Then use PCFactorSetMatOrdering() to tell the PC to use your new ordering routine. To have the preconditioner reuse the reordering for several matrices use PCFactorSetReuseOrdering(). Good luck, Barry I know it is a little strange to need to provide the call back function MatOrdering_MyOrdering() that simply returns the ordering you already created but that is the only way to get the ordering into the right place inside the PC objects. The MatOrdering_MyOrdering() simply sets the isrow and iscol pointer and returns. On Wed, 21 Feb 2007, Laslo Tibor Diosady wrote: > Hi, > > I want to be able to define an ordering type for the ILU factorization > to be used as a preconditioner to GMRES. I have successfully used > PCFactorSetMatOrdering(), to set standard types, however I have my own > reordering method which I would like to be able to use. > > Since my reordering method is based on information not found within the > matrix and I may want to solve several different systems with the same > ordering, what I really want to be able to do is create an index set in > advance and pass it is to be used for the reordering. > > Is this possible, and how would you suggest I go about doing this? 
> > Thanks, > > Laslo > > From diosady at MIT.EDU Wed Feb 21 11:53:16 2007 From: diosady at MIT.EDU (Laslo Tibor Diosady) Date: Wed, 21 Feb 2007 12:53:16 -0500 Subject: Defining my own reordering method for ILU In-Reply-To: References: <1172078064.18714.17.camel@splinter.mit.edu> Message-ID: <1172080396.18714.24.camel@splinter.mit.edu> Hi Barry, If I understand how this works then I need to create a function MatOrdering_MyOrdering(Mat mat,const MatOrderingType type,IS *irow,IS *icol) My problem is that in order to compute my matrix ordering I need data other than the matrix as input. I can't see how I can do that if those are the only four arguments passed to the function. Thanks, Laslo On Wed, 2007-02-21 at 11:27 -0600, Barry Smith wrote: > Laslo, > > You can do your ordering up front to generate an index set. > But you will still need to write a routine MatOrdering_MyOrdering() that > returns the isrow and iscol, then register the routine > with MatOrderingRegisterDynamic(). Then use PCFactorSetMatOrdering() > to tell the PC to use your new ordering routine. > > To have the preconditioner reuse the reordering for several matrices > use PCFactorSetReuseOrdering(). > > Good luck, > > Barry > > I know it is a little strange to need to provide the call back function > MatOrdering_MyOrdering() that simply returns the ordering you already > created but that is the only way to get the ordering into the right > place inside the PC objects. > > The MatOrdering_MyOrdering() simply sets the isrow and iscol pointer > and returns. > > On Wed, 21 Feb 2007, Laslo Tibor Diosady wrote: > > > Hi, > > > > I want to be able to define an ordering type for the ILU factorization > > to be used as a preconditioner to GMRES. I have successfully used > > PCFactorSetMatOrdering(), to set standard types, however I have my own > > reordering method which I would like to be able to use. > > > > Since my reordering method is based on information not found within the > > matrix and I may want to solve several different systems with the same > > ordering, what I really want to be able to do is create an index set in > > advance and pass it is to be used for the reordering. > > > > Is this possible, and how would you suggest I go about doing this? > > > > Thanks, > > > > Laslo > > > > > From manav at u.washington.edu Wed Feb 21 13:04:50 2007 From: manav at u.washington.edu (Manav Bhatia) Date: Wed, 21 Feb 2007 11:04:50 -0800 Subject: TS In-Reply-To: References: <7A0AE5E2-4EB8-493E-AF41-14F3CCE3152A@u.washington.edu> <29AB5E28-BBDB-47A0-AA8E-52D4946F41E4@u.washington.edu> Message-ID: <3D428AD2-EFED-4631-A70C-7F9AC5F08A30@u.washington.edu> Hong, In my case, my LHS matrix is also dependent on the primary variable >> [C(t,{T})] d{T}/dt = {F(t,{T})} Would this case be handled by CN? or does it handle only a constant or time dependent LHS matrix? For a constant/time-dependent LHS matrix, the jacobian of the RHS is same as the steady-state nonlinear analysis. Otherwise, I need to change the jacobian definition too, I think. My main concern is: If I use the restated problem (from my previous mail) >> d{T}/dt = [C(t,{T})]^(-1) ({F(t,{T})} - [K(t,{T})] {T}) I know how to calculate the jacobian of the RHS, even though it requires inversion. With this, I can use any of the explicit/implicit methods. But if I do not state the problem like this, what do I do about the LHS matrix that is dependent on the primary variable? I would be happy to contribute an example. But it will have to wait a about 2 weeks till I can start working on it. 
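As a rough illustration of the restated form d{T}/dt = [C]^{-1}({F} - [K]{T}) mentioned above: the action of [C]^{-1} would normally be applied by solving with [C] inside the RHS callback rather than by forming an explicit inverse. The sketch below is not from this thread; AppCtx, ksp_C and the assembly step are placeholders, and the 4-argument KSPSetOperators() is the 2.3.x form.

    #include "petscksp.h"
    #include "petscts.h"

    typedef struct {
      Mat C, K;      /* current mass-like and stiffness-like matrices */
      Vec F, work;   /* load vector and scratch storage */
      KSP ksp_C;     /* linear solver used to apply C^{-1} */
    } AppCtx;

    /* Tdot = C(t,T)^{-1} ( F(t,T) - K(t,T) T ), with C^{-1} applied via a solve */
    PetscErrorCode RHSFunction(TS ts,PetscReal t,Vec T,Vec Tdot,void *ctx)
    {
      AppCtx         *user = (AppCtx*)ctx;
      PetscErrorCode ierr;

      /* user-provided reassembly of C, K and F at (t,T) would go here */

      ierr = MatMult(user->K,T,user->work);CHKERRQ(ierr);         /* work = K T         */
      ierr = VecAYPX(user->work,-1.0,user->F);CHKERRQ(ierr);      /* work = F - K T     */
      ierr = KSPSetOperators(user->ksp_C,user->C,user->C,SAME_NONZERO_PATTERN);CHKERRQ(ierr);
      ierr = KSPSolve(user->ksp_C,user->work,Tdot);CHKERRQ(ierr); /* Tdot = C^{-1} work */
      return 0;
    }

When [C] depends on {T} this reassembly and solve happen at every RHS evaluation, which is part of why having TS handle the LHS matrix directly, as discussed in this thread, is attractive.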
Thanks, Manav On Feb 21, 2007, at 7:24 AM, Hong Zhang wrote: > Manav, > > Since you have to solve equation at each time-step, > I would suggest using Crank-Nicholson method, which combines > implicit and explicit methods and gives > higher order of approximation than Euler and Backward Euler > methods that petsc supports. > > However, Crank-Nicholson method is not supported by the > petsc release. You must use petsc-dev > (see > http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/ > index.html#Obtaining > on how to get it). > > Additional note: in petsc-dev, the interface functions > TSSetRHSMatrix() and TSSetLHSMatrix() are replaced by TSSetMatrices(). > An example of using cn method is petsc-dev/src/ts/examples/tests/ex1.c > See the targets of > "runex1_cn_*" in petsc-dev/src/ts/examples/tests/makefile > on how to run this example. > > Use of LHS matrices in petsc is not well tested yet. > I've been looking for examples that involve LHS matrix. > Would you like contribute your application, or a simplified version > of it to us as a test example? We'll put your name > in the contributed example. > > Thanks, > > Hong >> >> [C(t,{T})] d{T}/dt = {F(t,{T})} - [K(t,{T})] {T} >> >> with initial conditions >> {T(0)} = {T0} >> >> So, I have a problem which has a LHS matrix, which is also dependent >> on the primary variable (which is temperature {T}). >> In the simple case, ofcourse, we can neglect this dependence on >> temperature (for [C]), but that has limited applicability for my >> problem, since the [C] matrix has non-negligible nonlinearities. >> So, I am looking for ways to formulate my problem to use the Petsc >> solvers. The best option that I can think of is to restate the >> problem as: >> >> d{T}/dt = [C(t,{T})]^(-1) ({F(t,{T})} - [K(t,{T})] {T}) >> >> where I can now specify the RHS function and its jacobian (I will >> provide the jacobian, so no need to use finite differencing), and use >> an explicit / implicit solver. >> >> However, if I assume a linear problem, then I am left with a case of >> >> [C] d{T}/dt = {F(t)} - [K]{T} >> >> Here, I could either restate the problem in the same way as I did >> above, or I could specify a LHS matrix (in this case [C]), and ask >> the solver to handle it. But, from our previous email exchanges, it >> seems like I will have to use an implicit solver for the same, since >> an explicit solver will not handle a LHS matrix. >> >> Kindly correct me if I am wrong. >> >> Thanks, >> Manav >> >> >> > From hzhang at mcs.anl.gov Wed Feb 21 14:13:42 2007 From: hzhang at mcs.anl.gov (Hong Zhang) Date: Wed, 21 Feb 2007 14:13:42 -0600 (CST) Subject: TS In-Reply-To: <3D428AD2-EFED-4631-A70C-7F9AC5F08A30@u.washington.edu> References: <7A0AE5E2-4EB8-493E-AF41-14F3CCE3152A@u.washington.edu> <29AB5E28-BBDB-47A0-AA8E-52D4946F41E4@u.washington.edu> <3D428AD2-EFED-4631-A70C-7F9AC5F08A30@u.washington.edu> Message-ID: Manav, > > >> [C(t,{T})] d{T}/dt = {F(t,{T})} The current code should be able to handle [C(t_n,{T_n})] {T_(n+1) - T_n}/dt = {F(t,{T})}, i.e., the LHS matrix uses explicit scheme. As I mentioned, the codes for cn and LHS matrix are buggy and not sufficiently tested. Additional coding is likely needed. How about start from the formulation > > >> d{T}/dt = [C(t,{T})]^(-1) ({F(t,{T})} - [K(t,{T})] {T}) and get a working code. Then pass it to me. I'll use it to test the above LHS matrix formulation and improve petsc cn method. > I would be happy to contribute an example. But it will have to wait a > about 2 weeks till I can start working on it. 
Fine with us. We can help to optimize it. Hong > > > On Feb 21, 2007, at 7:24 AM, Hong Zhang wrote: > > > Manav, > > > > Since you have to solve equation at each time-step, > > I would suggest using Crank-Nicholson method, which combines > > implicit and explicit methods and gives > > higher order of approximation than Euler and Backward Euler > > methods that petsc supports. > > > > However, Crank-Nicholson method is not supported by the > > petsc release. You must use petsc-dev > > (see > > http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/ > > index.html#Obtaining > > on how to get it). > > > > Additional note: in petsc-dev, the interface functions > > TSSetRHSMatrix() and TSSetLHSMatrix() are replaced by TSSetMatrices(). > > An example of using cn method is petsc-dev/src/ts/examples/tests/ex1.c > > See the targets of > > "runex1_cn_*" in petsc-dev/src/ts/examples/tests/makefile > > on how to run this example. > > > > Use of LHS matrices in petsc is not well tested yet. > > I've been looking for examples that involve LHS matrix. > > Would you like contribute your application, or a simplified version > > of it to us as a test example? We'll put your name > > in the contributed example. > > > > Thanks, > > > > Hong > >> > >> [C(t,{T})] d{T}/dt = {F(t,{T})} - [K(t,{T})] {T} > >> > >> with initial conditions > >> {T(0)} = {T0} > >> > >> So, I have a problem which has a LHS matrix, which is also dependent > >> on the primary variable (which is temperature {T}). > >> In the simple case, ofcourse, we can neglect this dependence on > >> temperature (for [C]), but that has limited applicability for my > >> problem, since the [C] matrix has non-negligible nonlinearities. > >> So, I am looking for ways to formulate my problem to use the Petsc > >> solvers. The best option that I can think of is to restate the > >> problem as: > >> > >> d{T}/dt = [C(t,{T})]^(-1) ({F(t,{T})} - [K(t,{T})] {T}) > >> > >> where I can now specify the RHS function and its jacobian (I will > >> provide the jacobian, so no need to use finite differencing), and use > >> an explicit / implicit solver. > >> > >> However, if I assume a linear problem, then I am left with a case of > >> > >> [C] d{T}/dt = {F(t)} - [K]{T} > >> > >> Here, I could either restate the problem in the same way as I did > >> above, or I could specify a LHS matrix (in this case [C]), and ask > >> the solver to handle it. But, from our previous email exchanges, it > >> seems like I will have to use an implicit solver for the same, since > >> an explicit solver will not handle a LHS matrix. > >> > >> Kindly correct me if I am wrong. > >> > >> Thanks, > >> Manav > >> > >> > >> > > > > From jinzishuai at yahoo.com Wed Feb 21 14:37:58 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Wed, 21 Feb 2007 12:37:58 -0800 (PST) Subject: efficiency of tranposing a Matrix? Message-ID: <20070221203758.2941.qmail@web36206.mail.mud.yahoo.com> Hi there, I have a code that keeps on using the same matrix L and its transpose in all time updates. I can improve the performance of the code by replacing the MatMultTranspose() with MatMult() and computing the transposed matrix at the beginning of the code for only once. The cost is of course extra storage of the transposed matrix. However, I have a question regarding the efficiency of transposing the matrix. I created the Matrix L with MPIAIJ and preallocated the proper memory for it. Then I call MatTranspose(L,<) to compute LT which is the transposed L. 
But I noticed that this process is extremely slow, 6 times slower than the creation of Matrix L itself. The first question is do I need to preallocate the memory for LT also? I didn't do it since I suppose PETSc is smart enough to figure out the necessary storage. Secondly, I am not sure why MatTranspose is so slow. I understand in order to transpose a Matrix, one may need to call MPI_Alltoall which is extremely expensive. But it seems trivial that I can go through a similar process of creating the Matrix L and be much faster. I am not sure how MatTraspose() is implemented and whether I should actually compose LT instead of transpose L. Thank you very much. Shi ____________________________________________________________________________________ Don't get soaked. Take a quick peak at the forecast with the Yahoo! Search weather shortcut. http://tools.search.yahoo.com/shortcuts/#loc_weather From hzhang at mcs.anl.gov Wed Feb 21 15:03:12 2007 From: hzhang at mcs.anl.gov (Hong Zhang) Date: Wed, 21 Feb 2007 15:03:12 -0600 (CST) Subject: efficiency of tranposing a Matrix? In-Reply-To: <20070221203758.2941.qmail@web36206.mail.mud.yahoo.com> References: <20070221203758.2941.qmail@web36206.mail.mud.yahoo.com> Message-ID: Shi, Checking MatTranspose_MPIAIJ(), I find that the preallocation is not implemented. This is likely the reason of slowdown. > > I have a code that keeps on using the same matrix L > and its transpose in all time updates. > I can improve the performance of the code by replacing > the MatMultTranspose() with MatMult() and computing > the transposed matrix at the beginning of the code for > only once. The cost is of course extra storage of the > transposed matrix. > > However, I have a question regarding the efficiency of > transposing the matrix. I created the Matrix L with > MPIAIJ and preallocated the proper memory for it. > Then I call MatTranspose(L,<) to compute LT which is > the transposed L. But I noticed that this process is > extremely slow, 6 times slower than the creation of > Matrix L itself. > > The first question is do I need to preallocate the > memory for LT also? I didn't do it since I suppose > PETSc is smart enough to figure out the necessary > storage. Preallocation of LT is non-trivial, requring all-to-all communications. I'll add it into MatTranspose_MPIAIJ(). > Secondly, I am not sure why MatTranspose is so slow. I > understand in order to transpose a Matrix, one may > need to call MPI_Alltoall which is extremely > expensive. But it seems trivial that I can go through > a similar process of creating the Matrix L and be much > faster. I am not sure how MatTraspose() is implemented > and whether I should actually compose LT instead of > transpose L. If you know the non-zero structure of LT without communication, creating it directly would outperform petsc MatTranspose(). See MatMatTranspose_MPIAIJ() in petsc/src/mat/impls/aij/mpi/mpiaij.c for details. Hong From bsmith at mcs.anl.gov Wed Feb 21 15:04:14 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 21 Feb 2007 15:04:14 -0600 (CST) Subject: Defining my own reordering method for ILU In-Reply-To: <1172080396.18714.24.camel@splinter.mit.edu> References: <1172078064.18714.17.camel@splinter.mit.edu> <1172080396.18714.24.camel@splinter.mit.edu> Message-ID: Laslo, I would just "cheat" and compute the ordering up front and then just have it as a global variable that you access in the routine. 
The more "PETSc" way would be to use PetscObjectCompose to stick the ordering you have computed in the matrix then PetscObjectQuery() to get it out inside the MyOrdering routine. Barry On Wed, 21 Feb 2007, Laslo Tibor Diosady wrote: > Hi Barry, > > If I understand how this works then I need to create a function > MatOrdering_MyOrdering(Mat mat,const MatOrderingType type,IS *irow,IS > *icol) > > My problem is that in order to compute my matrix ordering I need data > other than the matrix as input. > > I can't see how I can do that if those are the only four arguments > passed to the function. > > Thanks, > > Laslo > > > > On Wed, 2007-02-21 at 11:27 -0600, Barry Smith wrote: > > Laslo, > > > > You can do your ordering up front to generate an index set. > > But you will still need to write a routine MatOrdering_MyOrdering() that > > returns the isrow and iscol, then register the routine > > with MatOrderingRegisterDynamic(). Then use PCFactorSetMatOrdering() > > to tell the PC to use your new ordering routine. > > > > To have the preconditioner reuse the reordering for several matrices > > use PCFactorSetReuseOrdering(). > > > > Good luck, > > > > Barry > > > > I know it is a little strange to need to provide the call back function > > MatOrdering_MyOrdering() that simply returns the ordering you already > > created but that is the only way to get the ordering into the right > > place inside the PC objects. > > > > The MatOrdering_MyOrdering() simply sets the isrow and iscol pointer > > and returns. > > > > On Wed, 21 Feb 2007, Laslo Tibor Diosady wrote: > > > > > Hi, > > > > > > I want to be able to define an ordering type for the ILU factorization > > > to be used as a preconditioner to GMRES. I have successfully used > > > PCFactorSetMatOrdering(), to set standard types, however I have my own > > > reordering method which I would like to be able to use. > > > > > > Since my reordering method is based on information not found within the > > > matrix and I may want to solve several different systems with the same > > > ordering, what I really want to be able to do is create an index set in > > > advance and pass it is to be used for the reordering. > > > > > > Is this possible, and how would you suggest I go about doing this? > > > > > > Thanks, > > > > > > Laslo > > > > > > > > > > From diosady at MIT.EDU Wed Feb 21 19:36:05 2007 From: diosady at MIT.EDU (Laslo T. Diosady) Date: Wed, 21 Feb 2007 20:36:05 -0500 Subject: Defining my own reordering method for ILU In-Reply-To: References: <1172078064.18714.17.camel@splinter.mit.edu> <1172080396.18714.24.camel@splinter.mit.edu> Message-ID: Hi Barry, I tried the "PETSc" way and was successful. Thanks for the help, Laslo On Feb 21, 2007, at 4:04 PM, Barry Smith wrote: > > Laslo, > > I would just "cheat" and compute the ordering up front and > then just have it as a global variable that you access in the routine. > > The more "PETSc" way would be to use PetscObjectCompose to stick the > ordering you have computed in the matrix then PetscObjectQuery() to get > it out inside the MyOrdering routine. > > Barry > > On Wed, 21 Feb 2007, Laslo Tibor Diosady wrote: > >> Hi Barry, >> >> If I understand how this works then I need to create a function >> MatOrdering_MyOrdering(Mat mat,const MatOrderingType type,IS *irow,IS >> *icol) >> >> My problem is that in order to compute my matrix ordering I need data >> other than the matrix as input. >> >> I can't see how I can do that if those are the only four arguments >> passed to the function. 
>> >> Thanks, >> >> Laslo >> >> >> >> On Wed, 2007-02-21 at 11:27 -0600, Barry Smith wrote: >>> Laslo, >>> >>> You can do your ordering up front to generate an index set. >>> But you will still need to write a routine MatOrdering_MyOrdering() >>> that >>> returns the isrow and iscol, then register the routine >>> with MatOrderingRegisterDynamic(). Then use PCFactorSetMatOrdering() >>> to tell the PC to use your new ordering routine. >>> >>> To have the preconditioner reuse the reordering for several matrices >>> use PCFactorSetReuseOrdering(). >>> >>> Good luck, >>> >>> Barry >>> >>> I know it is a little strange to need to provide the call back >>> function >>> MatOrdering_MyOrdering() that simply returns the ordering you already >>> created but that is the only way to get the ordering into the right >>> place inside the PC objects. >>> >>> The MatOrdering_MyOrdering() simply sets the isrow and iscol pointer >>> and returns. >>> >>> On Wed, 21 Feb 2007, Laslo Tibor Diosady wrote: >>> >>>> Hi, >>>> >>>> I want to be able to define an ordering type for the ILU >>>> factorization >>>> to be used as a preconditioner to GMRES. I have successfully used >>>> PCFactorSetMatOrdering(), to set standard types, however I have my >>>> own >>>> reordering method which I would like to be able to use. >>>> >>>> Since my reordering method is based on information not found within >>>> the >>>> matrix and I may want to solve several different systems with the >>>> same >>>> ordering, what I really want to be able to do is create an index >>>> set in >>>> advance and pass it is to be used for the reordering. >>>> >>>> Is this possible, and how would you suggest I go about doing this? >>>> >>>> Thanks, >>>> >>>> Laslo >>>> >>>> >>> >> >> > From zonexo at gmail.com Thu Feb 22 01:33:23 2007 From: zonexo at gmail.com (Ben Tay) Date: Thu, 22 Feb 2007 15:33:23 +0800 Subject: Using Compaq visual fortran with PETSc and not installing Intel MKL/MPICH Message-ID: <804ab5d40702212333l71dccda4yab44aa730effae9a@mail.gmail.com> Hi, I have been using PETSc with visual fortran/intel mkl/mpich installed. This has the same configuration as the configuration file .dsw supplied by PETSc. However, now using another of my school's computer, MKL and MPICH are not installed. Is there anyway I can still use compaq visual fortran with PETSc? By using --download-f-blas-lapack=1, how can I make them work? Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From balay at mcs.anl.gov Thu Feb 22 09:15:57 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Thu, 22 Feb 2007 09:15:57 -0600 (CST) Subject: Using Compaq visual fortran with PETSc and not installing Intel MKL/MPICH In-Reply-To: <804ab5d40702212333l71dccda4yab44aa730effae9a@mail.gmail.com> References: <804ab5d40702212333l71dccda4yab44aa730effae9a@mail.gmail.com> Message-ID: On Thu, 22 Feb 2007, Ben Tay wrote: > Hi, > > I have been using PETSc with visual fortran/intel mkl/mpich installed. This > has the same configuration as the configuration file .dsw supplied by PETSc. > However, now using another of my school's computer, MKL and MPICH are not > installed. > > Is there anyway I can still use compaq visual fortran with PETSc? By using > --download-f-blas-lapack=1, how can I make them work? For blas/lapack the above should work, but for MPI - you'll have to either install mpich or use --with-mpi=0. 
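Concretely, that means a configure line along the following lines, keeping whatever --with-cc/--with-fc settings already work for the Compaq Visual Fortran build (the placeholders are site specific and are not from this thread), followed by the usual make and make test:

    ./config/configure.py --with-cc=<your C compiler> --with-fc=<your CVF compiler> \
        --download-f-blas-lapack=1 --with-mpi=0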
Satish From hzhang at mcs.anl.gov Thu Feb 22 16:35:54 2007 From: hzhang at mcs.anl.gov (Hong Zhang) Date: Thu, 22 Feb 2007 16:35:54 -0600 (CST) Subject: efficiency of tranposing a Matrix? In-Reply-To: <20070221203758.2941.qmail@web36206.mail.mud.yahoo.com> References: <20070221203758.2941.qmail@web36206.mail.mud.yahoo.com> Message-ID: Shi, I added preallocation to MatTranspose_MPIAIJ(), in which, d_nnz is computed from L, but o_nnz is set as d_nnz. This avoids data communication, and allocates sufficient space in most cases, I believe :-) You may either get petsc-dev, or replace MatTranspose_MPIAIJ() in your ~petsc/src/mat/impls/aij/mpi/mpiaij.c with the one attached. Then rebuild the petsc lib. Let us know if you still have slow down in MatTranspose(). Hong On Wed, 21 Feb 2007, Shi Jin wrote: > Hi there, > > I have a code that keeps on using the same matrix L > and its transpose in all time updates. > I can improve the performance of the code by replacing > the MatMultTranspose() with MatMult() and computing > the transposed matrix at the beginning of the code for > only once. The cost is of course extra storage of the > transposed matrix. > > However, I have a question regarding the efficiency of > transposing the matrix. I created the Matrix L with > MPIAIJ and preallocated the proper memory for it. > Then I call MatTranspose(L,<) to compute LT which is > the transposed L. But I noticed that this process is > extremely slow, 6 times slower than the creation of > Matrix L itself. > > The first question is do I need to preallocate the > memory for LT also? I didn't do it since I suppose > PETSc is smart enough to figure out the necessary > storage. > > Secondly, I am not sure why MatTranspose is so slow. I > understand in order to transpose a Matrix, one may > need to call MPI_Alltoall which is extremely > expensive. But it seems trivial that I can go through > a similar process of creating the Matrix L and be much > faster. I am not sure how MatTraspose() is implemented > and whether I should actually compose LT instead of > transpose L. > > Thank you very much. > > Shi > > > > ____________________________________________________________________________________ > Don't get soaked. Take a quick peak at the forecast > with the Yahoo! Search weather shortcut. > http://tools.search.yahoo.com/shortcuts/#loc_weather > > -------------- next part -------------- A non-text attachment was scrubbed... Name: mattranspose.c Type: text/x-csrc Size: 2314 bytes Desc: mattranspose.c URL: From niriedith at gmail.com Fri Feb 23 09:08:42 2007 From: niriedith at gmail.com (Niriedith Karina ) Date: Fri, 23 Feb 2007 11:08:42 -0400 Subject: about Unstructured Meshes Message-ID: Hi !! I'm new here...and i'm new a petsc user.. :P I want to know if petsc has support for meshes... I was reading about that and i need create a mesh but all the software that i see are comercial ...so I use petsc for linear solver and matrices so if petsc create meshes it would be but easy for me .. so.. help me ! :D Thanks anyway... -------------- next part -------------- An HTML attachment was scrubbed... URL: From niriedith at gmail.com Fri Feb 23 09:11:54 2007 From: niriedith at gmail.com (Niriedith Karina ) Date: Fri, 23 Feb 2007 11:11:54 -0400 Subject: about SAMG Message-ID: Hi again :P how can work whit the algebraic multigrid in petsc? i really don't know :( thaks :D -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From knepley at gmail.com Fri Feb 23 09:13:23 2007 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 23 Feb 2007 09:13:23 -0600 Subject: about Unstructured Meshes In-Reply-To: References: Message-ID: On 2/23/07, Niriedith Karina wrote: > Hi !! > > I'm new here...and i'm new a petsc user.. :P > I want to know if petsc has support for meshes... > I was reading about that and i need create a mesh but all the software that > i see are comercial ...so I use petsc for linear solver and matrices so if > petsc create meshes it would be but easy for me .. so.. > help me ! :D 1) PETSc does not make meshes, but there are good free meshing packages, Triangle in 2D and TetGen in 3D. 2) The unstructured mesh support in PETSc is very new, and at this point is probably only usable by expert programmers. If you feel up to it, take a look at the examples in src/dm/mesh/examples/tutorials. Otherwise, you can use the packages above and manage the construction of Mats and Vecs yourself. Thanks, Matt > Thanks anyway... From knepley at gmail.com Fri Feb 23 09:14:53 2007 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 23 Feb 2007 09:14:53 -0600 Subject: about SAMG In-Reply-To: References: Message-ID: On 2/23/07, Niriedith Karina wrote: > Hi again :P > > how can work whit the algebraic multigrid in petsc? i really don't know :( You need to configure with an AMG package, for instance --download-hypre. Then the solver will be available -pc_type hypre -pc_hypre_type boomeramg. Matt > thaks :D > From niriedith at gmail.com Fri Feb 23 09:50:59 2007 From: niriedith at gmail.com (Niriedith Karina ) Date: Fri, 23 Feb 2007 11:50:59 -0400 Subject: about Unstructured Meshes In-Reply-To: References: Message-ID: oka oka... thaks but... i don't understand very well.... what can i do with petsc and the unstructured mesh support? because you say that petsc does not make meshes so what it's new about the meshes in petsc ? i'm sorry but i'm very new in this area... thaks again... On 2/23/07, Matthew Knepley wrote: > > On 2/23/07, Niriedith Karina wrote: > > Hi !! > > > > I'm new here...and i'm new a petsc user.. :P > > I want to know if petsc has support for meshes... > > I was reading about that and i need create a mesh but all the software > that > > i see are comercial ...so I use petsc for linear solver and matrices so > if > > petsc create meshes it would be but easy for me .. so.. > > help me ! :D > > 1) PETSc does not make meshes, but there are good free meshing packages, > Triangle in 2D and TetGen in 3D. > > 2) The unstructured mesh support in PETSc is very new, and at this point > is > probably only usable by expert programmers. If you feel up to it, > take a look > at the examples in src/dm/mesh/examples/tutorials. Otherwise, you can > use > the packages above and manage the construction of Mats and Vecs > yourself. > > Thanks, > > Matt > > > Thanks anyway... > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Fri Feb 23 09:53:07 2007 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 23 Feb 2007 09:53:07 -0600 Subject: about Unstructured Meshes In-Reply-To: References: Message-ID: On 2/23/07, Niriedith Karina wrote: > oka oka... > thaks > but... i don't understand very well.... > what can i do with petsc and the unstructured mesh support? because you say > that petsc does not make meshes so what it's new about the meshes in petsc ? > i'm sorry but i'm very new in this area... Then I think you should not try out the PETSc stuff yet. 
I would just get the appropriate mesh generator and go from there. The extra support is for construction of functions and operators over a mesh, but you can do that yourself after generating them. Matt > thaks again... > > > On 2/23/07, Matthew Knepley wrote: > > On 2/23/07, Niriedith Karina < niriedith at gmail.com> wrote: > > > Hi !! > > > > > > I'm new here...and i'm new a petsc user.. :P > > > I want to know if petsc has support for meshes... > > > I was reading about that and i need create a mesh but all the software > that > > > i see are comercial ...so I use petsc for linear solver and matrices so > if > > > petsc create meshes it would be but easy for me .. so.. > > > help me ! :D > > > > 1) PETSc does not make meshes, but there are good free meshing packages, > > Triangle in 2D and TetGen in 3D. > > > > 2) The unstructured mesh support in PETSc is very new, and at this point > is > > probably only usable by expert programmers. If you feel up to it, > > take a look > > at the examples in src/dm/mesh/examples/tutorials. Otherwise, you can > use > > the packages above and manage the construction of Mats and Vecs > yourself. > > > > Thanks, > > > > Matt > > > > > Thanks anyway... > > > > > > -- One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. -- Drummond Rennie From niriedith at gmail.com Tue Feb 27 08:23:20 2007 From: niriedith at gmail.com (Niriedith Karina ) Date: Tue, 27 Feb 2007 10:23:20 -0400 Subject: about mesh generators Message-ID: Hi!! I read about a software Hexgen .... Anyone knows where find this software...is very very important because i need that a software for meshes hexa... Thanks !! -------------- next part -------------- An HTML attachment was scrubbed... URL: From niriedith at gmail.com Tue Feb 27 10:24:33 2007 From: niriedith at gmail.com (Niriedith Karina ) Date: Tue, 27 Feb 2007 12:24:33 -0400 Subject: about AMG Message-ID: Hi! How configure hypre with petsc?... I have installed hypre in the cluster...but when i run a program in petsc with -pc_type hypre -pc_type_hypre boomeramg it doesn't work :( Help me! :( -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at mcs.anl.gov Tue Feb 27 11:09:22 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 27 Feb 2007 11:09:22 -0600 (CST) Subject: about AMG In-Reply-To: References: Message-ID: Add the config/configure.py option --download-hypre Note that the hypre from the LLNL has some bugs that make it unusable from PETSc. Barry On Tue, 27 Feb 2007, Niriedith Karina wrote: > Hi! > > How configure hypre with petsc?... > > I have installed hypre in the cluster...but when i run a program in petsc > with -pc_type hypre -pc_type_hypre boomeramg it doesn't work :( > Help me! 
:( > From niriedith at gmail.com Tue Feb 27 11:31:53 2007 From: niriedith at gmail.com (Niriedith Karina ) Date: Tue, 27 Feb 2007 13:31:53 -0400 Subject: about AMG In-Reply-To: References: Message-ID: i did that...and the configuration was successful ./configure and then make install but it doesn't work :( I think that may be the dir where i did it is the problem... petsc in /opt/petsc/ hypre in /opt/hypre/ (sorry also i'm new in english and linux :P ) Help me.... :( -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Tue Feb 27 12:49:11 2007 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 27 Feb 2007 12:49:11 -0600 Subject: about AMG In-Reply-To: References: Message-ID: If you have a configure problem, please send configure.log to petsc-maint at mcs.anl.gov. Matt On 2/27/07, Niriedith Karina wrote: > i did that...and the configuration was successful > ./configure > and then > make install > > but it doesn't work :( > I think that may be the dir where i did it is the problem... > petsc in /opt/petsc/ > hypre in /opt/hypre/ > (sorry also i'm new in english and linux :P ) > > Help me.... > :( > -- One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. -- Drummond Rennie From dalcinl at gmail.com Tue Feb 27 13:35:41 2007 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Tue, 27 Feb 2007 16:35:41 -0300 Subject: about AMG In-Reply-To: References: Message-ID: In case this helps you, here goes what I usually do for installing petsc in our cluster, in a central location. Make sure 'mpicc' is in your $PATH, the first line is to be sure about this. $ which mpicc /usr/local/mpich2/1.0.5/bin/mpicc $ tar -zxf petsc-2.3.2-p8.tar.gz $ cd petsc-2.3.2-p8 $ export PETSC_DIR=`pwd` $ export PETSC_ARCH=linux-gnu $ touch ~/.hypre_license $ python config/configure.py --prefix=/usr/local/petsc/2.3.2 --with-shared=1 --with-hypre=1 --download-hypre=ifneeded $ make $ su -c 'make install' $ export PETSC_DIR=/usr/local/petsc/2.3.2 $ make test On 2/27/07, Niriedith Karina wrote: > i did that...and the configuration was successful > ./configure > and then > make install > > but it doesn't work :( > I think that may be the dir where i did it is the problem... > petsc in /opt/petsc/ > hypre in /opt/hypre/ > (sorry also i'm new in english and linux :P ) > > Help me.... > :( > -- Lisandro Dalc?n --------------- Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) PTLC - G?emes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594 From bsmith at mcs.anl.gov Tue Feb 27 15:45:28 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 27 Feb 2007 15:45:28 -0600 (CST) Subject: about AMG In-Reply-To: References: Message-ID: Please send to petsc-maint at mcs.anl.gov configure.log and make_* from building PETSc. 
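For when the build does go through: once PETSc has been configured with --download-hypre, BoomerAMG is only a runtime selection. A minimal sketch of making that choice from code (the wrapper name SolveWithBoomerAMG is made up here, and the 4-argument KSPSetOperators() is the 2.3.x calling sequence):

    #include "petscksp.h"

    /* Solve A x = b with GMRES preconditioned by hypre BoomerAMG. */
    PetscErrorCode SolveWithBoomerAMG(Mat A,Vec b,Vec x)
    {
      KSP            ksp;
      PC             pc;
      PetscErrorCode ierr;

      ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);CHKERRQ(ierr);
      ierr = KSPSetOperators(ksp,A,A,SAME_NONZERO_PATTERN);CHKERRQ(ierr);
      ierr = KSPSetType(ksp,KSPGMRES);CHKERRQ(ierr);
      ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
      ierr = PCSetType(pc,PCHYPRE);CHKERRQ(ierr);
      ierr = PCHYPRESetType(pc,"boomeramg");CHKERRQ(ierr);
      ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);   /* command-line options can still override */
      ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
      ierr = KSPDestroy(ksp);CHKERRQ(ierr);
      return 0;
    }

The equivalent with no code changes at all is -ksp_type gmres -pc_type hypre -pc_hypre_type boomeramg on the command line, which is usually the more flexible route.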
Barry On Tue, 27 Feb 2007, Niriedith Karina wrote: > i did that...and the configuration was successful > ./configure > and then > make install > > but it doesn't work :( > I think that may be the dir where i did it is the problem... > petsc in /opt/petsc/ > hypre in /opt/hypre/ > (sorry also i'm new in english and linux :P ) > > Help me.... > :( > From niriedith at gmail.com Tue Feb 27 15:54:22 2007 From: niriedith at gmail.com (Niriedith Karina ) Date: Tue, 27 Feb 2007 17:54:22 -0400 Subject: algebraic multigrid preconditioner Message-ID: Hi! I need a linear solver with GMRES and a Algebraiv multigrid I read about SAMG y BoomerAMG...but i dont know how use SAMG and with Hypre i had some problem to configure....also you say that hypre has some bugs that make it unusable from PETSc...so my question how can i do that? Thanks! -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Tue Feb 27 15:57:39 2007 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 27 Feb 2007 15:57:39 -0600 Subject: algebraic multigrid preconditioner In-Reply-To: References: Message-ID: On 2/27/07, Niriedith Karina wrote: > Hi! > > I need a linear solver with GMRES and a Algebraiv multigrid > I read about SAMG y BoomerAMG...but i dont know how use SAMG > and with Hypre i had some problem to configure....also you say that hypre > has some bugs that make it unusable > from PETSc...so my question how can i do that? > > Thanks! You must reconfigure using --download-hypre. If you have a problem, send the configure.log to petsc-maint at mcs.anl.gov. Matt -- One trouble is that despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature citation too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. -- Drummond Rennie From niriedith at gmail.com Tue Feb 27 16:54:50 2007 From: niriedith at gmail.com (Niriedith Karina ) Date: Tue, 27 Feb 2007 18:54:50 -0400 Subject: algebraic multigrid preconditioner In-Reply-To: References: Message-ID: Finally I understood =D ...but...i have a problem...my account in the cluster is limited...i use petsc but it's installed previously in the university i have a month using petsc...so...my question is it's the only way to do that right,( reinstall petsc)? Thanks! -------------- next part -------------- An HTML attachment was scrubbed... URL: From dalcinl at gmail.com Tue Feb 27 17:41:11 2007 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Tue, 27 Feb 2007 20:41:11 -0300 Subject: algebraic multigrid preconditioner In-Reply-To: References: Message-ID: Well, in our cluster, the debug version need [dalcinl at aquiles ~]$ du -sh /usr/local/petsc/dev/lib/linux-gnu 163M /usr/local/petsc/dev/lib/linux-gnu but the optimized version only need [dalcinl at aquiles ~]$ du -sh /usr/local/petsc/dev/lib/linux-gnu-O 18M /usr/local/petsc/dev/lib/linux-gnu-O Can you afford to have 20M in your cluster account? Of course, you will need many more space for building PETSc, but perhaps you can do that on /tmp or some public scratch space. 
From dalcinl at gmail.com Tue Feb 27 17:41:11 2007 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Tue, 27 Feb 2007 20:41:11 -0300 Subject: algebraic multigrid preconditioner In-Reply-To: References: Message-ID: Well, in our cluster, the debug version needs
[dalcinl at aquiles ~]$ du -sh /usr/local/petsc/dev/lib/linux-gnu
163M /usr/local/petsc/dev/lib/linux-gnu
but the optimized version needs only
[dalcinl at aquiles ~]$ du -sh /usr/local/petsc/dev/lib/linux-gnu-O
18M /usr/local/petsc/dev/lib/linux-gnu-O
Can you afford to have 20M in your cluster account? Of course, you will need much more space for building PETSc, but perhaps you can do that on /tmp or some public scratch space. Or even better, ask your cluster sys admin to build PETSc with your specific configure options (in your case, with hypre), using an appropriate name for PETSC_ARCH, for example PETSC_ARCH=linux-gnu-O-hypre. On 2/27/07, Niriedith Karina wrote: > Finally I understood =D ... but... I have a problem... my account on the > cluster is limited... I use PETSc, but it was installed previously by the > university; I have been using PETSc for a month... so my question: is reinstalling > PETSc the only way to do that? > > Thanks! > -- Lisandro Dalcín --------------- Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC) Instituto de Desarrollo Tecnológico para la Industria Química (INTEC) Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) PTLC - Güemes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594 From jinzishuai at yahoo.com Wed Feb 28 16:14:19 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Wed, 28 Feb 2007 14:14:19 -0800 (PST) Subject: Memory allocated by PETSC? Message-ID: <414054.87968.qm@web36208.mail.mud.yahoo.com> Hi, I am curious how much extra memory PETSc allocates in the background, since my estimate of memory usage of the code is much smaller than what I see when it runs. So I did this simple test: First I used PETSc to dump a matrix in binary format into a file. The file has a size of 13MB. I assume this should be the same size that is used to store the matrix in memory. Then I wrote a simple code that does nothing but load this matrix from the file by MatLoad(). However, I found that the code consumes 29MB of memory (VIRT=29M from top) using a single process. This is confirmed by the -malloc_log option where it says Maximum memory PetscMalloc()ed 29246912 maximum size of entire process 0 I've attached the output of the code with detailed malloc information. Could you please explain to me the difference of over two times? I don't want to criticize anything but need a clear idea of how much memory is needed so that I know whether there is a chance for me to reduce the memory usage of my production code. Thank you very much. Shi -------------- next part -------------- A non-text attachment was scrubbed... Name: out Type: application/octet-stream Size: 1455 bytes Desc: 1857821269-out URL: From balay at mcs.anl.gov Wed Feb 28 16:51:38 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 28 Feb 2007 16:51:38 -0600 (CST) Subject: Memory allocated by PETSC? In-Reply-To: <414054.87968.qm@web36208.mail.mud.yahoo.com> References: <414054.87968.qm@web36208.mail.mud.yahoo.com> Message-ID: > Maximum memory PetscMalloc()ed 29246912 maximum size of entire process 0 The choice of wording here is a bit misleading. PETSc is using getrusage(ru_maxrss) - which is resident set size. [so top should show similar numbers for RSS] This might include both the code segment and the data segments - and the code segment part could be a few MB, perhaps up to 10 MB. 0: [0] 10 15321472 MatSeqAIJSetPreallocation_SeqAIJ() This indicates that the matrix is taking approximately 15MB. And there are other data structures that are taking about another couple of MB of space. Depending upon how malloc()/free() is implemented in the OS, some of the freed memory might not immediately reflect on the RSS count. Hope this helps.. Satish
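For anyone trying to reproduce this kind of accounting: PETSc can report its own allocation totals alongside the OS numbers, which helps separate "memory PetscMalloc()ed" from the resident set size that top shows. The routine below is a sketch, not code from the thread; the query functions named here are taken from recent PETSc releases and are assumed to have equivalents in the 2.3.x series, and the malloc figures are only meaningful when PETSc's malloc tracking is active (a debug build, or the -malloc option).

  /* Sketch: compare PETSc's own allocation count with the OS view of the process.
     Routine names follow recent PETSc releases (an assumption, see above);
     'label' is just an arbitrary tag for the printout. */
  #include "petscsys.h"   /* the corresponding header is "petsc.h" in the 2.3.x series */

  PetscErrorCode report_memory(const char *label)
  {
    PetscLogDouble mallocd, process;
    PetscErrorCode ierr;

    ierr = PetscMallocGetCurrentUsage(&mallocd);CHKERRQ(ierr); /* bytes currently obtained through PetscMalloc() */
    ierr = PetscMemoryGetCurrentUsage(&process);CHKERRQ(ierr); /* what the OS reports for the whole process */
    ierr = PetscPrintf(PETSC_COMM_WORLD, "%s: PetscMalloc()ed %g bytes, process size %g bytes\n",
                       label, mallocd, process);CHKERRQ(ierr);
    return 0;
  }

Calling something like this before and after MatLoad() shows how much of the gap is PETSc's own allocation and how much is code segments, MPI buffers and allocator overhead - a split the single high-water-mark line printed by -malloc_log cannot make.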
On Wed, 28 Feb 2007, Shi Jin wrote: > Hi, > I am curious how much extra memory PETSc allocates in > the background, since my estimate of memory usage of > the code is much smaller than what I see when it runs. > So I did this simple test: > First I used PETSc to dump a matrix in binary format > into a file. The file has a size of 13MB. I assume > this should be the same size that is used to store the > matrix in memory. Then I wrote a simple code that does > nothing but load this matrix from the file by > MatLoad(). However, I found that the code consumes > 29MB of memory (VIRT=29M from top) using a single > process. > This is confirmed by the -malloc_log option where it > says > Maximum memory PetscMalloc()ed 29246912 maximum size > of entire process 0 > I've attached the output of the code with detailed > malloc information. > Could you please explain to me the difference > of over two times? > I don't want to criticize anything but need a clear > idea of how much memory is needed so that I know > whether there is a chance for me to reduce the memory > usage of my production code. > Thank you very much. > > Shi From bsmith at mcs.anl.gov Wed Feb 28 16:59:51 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 28 Feb 2007 16:59:51 -0600 (CST) Subject: Memory allocated by PETSC? In-Reply-To: <414054.87968.qm@web36208.mail.mud.yahoo.com> References: <414054.87968.qm@web36208.mail.mud.yahoo.com> Message-ID: Shi, The current algorithm used to do a MatLoad_MPIAIJ requires memory on each process of about TWICE the memory required just for the matrix. For example, if a matrix requires 40 megabytes total, after it is completely loaded on 4 processes it will take about 10 megabytes on each process, BUT during MatLoad each process will use 20 megabytes (10 for the final matrix and 10 for work space to receive messages in). This is why the total is 29 meg: 15meg for the final matrix and around 18meg for the MatLoad. Barry We could work hard and reduce the amount of memory used during the load process if this is a problem for you. We are not fans of loading huge matrices from files, so generally this is not a problem. On Wed, 28 Feb 2007, Shi Jin wrote: > Hi, > I am curious how much extra memory PETSc allocates in > the background, since my estimate of memory usage of > the code is much smaller than what I see when it runs. > So I did this simple test: > First I used PETSc to dump a matrix in binary format > into a file. The file has a size of 13MB. I assume > this should be the same size that is used to store the > matrix in memory. Then I wrote a simple code that does > nothing but load this matrix from the file by > MatLoad(). However, I found that the code consumes > 29MB of memory (VIRT=29M from top) using a single > process. > This is confirmed by the -malloc_log option where it > says > Maximum memory PetscMalloc()ed 29246912 maximum size > of entire process 0 > I've attached the output of the code with detailed > malloc information. > Could you please explain to me the > difference > of over two times? > I don't want to criticize anything but need a > clear > idea of how much memory is needed so that I know > whether there is a chance for me to reduce the > memory > usage of my production code. > Thank you very much.
> > Shi From jinzishuai at yahoo.com Wed Feb 28 17:16:15 2007 From: jinzishuai at yahoo.com (Shi Jin) Date: Wed, 28 Feb 2007 15:16:15 -0800 (PST) Subject: Memory allocated by PETSC? In-Reply-To: Message-ID: <168399.47181.qm@web36205.mail.mud.yahoo.com> Thank you very much. This is very helpful. So the mismatch in size only comes from MatLoad()? I am actually not a big fan of loading the matrices either. I used it just to do some tests. There is no need to change the implementation for me at all. So can I say that if I am going to construct the same matrix in the code using MatCreateMPIAIJ() and if I do the preallocation exactly, then I should see roughly 15MB of memory used? Thank you. Shi --- Barry Smith wrote: > > Shi, > > The current algorithm used to do a > MatLoad_MPIAIJ requires > memory on each process of about TWICE the memory > required just for > the matrix. For example, if a matrix requires 40 > megabytes total, > after it is completely loaded on 4 processes it will > take about > 10 megabytes on each process, BUT during MatLoad > each process will > use 20 megabytes (10 for the final matrix and 10 for > work space > to receive messages in). This is why the total is 29 > meg: 15meg for the final > matrix and around 18meg for the MatLoad. > > Barry > > We could work hard and reduce the amount of memory > used during the > load process if this is a problem for you. We are not > fans of loading > huge matrices from files, so generally this is not a > problem. > > > On Wed, 28 Feb 2007, Shi Jin wrote: > > > Hi, > > I am curious how much extra memory PETSc allocates > in > > the background, since my estimate of memory usage > of > > the code is much smaller than what I see when it > runs. > > So I did this simple test: > > First I used PETSc to dump a matrix in binary > format > > into a file. The file has a size of 13MB. I assume > > this should be the same size that is used to store > the > > matrix in memory. Then I wrote a simple code that > does > > nothing but load this matrix from the file by > > MatLoad(). However, I found that the code consumes > > 29MB of memory (VIRT=29M from top) using a single > > process. > > This is confirmed by the -malloc_log option where > it > > says > > Maximum memory PetscMalloc()ed 29246912 maximum > size > > of entire process 0 > > I've attached the output of the code with detailed > > malloc information. > > Could you please explain to me the > difference > > of over two times? > > I don't want to criticize anything but need a > clear > > idea of how much memory is needed so that I know > > whether there is a chance for me to reduce the > memory > > usage of my production code. > > Thank you very much. > > > > Shi From balay at mcs.anl.gov Wed Feb 28 17:23:51 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 28 Feb 2007 17:23:51 -0600 (CST) Subject: Memory allocated by PETSC?
In-Reply-To: References: <414054.87968.qm@web36208.mail.mud.yahoo.com> Message-ID: On Wed, 28 Feb 2007, Satish Balay wrote: > > Maximum memory PetscMalloc()ed 29246912 maximum size of entire process 0 > > The choice of wording here is a bit misleading. PETSc is using > getrusage(ru_maxrss) - which is resident set size. [so top should show > similar numbers for RSS] Oops - my comments are wrong here.. RSS here is printed as '0' - so there is a problem with the PETSc code somewhere.. Satish From balay at mcs.anl.gov Wed Feb 28 17:36:46 2007 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 28 Feb 2007 17:36:46 -0600 (CST) Subject: Memory allocated by PETSC? In-Reply-To: <168399.47181.qm@web36205.mail.mud.yahoo.com> References: <168399.47181.qm@web36205.mail.mud.yahoo.com> Message-ID: On Wed, 28 Feb 2007, Shi Jin wrote: > Thank you very much. > This is very helpful. > So the mismatch in size only comes from MatLoad()? > I am actually not a big fan of loading the matrices > either. I used it just to do some tests. There is no > need to change the implementation for me at all. > > So can I say that if I am going to construct the same > matrix in the code using MatCreateMPIAIJ() and if I do > the preallocation exactly, then I should see roughly > 15MB of memory used? More or less.. However, this number [Maximum memory PetscMalloc()ed] corresponds to malloced memory only. RSS numbers would be different. Note that the extra memory in MatLoad() is just temporary - i.e. it's freed at the end of this function. [And in the next stage the solver might take a lot more memory than this temporary MatLoad stuff.] Satish From bsmith at mcs.anl.gov Wed Feb 28 20:39:55 2007 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 28 Feb 2007 20:39:55 -0600 (CST) Subject: Memory allocated by PETSC? In-Reply-To: References: <414054.87968.qm@web36208.mail.mud.yahoo.com> Message-ID: > > Oops - my comments are wrong here.. RSS here is printed as '0' - so > there is a problem with the PETSc code somewhere.. > I could never get Linux to give me this number correctly :-( > Satish > >
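As a footnote to the preallocation question above, here is a rough sketch of creating the same kind of matrix directly with exact preallocation. It is not code from the thread: nlocal, d_nnz and o_nnz are placeholders the application has to fill in (d_nnz[i] and o_nnz[i] being the nonzero counts of local row i inside and outside the local column block), and MatCreateMPIAIJ() is the 2.3.x-era name (later releases call it MatCreateAIJ()). With exact counts, assembly should allocate essentially nothing beyond the final matrix, so the earlier 15MB figure is then a reasonable estimate of the PetscMalloc()ed part; the RSS that top reports will still be larger, as Satish notes.

  /* Sketch: an MPIAIJ matrix with exact per-row preallocation.
     nlocal, d_nnz[], o_nnz[] are placeholders supplied by the application. */
  #include "petscmat.h"

  PetscErrorCode build_matrix(PetscInt nlocal, PetscInt d_nnz[], PetscInt o_nnz[], Mat *A)
  {
    PetscErrorCode ierr;

    ierr = MatCreateMPIAIJ(PETSC_COMM_WORLD,
                           nlocal, nlocal,                    /* local rows and columns */
                           PETSC_DETERMINE, PETSC_DETERMINE,  /* let PETSc sum the global sizes */
                           0, d_nnz,                          /* diagonal block: per-row nonzero counts */
                           0, o_nnz,                          /* off-diagonal block: per-row nonzero counts */
                           A);CHKERRQ(ierr);
    /* ... MatSetValues() loop here, then MatAssemblyBegin()/MatAssemblyEnd() ... */
    return 0;
  }

Running with -info prints how many mallocs MatSetValues() needed during assembly; with exact preallocation that count should come out as zero.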