From Stephen.R.Ball at awe.co.uk Thu Apr 3 06:34:13 2008 From: Stephen.R.Ball at awe.co.uk (Stephen R Ball) Date: Thu, 3 Apr 2008 12:34:13 +0100 Subject: Default convergence test used in PETSc Message-ID: <843CZB023912@awe.co.uk> Hi I am trying to determine the default convergence test used by PETSc. Your pdf user's guide states the default test is based on the decrease of the residual norm relative to the right-hand-side while your html documentation for KSPDefaultConverged() seems to imply that the default test is based on the decrease of the residual norm relative to the initial residual norm. Can you tell me which is actually the default? Regards Stephen From knepley at gmail.com Thu Apr 3 08:53:23 2008 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 3 Apr 2008 08:53:23 -0500 Subject: Default convergence test used in PETSc In-Reply-To: <843CZB023912@awe.co.uk> References: <843CZB023912@awe.co.uk> Message-ID: There are special cases, the explanation can be rather lengthy. I instead point you to the actual code: http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/src/ksp/ksp/interface/iterativ.c.html#KSPDefaultConverged Matt On Thu, Apr 3, 2008 at 6:34 AM, Stephen R Ball wrote: > Hi > > I am trying to determine the default convergence test used by PETSc. > Your pdf user's guide states the default test is based on the decrease > of the residual norm relative to the right-hand-side while your html > documentation for KSPDefaultConverged() seems to imply that the default > test is based on the decrease of the residual norm relative to the > initial residual norm. Can you tell me which is actually the default? > > Regards > > Stephen > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From bsmith at mcs.anl.gov Wed Apr 2 01:58:42 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 1 Apr 2008 23:58:42 -0700 Subject: Non repeatability issue In-Reply-To: <47F0FF08.9000800@unibas.it> References: <4715D89B.8070005@unibas.it> <47D6B3E1.5070606@unibas.it> <9D646B5A-2D5C-4492-82AF-8E732AF848BD@mcs.anl.gov> <47DFA037.7030707@unibas.it> <81BF6B4F-9199-41AA-B571-20186E2DE43E@mcs.anl.gov> <47F0FF08.9000800@unibas.it> Message-ID: <4DC60834-9639-47A1-A0E9-17145DF972B8@mcs.anl.gov> Try a ksp_tol of 1.e-14 instead of 1.e-12? Barry On Mar 31, 2008, at 8:11 AM, Aldo Bonfiglioli wrote: > Barry, Matt > I am back on the Non repeatability issue with answers > to your questions. > >> 2) did you do the -ksp_rtol 1.e-12 at the same time as the - >> vecscatter_reproduce? They >> must be done together. > > The enclosed plot (res_vs_step) shows the mass residual > history versus the Newton step counter. > > For these same runs, the continuation parameter (CFL) shows similar > jumps > being based upon the SER approach, see plot cfl_vs_its.pdf > >> When you just fix the CFL and run Newton runs to completion >> is it stable? > > I have restarted the code from an almost fully converged solution > using infinite CFL and let it run for 30 Newton steps. > The behaviour is much more "reasonable" and the solution > remains within the steady state (see plot restarted....) > >> Then if you ramp up the CFL much more slowly is it stable and Newton >> convergence much smoother? > > I have not tried yet. 
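For readers following the thread: SER (switched evolution relaxation) grows the CFL / pseudo-timestep in proportion to the drop in the nonlinear residual between Newton steps. A minimal sketch of that update is below; the cap and floor are illustrative values, not taken from Aldo's code.

    #include "petsc.h"

    /* SER-style CFL update: grow the CFL as the nonlinear residual drops.
       The limits below are illustrative, not Aldo's settings.            */
    PetscReal SERUpdateCFL(PetscReal cfl,PetscReal rnorm_old,PetscReal rnorm_new)
    {
      PetscReal cflnew = cfl*(rnorm_old/rnorm_new);
      if (cflnew > 1.e12)   cflnew = 1.e12;    /* effectively "infinite" CFL  */
      if (cflnew < 0.1*cfl) cflnew = 0.1*cfl;  /* limit how fast CFL may drop */
      return cflnew;
    }

Because the update is tied directly to the residual ratio, jumps in the residual history translate into jumps of the continuation parameter, which is what the cfl_vs_its plot shows.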
> I know there exist smoother strategies > than SER to rump the continuation parameter > (I know this > http://www.cs.kuleuven.ac.be/publicaties/rapporten/tw/TW304.ps.gz > for instance) > > > Aldo > > -- > Dr. Aldo Bonfiglioli > Dip.to di Ingegneria e Fisica dell'Ambiente (DIFA) > Universita' della Basilicata > V.le dell'Ateneo lucano, 10 85100 Potenza ITALY > tel:+39.0971.205203 fax:+39.0971.205160 > > > < > res_vs_step > .pdf> From amjad11 at gmail.com Sun Apr 6 23:49:58 2008 From: amjad11 at gmail.com (amjad ali) Date: Mon, 7 Apr 2008 09:49:58 +0500 Subject: Installation with Intel or PGI compilers Message-ID: <428810f20804062149o6429f771l4a35903b2ff9dbd@mail.gmail.com> *Hello all, I installed PETSc with intel compilers. Please comment on that what is the difference between the PETSc installed with gnu compilers and the PETSC installed with intel compilers. Any difference in efficiency? or what so ever? What you say if we intall PETSc with PGI compilers and also we use MPI-profiler/debugger (available in PGI Cluster Toolkit) for PETSc applications? Is it possible? and beneficial? with best regards, Amjad Ali.* -------------- next part -------------- An HTML attachment was scrubbed... URL: From petsc-maint at mcs.anl.gov Sun Apr 6 23:59:07 2008 From: petsc-maint at mcs.anl.gov (Satish Balay) Date: Sun, 6 Apr 2008 23:59:07 -0500 (CDT) Subject: Installation with Intel or PGI compilers In-Reply-To: <428810f20804062149o6429f771l4a35903b2ff9dbd@mail.gmail.com> References: <428810f20804062149o6429f771l4a35903b2ff9dbd@mail.gmail.com> Message-ID: On Mon, 7 Apr 2008, amjad ali wrote: > *Hello all, > > I installed PETSc with intel compilers. Please comment on that what is the > difference between the PETSc installed with gnu compilers and the PETSC > installed with intel compilers. Any difference in efficiency? or what so > ever? > > What you say if we intall PETSc with PGI compilers and also we use > MPI-profiler/debugger (available in PGI Cluster Toolkit) for PETSc > applications? Is it possible? and beneficial? We expect Intel compilers to be able to optimize code better than GNU compilers. Perhaps PGI might do the same. Its best to run your code [with PETSc compiled with the desired compilers - and optimization options] and run with -log_summary to compare performance differences. And we have no experience with PGI Cluster Toolkit or its usefulness. Note: The choice of compilers [and their cost/benifit] varies depending on the usage. - For code development - debuggability matters. gnu compilers work reasonably well for this usage [with gdb, valgrind, etc..] - For production runs, a 10% performance difference might not matter for code that runs for less than an hour. But it might be significant for long-running jobs. etc.. Satish From dave.knez at gmail.com Mon Apr 7 04:52:30 2008 From: dave.knez at gmail.com (David Knezevic) Date: Mon, 07 Apr 2008 10:52:30 +0100 Subject: Stalling once linear system becomes a certain size Message-ID: <47F9EEDE.1050604@gmail.com> Hello, I am trying to run a PETSc code on a parallel machine (it may be relevant that each node contains four AMD Opteron Quad-Core 64-bit processors (16 cores in all) as an SMP unit with 32GB of memory) and I'm observing some behaviour I don't understand. 
I'm using PETSC_COMM_SELF in order to construct the same matrix on each processor (and solve the system with a different right-hand side vector on each processor), and when each linear system is around 315x315 (block-sparse), then each linear system is solved very quickly on each processor (approx 7x10^{-4} seconds), but when I increase the size of the linear system to 350x350 (or larger), the linear solves completely stall. I've tried a number of different solvers and preconditioners, but nothing seems to help. Also, this code has worked very well on other machines, although the machines I have used it on before have not had this architecture in which each node is an SMP unit. I was wondering if you have observed this kind of issue before? I'm using PETSc 2.3.3, compiled with the Intel 10.1 compiler. Thanks very much, David From niko.karin at gmail.com Mon Apr 7 08:16:44 2008 From: niko.karin at gmail.com (Nicolas Tardieu) Date: Mon, 7 Apr 2008 09:16:44 -0400 Subject: Installation with Intel or PGI compilers In-Reply-To: <428810f20804062149o6429f771l4a35903b2ff9dbd@mail.gmail.com> References: <428810f20804062149o6429f771l4a35903b2ff9dbd@mail.gmail.com> Message-ID: Hi, I have some troubles using PETSc compiled with Intel compilers (version 10.1) in Fortran language in parallel on a 64 bits machine. The PetscInitialize always fails. In order to make it work, I have to make the following changes in petscconf.h. 320,321c320,321 < #ifndef PETSC_HAVE_IPXFARGC_ < #define PETSC_HAVE_IPXFARGC_ 1 --- > #ifndef PETSC_HAVE_IARGC_ > #define PETSC_HAVE_IARGC_ 1 488,489c488,489 < #ifndef PETSC_HAVE_PXFGETARG_NEW < #define PETSC_HAVE_PXFGETARG_NEW 1 --- > #ifndef PETSC_HAVE_BGL_IARGC > #define PETSC_HAVE_BGL_IARGC 1 Once this is done, PetscInitialize and the rest of the code works fine. Strange, isn't it..... Nicolas 2008/4/7, amjad ali : > > *Hello all, > > I installed PETSc with intel compilers. Please comment on that what is the > difference between the PETSc installed with gnu compilers and the PETSC > installed with intel compilers. Any difference in efficiency? or what so > ever? > > What you say if we intall PETSc with PGI compilers and also we use > MPI-profiler/debugger (available in PGI Cluster Toolkit) for PETSc > applications? Is it possible? and beneficial? > > with best regards, > Amjad Ali.* -------------- next part -------------- An HTML attachment was scrubbed... URL: From balay at mcs.anl.gov Mon Apr 7 08:28:18 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 7 Apr 2008 08:28:18 -0500 (CDT) Subject: Stalling once linear system becomes a certain size In-Reply-To: <47F9EEDE.1050604@gmail.com> References: <47F9EEDE.1050604@gmail.com> Message-ID: On Mon, 7 Apr 2008, David Knezevic wrote: > Hello, > > I am trying to run a PETSc code on a parallel machine (it may be relevant that > each node contains four AMD Opteron Quad-Core 64-bit processors (16 cores in > all) as an SMP unit with 32GB of memory) and I'm observing some behaviour I > don't understand. > > I'm using PETSC_COMM_SELF in order to construct the same matrix on each > processor (and solve the system with a different right-hand side vector on > each processor), and when each linear system is around 315x315 (block-sparse), > then each linear system is solved very quickly on each processor (approx > 7x10^{-4} seconds), but when I increase the size of the linear system to > 350x350 (or larger), the linear solves completely stall. I've tried a number > of different solvers and preconditioners, but nothing seems to help. 
Also, > this code has worked very well on other machines, although the machines I have > used it on before have not had this architecture in which each node is an SMP > unit. I was wondering if you have observed this kind of issue before? > > I'm using PETSc 2.3.3, compiled with the Intel 10.1 compiler. I would sugest running the code in a debugger to determine the exact location where the stall happens [with the minimum number of procs] mpiexec -n 4 ./exe -start_in_debugger By default the above tries to open xterms on the localhost - so to get this working on the cluster - you might need proper ssh-x11-portforwarding setup to the node, and then use the extra command line option '-display' [when the job kinda hangs - I would do ctrl-c in gdb and look at the stack trace on each mpi-thread] Satish From balay at mcs.anl.gov Mon Apr 7 08:29:31 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 7 Apr 2008 08:29:31 -0500 (CDT) Subject: Installation with Intel or PGI compilers In-Reply-To: References: <428810f20804062149o6429f771l4a35903b2ff9dbd@mail.gmail.com> Message-ID: Please send the corresponding confiure.log to petsc-maint at mcs.anl.gov Satish On Mon, 7 Apr 2008, Nicolas Tardieu wrote: > Hi, > > I have some troubles using PETSc compiled with Intel compilers (version > 10.1) in Fortran language in parallel on a 64 bits machine. The > PetscInitialize always fails. In order to make it work, I have to make the > following changes in petscconf.h. > > 320,321c320,321 > < #ifndef PETSC_HAVE_IPXFARGC_ > < #define PETSC_HAVE_IPXFARGC_ 1 > --- > > #ifndef PETSC_HAVE_IARGC_ > > #define PETSC_HAVE_IARGC_ 1 > 488,489c488,489 > < #ifndef PETSC_HAVE_PXFGETARG_NEW > < #define PETSC_HAVE_PXFGETARG_NEW 1 > --- > > #ifndef PETSC_HAVE_BGL_IARGC > > #define PETSC_HAVE_BGL_IARGC 1 > > Once this is done, PetscInitialize and the rest of the code works fine. > Strange, isn't it..... > > Nicolas From knepley at gmail.com Mon Apr 7 08:34:19 2008 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 7 Apr 2008 08:34:19 -0500 Subject: Stalling once linear system becomes a certain size In-Reply-To: References: <47F9EEDE.1050604@gmail.com> Message-ID: It sounds like he is saying that the iterative solvers fail to converge. It could be that the systems become much more ill-conditioned. When solving anything, first use LU -ksp_type preonly -pc_type lu to determine if the system is consistent. Then use something simple, like GMRES by itself -ksp_type gmres -pc_type none -ksp_monitor_singular_value -ksp_gmres_restart 500 to get an idea of the condition number. Then start trying other solvers and PCs. Matt On Mon, Apr 7, 2008 at 8:28 AM, Satish Balay wrote: > > On Mon, 7 Apr 2008, David Knezevic wrote: > > > Hello, > > > > I am trying to run a PETSc code on a parallel machine (it may be relevant that > > each node contains four AMD Opteron Quad-Core 64-bit processors (16 cores in > > all) as an SMP unit with 32GB of memory) and I'm observing some behaviour I > > don't understand. > > > > I'm using PETSC_COMM_SELF in order to construct the same matrix on each > > processor (and solve the system with a different right-hand side vector on > > each processor), and when each linear system is around 315x315 (block-sparse), > > then each linear system is solved very quickly on each processor (approx > > 7x10^{-4} seconds), but when I increase the size of the linear system to > > 350x350 (or larger), the linear solves completely stall. 
I've tried a number > > of different solvers and preconditioners, but nothing seems to help. Also, > > this code has worked very well on other machines, although the machines I have > > used it on before have not had this architecture in which each node is an SMP > > unit. I was wondering if you have observed this kind of issue before? > > > > I'm using PETSc 2.3.3, compiled with the Intel 10.1 compiler. > > I would sugest running the code in a debugger to determine the exact > location where the stall happens [with the minimum number of procs] > > mpiexec -n 4 ./exe -start_in_debugger > > By default the above tries to open xterms on the localhost - so to get > this working on the cluster - you might need proper > ssh-x11-portforwarding setup to the node, and then use the extra > command line option '-display' > > [when the job kinda hangs - I would do ctrl-c in gdb and look at the > stack trace on each mpi-thread] > > Satish > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From balay at mcs.anl.gov Mon Apr 7 09:42:27 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 7 Apr 2008 09:42:27 -0500 (CDT) Subject: Stalling once linear system becomes a certain size In-Reply-To: References: <47F9EEDE.1050604@gmail.com> Message-ID: Matt, > > I'm using PETSC_COMM_SELF in order to construct the same matrix > > on each processor (and solve the system with a different > > right-hand side vector on each processor), So its a bunch of similar sequential solves - over PETSC_COMM_SELF. So a seq solve on a given mpi-thread should not affect another seq solve on another thread.. Satish On Mon, 7 Apr 2008, Matthew Knepley wrote: > It sounds like he is saying that the iterative solvers fail to > converge. It could be > that the systems become much more ill-conditioned. When solving anything, > first use LU > > -ksp_type preonly -pc_type lu > > to determine if the system is consistent. Then use something simple, like > GMRES by itself > > -ksp_type gmres -pc_type none -ksp_monitor_singular_value > -ksp_gmres_restart 500 > > to get an idea of the condition number. Then start trying other solvers and PCs. > > Matt > > On Mon, Apr 7, 2008 at 8:28 AM, Satish Balay wrote: > > > > On Mon, 7 Apr 2008, David Knezevic wrote: > > > > > Hello, > > > > > > I am trying to run a PETSc code on a parallel machine (it may be relevant that > > > each node contains four AMD Opteron Quad-Core 64-bit processors (16 cores in > > > all) as an SMP unit with 32GB of memory) and I'm observing some behaviour I > > > don't understand. > > > > > > I'm using PETSC_COMM_SELF in order to construct the same matrix on each > > > processor (and solve the system with a different right-hand side vector on > > > each processor), and when each linear system is around 315x315 (block-sparse), > > > then each linear system is solved very quickly on each processor (approx > > > 7x10^{-4} seconds), but when I increase the size of the linear system to > > > 350x350 (or larger), the linear solves completely stall. I've tried a number > > > of different solvers and preconditioners, but nothing seems to help. Also, > > > this code has worked very well on other machines, although the machines I have > > > used it on before have not had this architecture in which each node is an SMP > > > unit. I was wondering if you have observed this kind of issue before? 
> > > > > > I'm using PETSc 2.3.3, compiled with the Intel 10.1 compiler. > > > > I would sugest running the code in a debugger to determine the exact > > location where the stall happens [with the minimum number of procs] > > > > mpiexec -n 4 ./exe -start_in_debugger > > > > By default the above tries to open xterms on the localhost - so to get > > this working on the cluster - you might need proper > > ssh-x11-portforwarding setup to the node, and then use the extra > > command line option '-display' > > > > [when the job kinda hangs - I would do ctrl-c in gdb and look at the > > stack trace on each mpi-thread] > > > > Satish > > > > > > > > From rlmackie862 at gmail.com Mon Apr 7 14:27:01 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Mon, 07 Apr 2008 12:27:01 -0700 Subject: error in creating 3d DA Message-ID: <47FA7585.3060700@gmail.com> I've run into a problem with my code where, for a smaller problem, it bombs out in creating a 3D DA (with error message about the partition being too fine in the z direction) for the case where np=121, but works fine for the case np=484. I would think that the creation of the DA should work fine for the smaller number of processors as well, but maybe there is a bug in the logic? Randy From knepley at gmail.com Mon Apr 7 15:20:06 2008 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 7 Apr 2008 15:20:06 -0500 Subject: error in creating 3d DA In-Reply-To: <47FA7585.3060700@gmail.com> References: <47FA7585.3060700@gmail.com> Message-ID: On Mon, Apr 7, 2008 at 2:27 PM, Randall Mackie wrote: > I've run into a problem with my code where, for a smaller problem, it > bombs out in creating a 3D DA (with error message about the partition being > too fine in the z direction) for the case where np=121, but works fine > for the case np=484. > > I would think that the creation of the DA should work fine for the smaller > number of processors as well, but maybe there is a bug in the logic? DA does only the very simplest partitioning. Thus, the number of processors must factor into np = np_x * np_y * np_z. However, things will be clearer if you send the actual error message to petsc-maint at mcs.anl.gov. Matt > Randy -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From tyoung at ippt.gov.pl Tue Apr 8 05:40:10 2008 From: tyoung at ippt.gov.pl (Toby D. Young) Date: Tue, 8 Apr 2008 12:40:10 +0200 Subject: MatTranspose Message-ID: <20080408124010.7d183e23@rav.ippt.gov.pl> Hello all. I confused about the statement on MatTranspose() on the manual pages at http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatTranspose.html where for #include "petscmat.h" PetscErrorCode MatTranspose(Mat mat,Mat *B) is the statement: Notes If you pass in PETSC_NULL for B an in-place transpose in mat will be done Does this mean that if I pass PETSC_NULL then the matrix "A" will be returned as its own transpose? Does this save memory if I do not need the original matrix and only its transpose? If not, is there an efficient way to destroy the original matrix, thus keeping the transpose only? Can anyone please clarify for me what this statement means? ...and finally thanks to all for answering my previous confused questions. :-) Best, Toby -- Toby D. Young - Adiunkt (Assistant Professor) Department of Computational Science Institute of Fundamental Technological Research Polish Academy of Sciences Room 206, ul. 
Swietokrzyska 21 00-049 Warszawa, Polska +48 22 826 12 81 ext. 184 http://rav.ippt.gov.pl/~tyoung From knepley at gmail.com Tue Apr 8 07:34:27 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 8 Apr 2008 07:34:27 -0500 Subject: MatTranspose In-Reply-To: <20080408124010.7d183e23@rav.ippt.gov.pl> References: <20080408124010.7d183e23@rav.ippt.gov.pl> Message-ID: On Tue, Apr 8, 2008 at 5:40 AM, Toby D. Young wrote: > > > Hello all. > > I confused about the statement on MatTranspose() on the manual pages at > > http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatTranspose.html > > where for > > #include "petscmat.h" > PetscErrorCode MatTranspose(Mat mat,Mat *B) > > is the statement: > > Notes > If you pass in PETSC_NULL for B an in-place transpose in mat will be > done > > Does this mean that if I pass PETSC_NULL then the matrix "A" will be > returned as its own transpose? Does this save memory if I do not need Yes. > the original matrix and only its transpose? If not, is there an Yes. Matt > efficient way to destroy the original matrix, thus keeping the > transpose only? > > Can anyone please clarify for me what this statement means? > > ...and finally thanks to all for answering my previous confused > questions. :-) > > Best, > Toby > > > > -- > > Toby D. Young - Adiunkt (Assistant Professor) > Department of Computational Science > Institute of Fundamental Technological Research > Polish Academy of Sciences > Room 206, ul. Swietokrzyska 21 > 00-049 Warszawa, Polska > > +48 22 826 12 81 ext. 184 > http://rav.ippt.gov.pl/~tyoung > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From balay at mcs.anl.gov Tue Apr 8 08:52:06 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Tue, 8 Apr 2008 08:52:06 -0500 (CDT) Subject: MatTranspose In-Reply-To: References: <20080408124010.7d183e23@rav.ippt.gov.pl> Message-ID: On Tue, 8 Apr 2008, Matthew Knepley wrote: > On Tue, Apr 8, 2008 at 5:40 AM, Toby D. Young wrote: > > > > > > Hello all. > > > > I confused about the statement on MatTranspose() on the manual pages at > > > > http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatTranspose.html > > > > where for > > > > #include "petscmat.h" > > PetscErrorCode MatTranspose(Mat mat,Mat *B) > > > > is the statement: > > > > Notes > > If you pass in PETSC_NULL for B an in-place transpose in mat will be > > done > > > > Does this mean that if I pass PETSC_NULL then the matrix "A" will be > > returned as its own transpose? Does this save memory if I do not need > > Yes. > > > the original matrix and only its transpose? If not, is there an > > Yes. > > Matt > > > efficient way to destroy the original matrix, thus keeping the > > transpose only? Jut a note: MatTranspose(A,PETSC_NULL) is *almost* equivalent to: MatTranspose(A,&B) MatDestroy(A) A=B So there is temporary increase in memory usage - until the original matrix is deallocated. Satish From tyoung at ippt.gov.pl Tue Apr 8 09:15:37 2008 From: tyoung at ippt.gov.pl (Toby D. Young) Date: Tue, 8 Apr 2008 16:15:37 +0200 (CEST) Subject: MatTranspose In-Reply-To: References: <20080408124010.7d183e23@rav.ippt.gov.pl> Message-ID: Problem cleared up. :-) Thank you Matt and Satish! Best, Toby ----- Toby D. 
Young - Adiunkt (Assistant Professor) Department of Computational Science Institute of Fundamental Technological Research Polish Academy of Sciences Room 206, ul. Swietokrzyska 21 00-049 Warszawa, Polska +48 22 826 12 81 ext. 184 http://rav.ippt.gov.pl/~tyoung From jinzishuai at yahoo.com Tue Apr 8 19:16:41 2008 From: jinzishuai at yahoo.com (Shi Jin) Date: Tue, 8 Apr 2008 17:16:41 -0700 (PDT) Subject: Further question about PC with Jaocbi Row Sum Message-ID: <158168.69319.qm@web36208.mail.mud.yahoo.com> Hi there, First of all, I want to thank Matt for his previous suggestion on the use of -pc_jacobi_rowsum 1 option. Now I have a more theoretical question I hope you may help although it does not necessarily connects to PETsc directly. Basically, I am trying to speed up the solution of a finite element mass matrix, which is constructed on a second order tetrahedral element. My idea is to use the lumped mass matrix to precondition it. However, it does not seem to work originally but I guessed it may has something to do with the fact that it is second order since the theory for second order lumped mass matrix is not so clear, at least to me. So I decided to work out the linear element first, since everything there is mathematically well established. OK. I first constructed a mass matrix based on linear elements. I can construct its lumped mass matrix by three methods: 1. sum of each row 2. scale the diagonal entry 3. use a nodal quadrature rule to construct a diagonal matrix These three methods turned out to produce identical diagonal matrices as the lumped mass matrix, just as the theory predicts. So perfect! I can further test that the lumped mass matrix has similar eigenvalues to the original consistent mass matrix, although different. And the solutions of a linear system is quite similar too. So I understand the reason lots of people replace solving the consistent mass matrix with solving the lumped one to achieve much improved efficiency but without losing much of the accuracy. So far, it is all making sense. I naturally think it could be very helpful if I can use this lumped mass matrix as the prediction matrix for the consistent mass matrix solver, where the consistent matrix is kept to have the better accuracy. However, my tests show that it does not help at all. I tried several ways to do it within PETSc, such as setting KSPSetOperators( solM, M, lumpedM SAME_PRECONDITIONER); or directly use the -pc_type jacobi -pc_jacobi_rowsum 1 to the built-in method. These two methods turned out to be equivalent but they both produce less efficient solutions: it actually took twice more steps to converge than without these options. This is quite puzzling to me. Although I have to admit that I have seen a lot of replacing the consistent mass matrix with the lumped one in the literature but have not seen much of using the lumped mass matrix as a preconditioner. Maybe using the lumped matrix for preconditioning simply does not work? I would love to hear some comment here. If that's all, I don't feel too bad. Then I came back to the second order elements since that's what I want to use. Accidentally I decided to try to solve the second order consistent mass matrix with the -pc_type jacobi -pc_jacobi_rowsum 1 option. Bang! It converges almost three times faster. For a particular system, it usually converges in 9-10 iterations and now it converges in 2-3 iterations. This is amazing! But I don't know why it is so. If that's all, I would be just happy. 
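For concreteness, the first approach mentioned above (consistent matrix as the operator, lumped matrix as the preconditioning matrix) looks roughly like this with the PETSc 2.3.3-era calling sequence; the variable names are illustrative:

    #include "petscksp.h"

    /* Solve with the consistent mass matrix M, but let the preconditioner
       be built from the lumped (diagonal) matrix Mlumped.                */
    PetscErrorCode SolveMass(Mat M,Mat Mlumped,Vec b,Vec x)
    {
      KSP solM;
      KSPCreate(PETSC_COMM_WORLD,&solM);
      KSPSetOperators(solM,M,Mlumped,SAME_PRECONDITIONER);
      KSPSetType(solM,KSPCG);
      KSPSetFromOptions(solM);
      KSPSolve(solM,b,x);
      KSPDestroy(solM);
      return 0;
    }

With -pc_type jacobi this uses the diagonal of Mlumped, i.e. the row sums of M, which is why it behaves the same as -pc_type jacobi -pc_jacobi_rowsum 1 applied to M itself.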
Then I ran my single particle sedimentation code with -pc_type jacobi -pc_jacobi_rowsum 1 and it does run a lot faster. However, the results I got are slightly different from what I used to, which is weird since the only thing changed is the preconditioner while the same linear system was solved. I tried several -pc_type options, and they are all consistent with the old one. So I am a little bit hesitant adapting this new speed up method. What troubles me most is that the new simulation results are actually even closer to the experiments we are comparing, which may suggest that the row sum PC is even better. But this is just one test case I would rather believe it happens to cause errors in the direction to compensate other simulation errors. If I had other well established test case which unfortunately I don't, I would imagine it may work differently. So my strongest puzzle is that how could a change in the pre-conditioner make such an observable change in the solutions. I understand different PCs produce different solutions but they should be numerically very close and non-detectable on a physical quantity level plot, right? Is there something particular about this rowsum method? I apologize about this lengthy email but I do hope to have some in depth scientific discussion. Thank you very much. Shi Jin, PhD ____________________________________________________________________________________ You rock. That's why Blockbuster's offering you one month of Blockbuster Total Access, No Cost. http://tc.deals.yahoo.com/tc/blockbuster/text5.com From bsmith at mcs.anl.gov Tue Apr 8 21:38:00 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 8 Apr 2008 21:38:00 -0500 Subject: Further question about PC with Jaocbi Row Sum In-Reply-To: <158168.69319.qm@web36208.mail.mud.yahoo.com> References: <158168.69319.qm@web36208.mail.mud.yahoo.com> Message-ID: On Apr 8, 2008, at 7:16 PM, Shi Jin wrote: > Hi there, > > First of all, I want to thank Matt for his previous suggestion on > the use of -pc_jacobi_rowsum 1 option. Now I have a more theoretical > question I hope you may help although it does not necessarily > connects to PETsc directly. > > Basically, I am trying to speed up the solution of a finite element > mass matrix, which is constructed on a second order tetrahedral > element. My idea is to use the lumped mass matrix to precondition > it. However, it does not seem to work originally but I guessed it > may has something to do with the fact that it is second order since > the theory for second order lumped mass matrix is not so clear, at > least to me. So I decided to work out the linear element first, > since everything there is mathematically well established. > > OK. I first constructed a mass matrix based on linear elements. I > can construct its lumped mass matrix by three methods: > 1. sum of each row > 2. scale the diagonal entry > 3. use a nodal quadrature rule to construct a diagonal matrix > These three methods turned out to produce identical diagonal > matrices as the lumped mass matrix, just as the theory predicts. So > perfect! > I can further test that the lumped mass matrix has similar > eigenvalues to the original consistent mass matrix, although > different. And the solutions of a linear system is quite similar > too. So I understand the reason lots of people replace solving the > consistent mass matrix with solving the lumped one to achieve much > improved efficiency but without losing much of the accuracy. So far, > it is all making sense. 
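A quick 1D sanity check of the three lumpings quoted above: for a linear element of length h the consistent element mass matrix is (h/6)*[2 1; 1 2]. Row summing gives h/2 per node; keeping the diagonal (h/3 per node) and rescaling so the total mass h is preserved again gives h/2; and trapezoidal (nodal) quadrature produces diag(h/2, h/2) directly. All three recipes coincide for linear elements, exactly as reported; for quadratic elements they no longer agree in general, which is part of why the second-order case is murkier.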
> > I naturally think it could be very helpful if I can use this lumped > mass matrix as the prediction matrix for the consistent mass > matrix solver, where the consistent matrix is kept to have > the better accuracy. However, my tests show that it does not help > at all. I tried several ways to do it within PETSc, such as setting > KSPSetOperators( solM, M, lumpedM SAME_PRECONDITIONER); > or directly use the -pc_type jacobi -pc_jacobi_rowsum 1 to the > built-in method. > These two methods turned out to be equivalent but they both produce > less efficient solutions: it actually took twice more steps to > converge than without these options. > This is quite puzzling to me. Although I have to admit that I have > seen a lot of replacing the consistent mass matrix with the lumped > one in the literature but have not seen much of using the lumped > mass matrix as a preconditioner. Maybe using the lumped matrix for > preconditioning simply does not work? I would love to hear some > comment here. The lumped mass matrix being a good replacement for the mass matrix is a question about approximation. How good is each to the continuous L_2 norm? It isn't really a question about how close each is to the other. Being a good preconditioner is a linear algebra question, what is the (complex) relationship between the eigenvalues and eigenvectors of the two matrices? (what happens to the eigenvalues of B^{-1} M?) I think these two questions are distinct, intuitively they seem to be related, but mathematically I don't think there is a direct relationship so I am not surprised by your observations. By the way, I have seen cases where using the lumped mass matrix resulted in BETTER approximation to the continuous solution then using the "true" mass matrix; again this is counter intuitive but there is nothing mathematically that says it shouldn't be. > > > If that's all, I don't feel too bad. Then I came back to the second > order elements since that's what I want to use. Accidentally I > decided to try to solve the second order consistent mass matrix > with the -pc_type jacobi -pc_jacobi_rowsum 1 option. Bang! It > converges almost three times faster. For a particular system, it > usually converges in 9-10 iterations and now it converges in 2-3 > iterations. This is amazing! But I don't know why it is so. > > If that's all, I would be just happy. Then I ran my single particle > sedimentation code with -pc_type jacobi -pc_jacobi_rowsum 1 and it > does run a lot faster. However, the results I got are slightly > different from what I used to, which is weird since the only thing > changed is the preconditioner while the same linear system was > solved. I tried several -pc_type options, and they are all > consistent with the old one. So I am a little bit hesitant adapting > this new speed up method. What troubles me most is that the new > simulation results are actually even closer to the experiments we > are comparing, which may suggest that the row sum PC is even better. > But this is just one test case I would rather believe it happens to > cause errors in the direction to compensate other simulation > errors. If I had other well established test case which > unfortunately I don't, I would imagine it may work differently. > > So my strongest puzzle is that how could a change in the pre- > conditioner make such an observable change in the solutions. I > understand different PCs produce different solutions but they should > be numerically very close and non-detectable on a physical quantity > level plot, right? 
Yes, so long as you use a tight enough convergence tolerance with KSPSetTolerances(). Also by default with most KSP solvers the "preconditioned" residual norm is used to determine convergence, thus in some way the preconditioner helps determines when the KSP stops. > Is there something particular about this rowsum method? No. If you use a -ksp_rtol of 1.e-12 and still get different answers, this needs to be investigated. Barry > > > I apologize about this lengthy email but I do hope to have some in > depth scientific discussion. > Thank you very much. > > Shi Jin, PhD > > > > > > ____________________________________________________________________________________ > You rock. That's why Blockbuster's offering you one month of > Blockbuster Total Access, No Cost. > http://tc.deals.yahoo.com/tc/blockbuster/text5.com > From recrusader at gmail.com Wed Apr 9 12:02:41 2008 From: recrusader at gmail.com (Yujie) Date: Wed, 9 Apr 2008 10:02:41 -0700 Subject: about MatMatMultTranspose_seqdense_seqdense() Message-ID: <7ff0ee010804091002u69766386gfb1b31c1ede7b755@mail.gmail.com> hi, everyone My codes are as follows: ierr=MatGetSubMatrices(tempM_mat,1,&is_row,&is_col,MAT_INITIAL_MATRIX,&tempA_mat); CHKERRQ(ierr); A_mat=*tempA_mat; ierr=MatDestroy(tempM_mat);CHKERRQ(ierr); ierr=MatGetSize(A_mat,&M,&N);CHKERRQ(ierr); //AtA ierr=MatMatMultTranspose(A_mat,A_mat,MAT_INITIAL_MATRIX,fill,&AtA_mat); I get a seqdense submatrix "A_mat" by MatGetSubMatrices(). I further get At*A by MatMatMultTranspose(). However, I meet an error: " ** On entry to DGEMM parameter number 8 had an illegal value" I debug my codes. In MatMatMultTranspose_seqdense_seqdense(), the codes call "BLASgemm_("T","N",&m,&n,&k,&_DOne,a->v,&a->lda,b->v,&b->lda,&_DZero,c->v,&c->lda);" I don't know the meaning of the 8th parameters"&a->lda". In my codes, its value is "0". Are there any problems in my codes? could you give me some advice? thanks a lot. Regards, Yujie -------------- next part -------------- An HTML attachment was scrubbed... URL: From Amit.Itagi at seagate.com Wed Apr 9 13:35:44 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 9 Apr 2008 14:35:44 -0400 Subject: DA question Message-ID: Hi, Is it possible to use DA to perform finite differences on two staggered regular grids (as in the electromagnetic finite difference time domain method) ? Surrounding nodes from one grid are used to update the value in the dual grid. In addition, local manipulations need to be done on the nodal values. Thanks Rgds, Amit From berend at chalmers.se Wed Apr 9 13:59:49 2008 From: berend at chalmers.se (Berend van Wachem) Date: Wed, 09 Apr 2008 20:59:49 +0200 Subject: DA question In-Reply-To: References: Message-ID: <47FD1225.4020704@chalmers.se> Dear Amit, Could you explain how the two grids are attached? I am using multiple DA's for multiple structured grids glued together. I've done the gluing with setting up various IS objects. From the multiple DA's, one global variable vector is formed. Is that what you are looking for? Best regards, Berend. Amit.Itagi at seagate.com wrote: > Hi, > > Is it possible to use DA to perform finite differences on two staggered > regular grids (as in the electromagnetic finite difference time domain > method) ? Surrounding nodes from one grid are used to update the value in > the dual grid. In addition, local manipulations need to be done on the > nodal values. 
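For readers who want to try the gluing Berend describes, one common way to combine the global vectors of two DAs into a single MPI vector uses an IS plus a VecScatter per grid. This is generic PETSc usage rather than Berend's exact construction; the index arrays giving each DA's target positions in the combined vector must be built from the application's own grid layout, and the argument orders follow the 2.3.3-era API:

    #include "petscda.h"

    /* Sketch only: wiring shown, cleanup and error checking omitted.
       idx1/idx2 map each DA's entries to positions in the combined vector.
       In a real code v1,v2 are filled first and the scatters are reused. */
    PetscErrorCode GlueTwoDAs(DA da1,DA da2,PetscInt nloc,
                              PetscInt n1,PetscInt idx1[],
                              PetscInt n2,PetscInt idx2[],Vec *vglob)
    {
      Vec        v1,v2;
      IS         is1,is2;
      VecScatter sc1,sc2;

      DACreateGlobalVector(da1,&v1);
      DACreateGlobalVector(da2,&v2);
      VecCreateMPI(PETSC_COMM_WORLD,nloc,PETSC_DETERMINE,vglob);
      ISCreateGeneral(PETSC_COMM_WORLD,n1,idx1,&is1);
      ISCreateGeneral(PETSC_COMM_WORLD,n2,idx2,&is2);
      VecScatterCreate(v1,PETSC_NULL,*vglob,is1,&sc1);  /* all of v1 -> is1 */
      VecScatterCreate(v2,PETSC_NULL,*vglob,is2,&sc2);  /* all of v2 -> is2 */
      VecScatterBegin(v1,*vglob,INSERT_VALUES,SCATTER_FORWARD,sc1);
      VecScatterEnd(v1,*vglob,INSERT_VALUES,SCATTER_FORWARD,sc1);
      VecScatterBegin(v2,*vglob,INSERT_VALUES,SCATTER_FORWARD,sc2);
      VecScatterEnd(v2,*vglob,INSERT_VALUES,SCATTER_FORWARD,sc2);
      return 0;
    }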
> > Thanks > > Rgds, > Amit > From knepley at gmail.com Wed Apr 9 14:10:19 2008 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 9 Apr 2008 14:10:19 -0500 Subject: DA question In-Reply-To: References: Message-ID: DAs only know about vertex values. You can simulate staggered grids by storing one grid on top of the other, so each vertex has two values. Matt On Wed, Apr 9, 2008 at 1:35 PM, wrote: > > Hi, > > Is it possible to use DA to perform finite differences on two staggered > regular grids (as in the electromagnetic finite difference time domain > method) ? Surrounding nodes from one grid are used to update the value in > the dual grid. In addition, local manipulations need to be done on the > nodal values. > > Thanks > > Rgds, > Amit > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From jed at 59A2.org Wed Apr 9 14:13:37 2008 From: jed at 59A2.org (Jed Brown) Date: Wed, 9 Apr 2008 21:13:37 +0200 Subject: Forming a sparse approximation of a MatShell Message-ID: <20080409191337.GA6137@brakk.ethz.ch> I'm trying to improve the preconditioning of my spectral collocation method for non-Newtonian incompressible Stokes flow. My current algorithm uses MatShell for the full Jacobian as well as each of its blocks [A B1'; B2 0] and the Schur complement S = -B2*A*B1'. I needed a preconditioner for A so I thought I'd solve the same problem using finite differences on the Chebyshev nodes. In reality, the stencil is really ugly in 3D so I just used a simpler elliptic operator. This works okay, but it's performance decays significantly as I increase the continuation parameter. Also, dealing with general boundary conditions is rather tricky and it seems to be a much weaker preconditioner when I have mixed boundary conditions. To rectify this, I tried a finite element discretization on the Chebyshev nodes (using Q1 elements). This must be scaled by the inverse (lumped) mass matrix due to the collocation nature of the spectral method. Strangely, even though it captures all the terms in the Jacobian, it is slightly weaker than the finite difference version. At least it is less error-prone and boundary conditions are easier to get right. Regardless, forming the explicit matrix separately from the spectral matrix causes a duplication of concepts that have to be kept in sync. So I started thinking, the spectral matrix is pretty cheap to apply a few times, so perhaps I can use a coloring to compute a sparse approximation. However, the documentation I found is using the function from the SNES context to form the matrix. In my case, the entire Jacobian doesn't help, I just want an approximation of A. (A itself is full, but implemented via FFT.) What is the correct way to do this? Should I just stick with finite differences or finite elements? Also, any ideas for preconditioning S? It's condition number also grows significantly with the continuation parameter. Thanks, Jed -------------- next part -------------- A non-text attachment was scrubbed... 
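For readers unfamiliar with the setup Jed describes: the Schur complement is usually exposed to KSP as a shell matrix whose multiply applies B1', an inner solve with A, and B2 in sequence (A^{-1} is written out explicitly here; Jed's shorthand -B2*A*B1' presumably denotes the same thing, since the Schur complement of [A B1'; B2 0] involves the inverse of A). A rough sketch with the 2.3.3-era API; the context struct and names are illustrative:

    #include "petscksp.h"

    typedef struct { Mat B1,B2; KSP innerA; Vec t1,t2; } SchurCtx;

    /* y = S x = -B2 A^{-1} B1' x   (sketch; error checking omitted) */
    PetscErrorCode SchurMult(Mat S,Vec x,Vec y)
    {
      SchurCtx *ctx;
      MatShellGetContext(S,(void**)&ctx);
      MatMultTranspose(ctx->B1,x,ctx->t1);    /* t1 = B1' x     */
      KSPSolve(ctx->innerA,ctx->t1,ctx->t2);  /* t2 = A^{-1} t1 */
      MatMult(ctx->B2,ctx->t2,y);             /* y  = B2 t2     */
      VecScale(y,-1.0);
      return 0;
    }

    PetscErrorCode CreateSchurShell(PetscInt m,PetscInt n,PetscInt M,PetscInt N,
                                    SchurCtx *ctx,Mat *S)
    {
      MatCreateShell(PETSC_COMM_WORLD,m,n,M,N,(void*)ctx,S);
      MatShellSetOperation(*S,MATOP_MULT,(void(*)(void))SchurMult);
      return 0;
    }

Since KSP only ever sees MatMult for such an S, whatever preconditions it has to be supplied separately, which is exactly the difficulty Jed raises.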
Name: not available Type: application/pgp-signature Size: 197 bytes Desc: not available URL: From Amit.Itagi at seagate.com Wed Apr 9 14:38:56 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 9 Apr 2008 15:38:56 -0400 Subject: DA question In-Reply-To: <47FD1225.4020704@chalmers.se> Message-ID: Hi Berend, A detailed explanation of the finite difference scheme is given here : http://en.wikipedia.org/wiki/Finite-difference_time-domain_method Thanks Rgds, Amit Berend van Wachem To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: DA question 04/09/2008 02:59 PM Please respond to petsc-users at mcs.a nl.gov Dear Amit, Could you explain how the two grids are attached? I am using multiple DA's for multiple structured grids glued together. I've done the gluing with setting up various IS objects. From the multiple DA's, one global variable vector is formed. Is that what you are looking for? Best regards, Berend. Amit.Itagi at seagate.com wrote: > Hi, > > Is it possible to use DA to perform finite differences on two staggered > regular grids (as in the electromagnetic finite difference time domain > method) ? Surrounding nodes from one grid are used to update the value in > the dual grid. In addition, local manipulations need to be done on the > nodal values. > > Thanks > > Rgds, > Amit > From schuang at ats.ucla.edu Wed Apr 9 14:18:42 2008 From: schuang at ats.ucla.edu (Shao-Ching Huang) Date: Wed, 9 Apr 2008 12:18:42 -0700 Subject: about MatMatMultTranspose_seqdense_seqdense() In-Reply-To: <7ff0ee010804091002u69766386gfb1b31c1ede7b755@mail.gmail.com> References: <7ff0ee010804091002u69766386gfb1b31c1ede7b755@mail.gmail.com> Message-ID: <20080409191842.GA24448@ats.ucla.edu> In BLAS API, the eight parameter for DGEMM is the physically-allocated leading (in Fortran sense) dimension of matrix B, as in C=(alpha)*A*B + (beta)*C. See the comments in http://www.netlib.org/blas/dgemm.f Shao-Ching On Wed, Apr 09, 2008 at 10:02:41AM -0700, Yujie wrote: > hi, everyone > > My codes are as follows: > > ierr=MatGetSubMatrices(tempM_mat,1,&is_row,&is_col,MAT_INITIAL_MATRIX,&tempA_mat); > CHKERRQ(ierr); > A_mat=*tempA_mat; > ierr=MatDestroy(tempM_mat);CHKERRQ(ierr); > ierr=MatGetSize(A_mat,&M,&N);CHKERRQ(ierr); > //AtA > ierr=MatMatMultTranspose(A_mat,A_mat,MAT_INITIAL_MATRIX,fill,&AtA_mat); > > I get a seqdense submatrix "A_mat" by > MatGetSubMatrices(). I further get At*A by MatMatMultTranspose(). > However, I meet an error: > " ** On entry to DGEMM parameter number 8 had an illegal value" > > I debug my codes. > In MatMatMultTranspose_seqdense_seqdense(), the codes call > "BLASgemm_("T","N",&m,&n,&k,&_DOne,a->v,&a->lda,b->v,&b->lda,&_DZero,c->v,&c->lda);" > I don't know the meaning of the 8th parameters"&a->lda". In my codes, its > value is "0". > > Are there any problems in my codes? could you give me some advice? thanks a lot. > > Regards, > Yujie From bsmith at mcs.anl.gov Wed Apr 9 15:10:23 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 9 Apr 2008 15:10:23 -0500 Subject: Forming a sparse approximation of a MatShell In-Reply-To: <20080409191337.GA6137@brakk.ethz.ch> References: <20080409191337.GA6137@brakk.ethz.ch> Message-ID: Jed, The Mat coloring code can also be used directly, not through SNES. Once you have the coloring for the matrix (you can get that with MatGetColoring(), of course, this assumes you have already set a nonzero pattern for your matrix)). 
Call MatFDColoringCreate() then MatFDColoringSetFunction(), MatFDColoringSetFromOptions() and then MatFDColoringApply(). Good luck, Barry On Apr 9, 2008, at 2:13 PM, Jed Brown wrote: > I'm trying to improve the preconditioning of my spectral collocation > method for > non-Newtonian incompressible Stokes flow. My current algorithm uses > MatShell > for the full Jacobian as well as each of its blocks [A B1'; B2 0] > and the Schur > complement S = -B2*A*B1'. I needed a preconditioner for A so I > thought I'd > solve the same problem using finite differences on the Chebyshev > nodes. In > reality, the stencil is really ugly in 3D so I just used a simpler > elliptic > operator. This works okay, but it's performance decays > significantly as I > increase the continuation parameter. Also, dealing with general > boundary > conditions is rather tricky and it seems to be a much weaker > preconditioner when > I have mixed boundary conditions. To rectify this, I tried a finite > element > discretization on the Chebyshev nodes (using Q1 elements). This > must be scaled > by the inverse (lumped) mass matrix due to the collocation nature of > the > spectral method. Strangely, even though it captures all the terms > in the > Jacobian, it is slightly weaker than the finite difference version. > At least it > is less error-prone and boundary conditions are easier to get right. > Regardless, forming the explicit matrix separately from the spectral > matrix > causes a duplication of concepts that have to be kept in sync. So I > started > thinking, the spectral matrix is pretty cheap to apply a few times, > so perhaps I > can use a coloring to compute a sparse approximation. However, the > documentation I found is using the function from the SNES context to > form the > matrix. In my case, the entire Jacobian doesn't help, I just want an > approximation of A. (A itself is full, but implemented via FFT.) > What is the > correct way to do this? Should I just stick with finite differences > or finite > elements? > > Also, any ideas for preconditioning S? It's condition number also > grows > significantly with the continuation parameter. > > Thanks, > > Jed From rlmackie862 at gmail.com Wed Apr 9 15:09:59 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Wed, 09 Apr 2008 13:09:59 -0700 Subject: DA question In-Reply-To: References: Message-ID: <47FD2297.1010602@gmail.com> Hi Amit, Why do you need two staggered grids? I do EM finite difference frequency domain modeling on a staggered grid using just one DA. Works perfectly fine. There are some grid points that are not used, but you just set them to zero and put a 1 on the diagonal of the coefficient matrix. Randy Amit.Itagi at seagate.com wrote: > Hi Berend, > > A detailed explanation of the finite difference scheme is given here : > > http://en.wikipedia.org/wiki/Finite-difference_time-domain_method > > > Thanks > > Rgds, > Amit > > > > > Berend van Wachem > se> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: DA question > > > 04/09/2008 02:59 > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > Dear Amit, > > Could you explain how the two grids are attached? > I am using multiple DA's for multiple structured grids glued together. > I've done the gluing with setting up various IS objects. From the > multiple DA's, one global variable vector is formed. Is that what you > are looking for? > > Best regards, > > Berend. 
> > > Amit.Itagi at seagate.com wrote: >> Hi, >> >> Is it possible to use DA to perform finite differences on two staggered >> regular grids (as in the electromagnetic finite difference time domain >> method) ? Surrounding nodes from one grid are used to update the value in >> the dual grid. In addition, local manipulations need to be done on the >> nodal values. >> >> Thanks >> >> Rgds, >> Amit >> > > > From jinzishuai at yahoo.com Wed Apr 9 15:25:30 2008 From: jinzishuai at yahoo.com (Shi Jin) Date: Wed, 9 Apr 2008 13:25:30 -0700 (PDT) Subject: Further question about PC with Jaocbi Row Sum Message-ID: <780632.38038.qm@web36201.mail.mud.yahoo.com> Thank you very much. > > Is there something particular about this rowsum method? > > No. If you use a -ksp_rtol of 1.e-12 and still get different > answers, this needs to be investigated. > > I have tried even with -ksp_rtol 1.e-20 but still got different results. Here is what I got when solving the mass matrix with -pc_type jacobi -pc_jacobi_rowsum 1 -ksp_type cg -sub_pc_type icc -ksp_rtol 1.e-20 -ksp_monitor -ksp_view 0 KSP Residual norm 2.975203858623e+00 1 KSP Residual norm 2.674371671721e-01 2 KSP Residual norm 1.841074927355e-01 KSP Object: type: cg maximum iterations=10000, initial guess is zero tolerances: relative=1e-20, absolute=1e-50, divergence=10000 left preconditioning PC Object: type: jacobi linear system matrix = precond matrix: Matrix Object: type=seqaij, rows=8775, cols=8775 total: nonzeros=214591, allocated nonzeros=214591 not using I-node routines I realize that the iteration ended when the residual norm is quite large. Do you think this indicates something wrong here? Thank you again. Shi __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From knepley at gmail.com Wed Apr 9 15:50:29 2008 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 9 Apr 2008 15:50:29 -0500 Subject: Further question about PC with Jaocbi Row Sum In-Reply-To: <780632.38038.qm@web36201.mail.mud.yahoo.com> References: <780632.38038.qm@web36201.mail.mud.yahoo.com> Message-ID: On Wed, Apr 9, 2008 at 3:25 PM, Shi Jin wrote: > Thank you very much. > > > > > > Is there something particular about this rowsum method? > > > > No. If you use a -ksp_rtol of 1.e-12 and still get different > > answers, this needs to be investigated. > > > > > > I have tried even with -ksp_rtol 1.e-20 but still got different results. > > Here is what I got when solving the mass matrix with > > -pc_type jacobi > -pc_jacobi_rowsum 1 > -ksp_type cg > -sub_pc_type icc > -ksp_rtol 1.e-20 > -ksp_monitor > -ksp_view > > 0 KSP Residual norm 2.975203858623e+00 > 1 KSP Residual norm 2.674371671721e-01 > 2 KSP Residual norm 1.841074927355e-01 > KSP Object: > type: cg > maximum iterations=10000, initial guess is zero > tolerances: relative=1e-20, absolute=1e-50, divergence=10000 > left preconditioning > PC Object: > type: jacobi > linear system matrix = precond matrix: > Matrix Object: > type=seqaij, rows=8775, cols=8775 > total: nonzeros=214591, allocated nonzeros=214591 > not using I-node routines > > I realize that the iteration ended when the residual norm is quite large. > Do you think this indicates something wrong here? Can you run with -ksp_converged_reason It appears that the solve fails rather than terminates with an answer. Is it possible that your matrix is not SPD? Matt > Thank you again. > > Shi > > > > __________________________________________________ > Do You Yahoo!? 
> Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From Amit.Itagi at seagate.com Wed Apr 9 16:06:18 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 9 Apr 2008 17:06:18 -0400 Subject: DA question In-Reply-To: <47FD2297.1010602@gmail.com> Message-ID: Randy, I guess, since you are doing a frequency domain calculation, you eventually end up with a single matrix equation. I am planning to work in the time domain. Will that change things ? Thanks Rgds, Amit Randall Mackie To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: DA question 04/09/2008 04:09 PM Please respond to petsc-users at mcs.a nl.gov Hi Amit, Why do you need two staggered grids? I do EM finite difference frequency domain modeling on a staggered grid using just one DA. Works perfectly fine. There are some grid points that are not used, but you just set them to zero and put a 1 on the diagonal of the coefficient matrix. Randy Amit.Itagi at seagate.com wrote: > Hi Berend, > > A detailed explanation of the finite difference scheme is given here : > > http://en.wikipedia.org/wiki/Finite-difference_time-domain_method > > > Thanks > > Rgds, > Amit > > > > > Berend van Wachem > se> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: DA question > > > 04/09/2008 02:59 > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > Dear Amit, > > Could you explain how the two grids are attached? > I am using multiple DA's for multiple structured grids glued together. > I've done the gluing with setting up various IS objects. From the > multiple DA's, one global variable vector is formed. Is that what you > are looking for? > > Best regards, > > Berend. > > > Amit.Itagi at seagate.com wrote: >> Hi, >> >> Is it possible to use DA to perform finite differences on two staggered >> regular grids (as in the electromagnetic finite difference time domain >> method) ? Surrounding nodes from one grid are used to update the value in >> the dual grid. In addition, local manipulations need to be done on the >> nodal values. >> >> Thanks >> >> Rgds, >> Amit >> > > > From bsmith at mcs.anl.gov Wed Apr 9 16:35:34 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 9 Apr 2008 16:35:34 -0500 Subject: about MatMatMultTranspose_seqdense_seqdense() In-Reply-To: <7ff0ee010804091002u69766386gfb1b31c1ede7b755@mail.gmail.com> References: <7ff0ee010804091002u69766386gfb1b31c1ede7b755@mail.gmail.com> Message-ID: Please send us a compilable code that reproduces this problem to petsc-maint at mcs.anl.gov Barry I cannot reproduce it. On Apr 9, 2008, at 12:02 PM, Yujie wrote: > hi, everyone > > My codes are as follows: > ierr=MatGetSubMatrices(tempM_mat, > 1,&is_row,&is_col,MAT_INITIAL_MATRIX,&tempA_mat); CHKERRQ(ierr); > A_mat=*tempA_mat; > ierr=MatDestroy(tempM_mat);CHKERRQ(ierr); > ierr=MatGetSize(A_mat,&M,&N);CHKERRQ(ierr); > //AtA > > ierr > =MatMatMultTranspose(A_mat,A_mat,MAT_INITIAL_MATRIX,fill,&AtA_mat); > > I get a seqdense submatrix "A_mat" by MatGetSubMatrices(). I further > get At*A by MatMatMultTranspose(). However, I meet an error: > " ** On entry to DGEMM parameter number 8 had an illegal value" > > I debug my codes. 
In MatMatMultTranspose_seqdense_seqdense(), the > codes call > "BLASgemm_("T","N",&m,&n,&k,&_DOne,a->v,&a->lda,b->v,&b- > >lda,&_DZero,c->v,&c->lda);" > I don't know the meaning of the 8th parameters"&a->lda". In my > codes, its value is "0". > > Are there any problems in my codes? could you give me some advice? > thanks a lot. > > Regards, > Yujie From sdettrick at gmail.com Wed Apr 9 16:36:05 2008 From: sdettrick at gmail.com (Sean Dettrick) Date: Wed, 9 Apr 2008 17:36:05 -0400 Subject: DA question In-Reply-To: References: <47FD2297.1010602@gmail.com> Message-ID: <44114ec40804091436o25657b1eua89cba52848d5717@mail.gmail.com> To elaborate on Matt's suggestion, a staggered grid/Yee mesh code could use a single DA with one degree-of-freedom per component of H and E. The extra overlap required for staggered guard cells at the domain boundaries could be dealt with by having a bigger-than-usual stencil width. For the 2nd order 3D case, this suggests the DACreate3d routine would have arguments dof=6, s=2, and stencil_type=DA_STENCIL_STAR. It is just a suggestion - I have not tried it. Sean On Wed, Apr 9, 2008 at 5:06 PM, wrote: > Randy, > > I guess, since you are doing a frequency domain calculation, you eventually > end up with a single matrix equation. > > I am planning to work in the time domain. Will that change things ? > > Thanks > > Rgds, > Amit > > > > > Randall Mackie > l.com> To > > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: DA question > > > 04/09/2008 04:09 > > > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > Hi Amit, > > Why do you need two staggered grids? I do EM finite difference frequency > domain modeling on a staggered grid using just one DA. Works perfectly > fine. > There are some grid points that are not used, but you just set them to zero > and put a 1 on the diagonal of the coefficient matrix. > > > Randy > > > Amit.Itagi at seagate.com wrote: > > Hi Berend, > > > > A detailed explanation of the finite difference scheme is given here : > > > > http://en.wikipedia.org/wiki/Finite-difference_time-domain_method > > > > > > Thanks > > > > Rgds, > > Amit > > > > > > > > > > > Berend van Wachem > > > > > se> > To > > Sent by: petsc-users at mcs.anl.gov > > > owner-petsc-users > cc > > @mcs.anl.gov > > > No Phone Info > Subject > > Available Re: DA question > > > > > > > > > 04/09/2008 02:59 > > > PM > > > > > > > > > Please respond to > > > petsc-users at mcs.a > > > nl.gov > > > > > > > > > > > > > > > > > Dear Amit, > > > > Could you explain how the two grids are attached? > > I am using multiple DA's for multiple structured grids glued together. > > I've done the gluing with setting up various IS objects. From the > > multiple DA's, one global variable vector is formed. Is that what you > > are looking for? > > > > Best regards, > > > > Berend. > > > > > > Amit.Itagi at seagate.com wrote: > >> Hi, > >> > >> Is it possible to use DA to perform finite differences on two staggered > >> regular grids (as in the electromagnetic finite difference time domain > >> method) ? Surrounding nodes from one grid are used to update the value > in > >> the dual grid. In addition, local manipulations need to be done on the > >> nodal values. 
> >> > >> Thanks > >> > >> Rgds, > >> Amit > >> > > > > > > > > > > From jed at 59A2.org Wed Apr 9 16:40:06 2008 From: jed at 59A2.org (Jed Brown) Date: Wed, 9 Apr 2008 23:40:06 +0200 Subject: Forming a sparse approximation of a MatShell In-Reply-To: References: <20080409191337.GA6137@brakk.ethz.ch> Message-ID: <20080409214006.GB6137@brakk.ethz.ch> On Wed 2008-04-09 15:10, Barry Smith wrote: > > Jed, > > The Mat coloring code can also be used directly, not through SNES. Once > you have > the coloring for the matrix (you can get that with MatGetColoring(), of > course, this assumes > you have already set a nonzero pattern for your matrix)). Call > MatFDColoringCreate() > then MatFDColoringSetFunction(), MatFDColoringSetFromOptions() and then > MatFDColoringApply(). Cool, I tried this and I can confirm that it is generating the correct matrix (by comparing entries with the output of -snes_fd) but unfortunately the matrix entries of the spectral operator corresponding to neighbors are actually not a very good approximation of the full operator. Bummer. It looks like I'm stuck with formulating the problem twice, once for the spectral operators and once for the FD/FE preconditioner. Thanks for the help. Jed > On Apr 9, 2008, at 2:13 PM, Jed Brown wrote: >> I'm trying to improve the preconditioning of my spectral collocation >> method for >> non-Newtonian incompressible Stokes flow. My current algorithm uses >> MatShell >> for the full Jacobian as well as each of its blocks [A B1'; B2 0] and the >> Schur >> complement S = -B2*A*B1'. I needed a preconditioner for A so I thought >> I'd >> solve the same problem using finite differences on the Chebyshev nodes. >> In >> reality, the stencil is really ugly in 3D so I just used a simpler >> elliptic >> operator. This works okay, but it's performance decays significantly as I >> increase the continuation parameter. Also, dealing with general boundary >> conditions is rather tricky and it seems to be a much weaker >> preconditioner when >> I have mixed boundary conditions. To rectify this, I tried a finite >> element >> discretization on the Chebyshev nodes (using Q1 elements). This must be >> scaled >> by the inverse (lumped) mass matrix due to the collocation nature of the >> spectral method. Strangely, even though it captures all the terms in the >> Jacobian, it is slightly weaker than the finite difference version. At >> least it >> is less error-prone and boundary conditions are easier to get right. >> Regardless, forming the explicit matrix separately from the spectral >> matrix >> causes a duplication of concepts that have to be kept in sync. So I >> started >> thinking, the spectral matrix is pretty cheap to apply a few times, so >> perhaps I >> can use a coloring to compute a sparse approximation. However, the >> documentation I found is using the function from the SNES context to form >> the >> matrix. In my case, the entire Jacobian doesn't help, I just want an >> approximation of A. (A itself is full, but implemented via FFT.) What is >> the >> correct way to do this? Should I just stick with finite differences or >> finite >> elements? >> >> Also, any ideas for preconditioning S? It's condition number also grows >> significantly with the continuation parameter. >> >> Thanks, >> >> Jed > -------------- next part -------------- A non-text attachment was scrubbed... 
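For reference, a minimal sketch of the coloring sequence Barry describes above, written against the PETSc C interface of that era (the matrix J is assumed to already have its nonzero pattern set; MyResidual, its contexts, and the coloring type are placeholders, and the exact calling sequences of MatFDColoringSetFunction()/MatFDColoringApply() vary between PETSc versions, so they should be checked against the man pages of the installed release):

  #include "petscmat.h"

  /* placeholder residual evaluation; the first and last arguments are the contexts
     handed to MatFDColoringApply() and MatFDColoringSetFunction() respectively */
  extern PetscErrorCode MyResidual(void*,Vec,Vec,void*);

  PetscErrorCode BuildFDJacobian(Mat J, Vec x, void *ctx)
  {
    ISColoring     iscoloring;
    MatFDColoring  fdcoloring;
    MatStructure   flag;
    PetscErrorCode ierr;

    /* color the already preallocated sparse matrix J (any coloring type will do) */
    ierr = MatGetColoring(J, MATCOLORING_SL, &iscoloring); CHKERRQ(ierr);
    ierr = MatFDColoringCreate(J, iscoloring, &fdcoloring); CHKERRQ(ierr);
    ierr = ISColoringDestroy(iscoloring); CHKERRQ(ierr);

    /* register the function to be differenced and pick up any -mat_fd_* options */
    ierr = MatFDColoringSetFunction(fdcoloring,
             (PetscErrorCode (*)(void)) MyResidual, ctx); CHKERRQ(ierr);
    ierr = MatFDColoringSetFromOptions(fdcoloring); CHKERRQ(ierr);

    /* fill J with the finite-difference approximation of the Jacobian at x */
    ierr = MatFDColoringApply(J, fdcoloring, x, &flag, ctx); CHKERRQ(ierr);

    ierr = MatFDColoringDestroy(fdcoloring); CHKERRQ(ierr);
    return 0;
  }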
Name: not available Type: application/pgp-signature Size: 197 bytes Desc: not available URL: From rlmackie862 at gmail.com Wed Apr 9 17:44:15 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Wed, 09 Apr 2008 15:44:15 -0700 Subject: DA question In-Reply-To: <44114ec40804091436o25657b1eua89cba52848d5717@mail.gmail.com> References: <47FD2297.1010602@gmail.com> <44114ec40804091436o25657b1eua89cba52848d5717@mail.gmail.com> Message-ID: <47FD46BF.3050901@gmail.com> Amit, I have a staggered grid with H defined along the edges and E as normals across the block faces. So if you have l x m x n blocks, then you need to define your DA as l+1, m+1, n+1, to handle the extra grid point you need for the staggered grid. I use 3 degrees of freedom (for Hx, Hy, and Hz), and all my local calculations just need the box stencil. Randy Sean Dettrick wrote: > To elaborate on Matt's suggestion, a staggered grid/Yee mesh code > could use a single DA with one degree-of-freedom per component of H > and E. The extra overlap required for staggered guard cells at the > domain boundaries could be dealt with by having a bigger-than-usual > stencil width. For the 2nd order 3D case, this suggests the > DACreate3d routine would have arguments dof=6, s=2, and > stencil_type=DA_STENCIL_STAR. > > It is just a suggestion - I have not tried it. > > Sean > > On Wed, Apr 9, 2008 at 5:06 PM, wrote: >> Randy, >> >> I guess, since you are doing a frequency domain calculation, you eventually >> end up with a single matrix equation. >> >> I am planning to work in the time domain. Will that change things ? >> >> Thanks >> >> Rgds, >> Amit >> >> >> >> >> Randall Mackie >> > l.com> To >> >> Sent by: petsc-users at mcs.anl.gov >> owner-petsc-users cc >> @mcs.anl.gov >> No Phone Info Subject >> Available Re: DA question >> >> >> 04/09/2008 04:09 >> >> >> PM >> >> >> Please respond to >> petsc-users at mcs.a >> nl.gov >> >> >> >> >> >> >> Hi Amit, >> >> Why do you need two staggered grids? I do EM finite difference frequency >> domain modeling on a staggered grid using just one DA. Works perfectly >> fine. >> There are some grid points that are not used, but you just set them to zero >> and put a 1 on the diagonal of the coefficient matrix. >> >> >> Randy >> >> >> Amit.Itagi at seagate.com wrote: >> > Hi Berend, >> > >> > A detailed explanation of the finite difference scheme is given here : >> > >> > http://en.wikipedia.org/wiki/Finite-difference_time-domain_method >> > >> > >> > Thanks >> > >> > Rgds, >> > Amit >> > >> > >> > >> > >> >> > Berend van Wachem >> >> > > >> > se> >> To >> > Sent by: petsc-users at mcs.anl.gov >> >> > owner-petsc-users >> cc >> > @mcs.anl.gov >> >> > No Phone Info >> Subject >> > Available Re: DA question >> >> > >> >> > >> >> > 04/09/2008 02:59 >> >> > PM >> >> > >> >> > >> >> > Please respond to >> >> > petsc-users at mcs.a >> >> > nl.gov >> >> > >> >> > >> >> > >> > >> > >> > >> > Dear Amit, >> > >> > Could you explain how the two grids are attached? >> > I am using multiple DA's for multiple structured grids glued together. >> > I've done the gluing with setting up various IS objects. From the >> > multiple DA's, one global variable vector is formed. Is that what you >> > are looking for? >> > >> > Best regards, >> > >> > Berend. >> > >> > >> > Amit.Itagi at seagate.com wrote: >> >> Hi, >> >> >> >> Is it possible to use DA to perform finite differences on two staggered >> >> regular grids (as in the electromagnetic finite difference time domain >> >> method) ? 
Surrounding nodes from one grid are used to update the value >> in >> >> the dual grid. In addition, local manipulations need to be done on the >> >> nodal values. >> >> >> >> Thanks >> >> >> >> Rgds, >> >> Amit >> >> >> > >> > >> > >> >> >> >> > From jinzishuai at yahoo.com Thu Apr 10 00:04:03 2008 From: jinzishuai at yahoo.com (Shi Jin) Date: Wed, 9 Apr 2008 22:04:03 -0700 (PDT) Subject: Further question about PC with Jaocbi Row Sum Message-ID: <149510.10833.qm@web36208.mail.mud.yahoo.com> Thank you. I have used the -ksp_converged_reason option. The result says: Linear solve did not converge due to DIVERGED_INDEFINITE_PC iterations 2 I then further checked the row sum matrix, it has negative eigenvalues. So I guess it does not work at all. Thank you all for your help. -- Shi Jin, PhD ----- Original Message ---- > From: Matthew Knepley > To: petsc-users at mcs.anl.gov > Sent: Wednesday, April 9, 2008 2:50:29 PM > Subject: Re: Further question about PC with Jaocbi Row Sum > > On Wed, Apr 9, 2008 at 3:25 PM, Shi Jin wrote: > > Thank you very much. > > > > > > > > > > Is there something particular about this rowsum method? > > > > > > No. If you use a -ksp_rtol of 1.e-12 and still get different > > > answers, this needs to be investigated. > > > > > > > > > > I have tried even with -ksp_rtol 1.e-20 but still got different results. > > > > Here is what I got when solving the mass matrix with > > > > -pc_type jacobi > > -pc_jacobi_rowsum 1 > > -ksp_type cg > > -sub_pc_type icc > > -ksp_rtol 1.e-20 > > -ksp_monitor > > -ksp_view > > > > 0 KSP Residual norm 2.975203858623e+00 > > 1 KSP Residual norm 2.674371671721e-01 > > 2 KSP Residual norm 1.841074927355e-01 > > KSP Object: > > type: cg > > maximum iterations=10000, initial guess is zero > > tolerances: relative=1e-20, absolute=1e-50, divergence=10000 > > left preconditioning > > PC Object: > > type: jacobi > > linear system matrix = precond matrix: > > Matrix Object: > > type=seqaij, rows=8775, cols=8775 > > total: nonzeros=214591, allocated nonzeros=214591 > > not using I-node routines > > > > I realize that the iteration ended when the residual norm is quite large. > > Do you think this indicates something wrong here? > > Can you run with > > -ksp_converged_reason > > It appears that the solve fails rather than terminates with an answer. Is it > possible that your matrix is not SPD? > > Matt > > > Thank you again. > > > > Shi > > > > > > > > __________________________________________________ > > Do You Yahoo!? > > Tired of spam? Yahoo! Mail has the best spam protection around > > http://mail.yahoo.com > > > > > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From Amit.Itagi at seagate.com Thu Apr 10 08:10:34 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Thu, 10 Apr 2008 09:10:34 -0400 Subject: DA question In-Reply-To: <47FD46BF.3050901@gmail.com> Message-ID: Randy/Sean/Matt, Thanks for the suggestions. I will try to implement the algorithm on the suggested lines. 
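For reference, a rough sketch of the single-DA layout Randy describes above (one extra point in each direction, 3 degrees of freedom for Hx/Hy/Hz, box stencil), assuming the DACreate3d() calling sequence of that PETSc generation; l, m, n are the block counts and all names are placeholders. Sean's variant would instead pass dof=6 and a width-2 DA_STENCIL_STAR stencil:

  #include "petscda.h"

  /* one distributed array holds the whole staggered (Yee-type) grid */
  PetscErrorCode CreateStaggeredDA(PetscInt l, PetscInt m, PetscInt n, DA *da)
  {
    PetscErrorCode ierr;

    /* l x m x n blocks -> (l+1) x (m+1) x (n+1) grid points, 3 dof, stencil width 1 */
    ierr = DACreate3d(PETSC_COMM_WORLD, DA_NONPERIODIC, DA_STENCIL_BOX,
                      l+1, m+1, n+1,
                      PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                      3, 1, PETSC_NULL, PETSC_NULL, PETSC_NULL, da); CHKERRQ(ierr);
    return 0;
  }

Ghost values for the local stencil updates would then come from DACreateGlobalVector() followed by DAGlobalToLocalBegin()/DAGlobalToLocalEnd(), and unused staggered points can be handled the way Randy suggests, by setting them to zero and putting a 1 on the diagonal of the coefficient matrix.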
Rgds, Amit Randall Mackie To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: DA question 04/09/2008 06:44 PM Please respond to petsc-users at mcs.a nl.gov Amit, I have a staggered grid with H defined along the edges and E as normals across the block faces. So if you have l x m x n blocks, then you need to define your DA as l+1, m+1, n+1, to handle the extra grid point you need for the staggered grid. I use 3 degrees of freedom (for Hx, Hy, and Hz), and all my local calculations just need the box stencil. Randy Sean Dettrick wrote: > To elaborate on Matt's suggestion, a staggered grid/Yee mesh code > could use a single DA with one degree-of-freedom per component of H > and E. The extra overlap required for staggered guard cells at the > domain boundaries could be dealt with by having a bigger-than-usual > stencil width. For the 2nd order 3D case, this suggests the > DACreate3d routine would have arguments dof=6, s=2, and > stencil_type=DA_STENCIL_STAR. > > It is just a suggestion - I have not tried it. > > Sean > > On Wed, Apr 9, 2008 at 5:06 PM, wrote: >> Randy, >> >> I guess, since you are doing a frequency domain calculation, you eventually >> end up with a single matrix equation. >> >> I am planning to work in the time domain. Will that change things ? >> >> Thanks >> >> Rgds, >> Amit >> >> >> >> >> Randall Mackie >> > l.com> To >> >> Sent by: petsc-users at mcs.anl.gov >> owner-petsc-users cc >> @mcs.anl.gov >> No Phone Info Subject >> Available Re: DA question >> >> >> 04/09/2008 04:09 >> >> >> PM >> >> >> Please respond to >> petsc-users at mcs.a >> nl.gov >> >> >> >> >> >> >> Hi Amit, >> >> Why do you need two staggered grids? I do EM finite difference frequency >> domain modeling on a staggered grid using just one DA. Works perfectly >> fine. >> There are some grid points that are not used, but you just set them to zero >> and put a 1 on the diagonal of the coefficient matrix. >> >> >> Randy >> >> >> Amit.Itagi at seagate.com wrote: >> > Hi Berend, >> > >> > A detailed explanation of the finite difference scheme is given here : >> > >> > http://en.wikipedia.org/wiki/Finite-difference_time-domain_method >> > >> > >> > Thanks >> > >> > Rgds, >> > Amit >> > >> > >> > >> > >> >> > Berend van Wachem >> >> > > >> > se> >> To >> > Sent by: petsc-users at mcs.anl.gov >> >> > owner-petsc-users >> cc >> > @mcs.anl.gov >> >> > No Phone Info >> Subject >> > Available Re: DA question >> >> > >> >> > >> >> > 04/09/2008 02:59 >> >> > PM >> >> > >> >> > >> >> > Please respond to >> >> > petsc-users at mcs.a >> >> > nl.gov >> >> > >> >> > >> >> > >> > >> > >> > >> > Dear Amit, >> > >> > Could you explain how the two grids are attached? >> > I am using multiple DA's for multiple structured grids glued together. >> > I've done the gluing with setting up various IS objects. From the >> > multiple DA's, one global variable vector is formed. Is that what you >> > are looking for? >> > >> > Best regards, >> > >> > Berend. >> > >> > >> > Amit.Itagi at seagate.com wrote: >> >> Hi, >> >> >> >> Is it possible to use DA to perform finite differences on two staggered >> >> regular grids (as in the electromagnetic finite difference time domain >> >> method) ? Surrounding nodes from one grid are used to update the value >> in >> >> the dual grid. In addition, local manipulations need to be done on the >> >> nodal values. 
>> >> >> >> Thanks >> >> >> >> Rgds, >> >> Amit >> >> >> > >> > >> > >> >> >> >> > From hzhang at mcs.anl.gov Thu Apr 10 09:01:13 2008 From: hzhang at mcs.anl.gov (Hong Zhang) Date: Thu, 10 Apr 2008 09:01:13 -0500 (CDT) Subject: Further question about PC with Jaocbi Row Sum In-Reply-To: <149510.10833.qm@web36208.mail.mud.yahoo.com> References: <149510.10833.qm@web36208.mail.mud.yahoo.com> Message-ID: Then you may try direct sparse linear solver, sequential run: -ksp_type preonly -pc_type cholesky parallel run (install external packages superlu_dist or mumps): -ksp_type preonly -pc_type lu -mat_type superlu_dist or -ksp_type preonly -pc_type cholesky -mat_type sbaijmumps Hong On Wed, 9 Apr 2008, Shi Jin wrote: > > Thank you. I have used the -ksp_converged_reason option. > The result says: > Linear solve did not converge due to DIVERGED_INDEFINITE_PC iterations 2 > I then further checked the row sum matrix, it has negative eigenvalues. > So I guess it does not work at all. > Thank you all for your help. > > -- > Shi Jin, PhD > > ----- Original Message ---- > > From: Matthew Knepley > > To: petsc-users at mcs.anl.gov > > Sent: Wednesday, April 9, 2008 2:50:29 PM > > Subject: Re: Further question about PC with Jaocbi Row Sum > > > > On Wed, Apr 9, 2008 at 3:25 PM, Shi Jin wrote: > > > Thank you very much. > > > > > > > > > > > > > > Is there something particular about this rowsum method? > > > > > > > > No. If you use a -ksp_rtol of 1.e-12 and still get different > > > > answers, this needs to be investigated. > > > > > > > > > > > > > > I have tried even with -ksp_rtol 1.e-20 but still got different results. > > > > > > Here is what I got when solving the mass matrix with > > > > > > -pc_type jacobi > > > -pc_jacobi_rowsum 1 > > > -ksp_type cg > > > -sub_pc_type icc > > > -ksp_rtol 1.e-20 > > > -ksp_monitor > > > -ksp_view > > > > > > 0 KSP Residual norm 2.975203858623e+00 > > > 1 KSP Residual norm 2.674371671721e-01 > > > 2 KSP Residual norm 1.841074927355e-01 > > > KSP Object: > > > type: cg > > > maximum iterations=10000, initial guess is zero > > > tolerances: relative=1e-20, absolute=1e-50, divergence=10000 > > > left preconditioning > > > PC Object: > > > type: jacobi > > > linear system matrix = precond matrix: > > > Matrix Object: > > > type=seqaij, rows=8775, cols=8775 > > > total: nonzeros=214591, allocated nonzeros=214591 > > > not using I-node routines > > > > > > I realize that the iteration ended when the residual norm is quite large. > > > Do you think this indicates something wrong here? > > > > Can you run with > > > > -ksp_converged_reason > > > > It appears that the solve fails rather than terminates with an answer. Is it > > possible that your matrix is not SPD? > > > > Matt > > > > > Thank you again. > > > > > > Shi > > > > > > > > > > > > __________________________________________________ > > > Do You Yahoo!? > > > Tired of spam? Yahoo! Mail has the best spam protection around > > > http://mail.yahoo.com > > > > > > > > > > > > > > -- > > What most experimenters take for granted before they begin their > > experiments is infinitely more interesting than any results to which > > their experiments lead. > > -- Norbert Wiener > > > > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! 
Mail has the best spam protection around > http://mail.yahoo.com > > From bsmith at mcs.anl.gov Thu Apr 10 11:39:21 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Thu, 10 Apr 2008 11:39:21 -0500 Subject: Further question about PC with Jaocbi Row Sum In-Reply-To: <149510.10833.qm@web36208.mail.mud.yahoo.com> References: <149510.10833.qm@web36208.mail.mud.yahoo.com> Message-ID: the row sum option assumes that all the entries of the matrix are positive; this is true to linear elements and mass matrices. If you have negative entries in your mass matrix then I would not trust any kind of mass lumping as a preconditioner. Barry On Apr 10, 2008, at 12:04 AM, Shi Jin wrote: > > Thank you. I have used the -ksp_converged_reason option. > The result says: > Linear solve did not converge due to DIVERGED_INDEFINITE_PC > iterations 2 > I then further checked the row sum matrix, it has negative > eigenvalues. > So I guess it does not work at all. > Thank you all for your help. > > -- > Shi Jin, PhD > > ----- Original Message ---- >> From: Matthew Knepley >> To: petsc-users at mcs.anl.gov >> Sent: Wednesday, April 9, 2008 2:50:29 PM >> Subject: Re: Further question about PC with Jaocbi Row Sum >> >> On Wed, Apr 9, 2008 at 3:25 PM, Shi Jin wrote: >>> Thank you very much. >>> >>> >>> >>>>> Is there something particular about this rowsum method? >>>> >>>> No. If you use a -ksp_rtol of 1.e-12 and still get different >>>> answers, this needs to be investigated. >>>> >>>> >>> >>> I have tried even with -ksp_rtol 1.e-20 but still got different >>> results. >>> >>> Here is what I got when solving the mass matrix with >>> >>> -pc_type jacobi >>> -pc_jacobi_rowsum 1 >>> -ksp_type cg >>> -sub_pc_type icc >>> -ksp_rtol 1.e-20 >>> -ksp_monitor >>> -ksp_view >>> >>> 0 KSP Residual norm 2.975203858623e+00 >>> 1 KSP Residual norm 2.674371671721e-01 >>> 2 KSP Residual norm 1.841074927355e-01 >>> KSP Object: >>> type: cg >>> maximum iterations=10000, initial guess is zero >>> tolerances: relative=1e-20, absolute=1e-50, divergence=10000 >>> left preconditioning >>> PC Object: >>> type: jacobi >>> linear system matrix = precond matrix: >>> Matrix Object: >>> type=seqaij, rows=8775, cols=8775 >>> total: nonzeros=214591, allocated nonzeros=214591 >>> not using I-node routines >>> >>> I realize that the iteration ended when the residual norm is quite >>> large. >>> Do you think this indicates something wrong here? >> >> Can you run with >> >> -ksp_converged_reason >> >> It appears that the solve fails rather than terminates with an >> answer. Is it >> possible that your matrix is not SPD? >> >> Matt >> >>> Thank you again. >>> >>> Shi >>> >>> >>> >>> __________________________________________________ >>> Do You Yahoo!? >>> Tired of spam? Yahoo! Mail has the best spam protection around >>> http://mail.yahoo.com >>> >>> >> >> >> >> -- >> What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which >> their experiments lead. >> -- Norbert Wiener >> >> > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > From nliu at fit.edu Thu Apr 10 23:28:13 2008 From: nliu at fit.edu (Ningyu Liu) Date: Fri, 11 Apr 2008 00:28:13 -0400 (EDT) Subject: Question on DA Message-ID: <51333.68.202.24.62.1207888093.squirrel@webaccess.fit.edu> Hello, I have a question on DA. 
If I create two DAs using DACreate2D() with the same input except different degrees of freedom, will they share the same communication information. If not, how can I create two DAs corresponding to the same structured grid and communication information but different degrees of freedom. Thank you very much! Ningyu From bsmith at mcs.anl.gov Fri Apr 11 08:24:36 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 11 Apr 2008 08:24:36 -0500 Subject: Question on DA In-Reply-To: <51333.68.202.24.62.1207888093.squirrel@webaccess.fit.edu> References: <51333.68.202.24.62.1207888093.squirrel@webaccess.fit.edu> Message-ID: <3041331F-0906-4076-916A-B120ED60F506@mcs.anl.gov> The default layouts of "grid points" is independent of the number of degree's of freedom per point so each process will get the same "patch" for both DA's. If you are worried about the two DA's having some duplicate information that wastes memory (information that could be shared between the two), don't; the amount of excess data is very small relative to everything else in the code and is not worth worrying about. Barry On Apr 10, 2008, at 11:28 PM, Ningyu Liu wrote: > Hello, > > I have a question on DA. If I create two DAs using DACreate2D() with > the > same input except different degrees of freedom, will they share the > same > communication information. If not, how can I create two DAs > corresponding > to the same structured grid and communication information but > different > degrees of freedom. Thank you very much! > > Ningyu > From jinzishuai at yahoo.com Fri Apr 11 15:56:56 2008 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 11 Apr 2008 13:56:56 -0700 (PDT) Subject: Further question about PC with Jaocbi Row Sum Message-ID: <548693.2032.qm@web36202.mail.mud.yahoo.com> Thank you. Suppose I have a diagonal matrix, what is the best way to invert it in PETSc? Do I have to install the external packages superlu_dist or mumps? I realized that LU or Cholesky decomposition does not work with MPIAIJ matrices. I also know the best way is probably to directly call Vector operations directly. However, I want to keep the same KSPSolve structure so that the same code can be used for non-diagonal MPIAIJ matrices without changing each call to KSPSolve. Thank you very much. Shi > Then you may try direct sparse linear solver, > sequential run: > -ksp_type preonly -pc_type cholesky > parallel run (install external packages superlu_dist or mumps): > -ksp_type preonly -pc_type lu -mat_type superlu_dist > or > -ksp_type preonly -pc_type cholesky -mat_type sbaijmumps > > Hong > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From hzhang at mcs.anl.gov Fri Apr 11 16:19:20 2008 From: hzhang at mcs.anl.gov (Hong Zhang) Date: Fri, 11 Apr 2008 16:19:20 -0500 (CDT) Subject: Further question about PC with Jaocbi Row Sum In-Reply-To: <548693.2032.qm@web36202.mail.mud.yahoo.com> References: <548693.2032.qm@web36202.mail.mud.yahoo.com> Message-ID: Shi, > Suppose I have a diagonal matrix, what is the best way to invert it in PETSc? > Do I have to install the external packages superlu_dist or mumps? > I realized that LU or Cholesky decomposition does not work with MPIAIJ matrices. > I also know the best way is probably to directly call Vector operations directly. > However, I want to keep the same KSPSolve structure so that the same code can be used for non-diagonal MPIAIJ matrices without changing each call to KSPSolve. > Thank you very much. 
Without changing your application code, i.e., keep the same KSPSolve structure, running it with the option '-pc_type jacobi' actually inverts the diagonal matrix, in both sequential and parallel cases. Install external packages superlu_dist or mumps, then run your code in sequential or parallel with -ksp_type preonly -pc_type lu -mat_type superlu_dist (work with mpiaij matrix) >> or >> -ksp_type preonly -pc_type cholesky -mat_type sbaijmumps (work with mpisbaij matrix format). Hong > > Shi >> Then you may try direct sparse linear solver, >> sequential run: >> -ksp_type preonly -pc_type cholesky >> parallel run (install external packages superlu_dist or mumps): >> -ksp_type preonly -pc_type lu -mat_type superlu_dist >> or >> -ksp_type preonly -pc_type cholesky -mat_type sbaijmumps >> >> Hong >> > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > > From bsmith at mcs.anl.gov Fri Apr 11 16:04:12 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 11 Apr 2008 16:04:12 -0500 Subject: Further question about PC with Jaocbi Row Sum In-Reply-To: <548693.2032.qm@web36202.mail.mud.yahoo.com> References: <548693.2032.qm@web36202.mail.mud.yahoo.com> Message-ID: <6BB577AC-01F8-4D52-B13C-82863778450E@mcs.anl.gov> There is no super easy way to do this that I can think of. The diagonal cases you can run with -pc_type jacobi and the nondiagonal with -pc_type lu (or Cholesky) I realize this is not exactly what you want. Barry On Apr 11, 2008, at 3:56 PM, Shi Jin wrote: > Thank you. > Suppose I have a diagonal matrix, what is the best way to invert it > in PETSc? > Do I have to install the external packages superlu_dist or mumps? > I realized that LU or Cholesky decomposition does not work with > MPIAIJ matrices. > I also know the best way is probably to directly call Vector > operations directly. > However, I want to keep the same KSPSolve structure so that the same > code can be used for non-diagonal MPIAIJ matrices without changing > each call to KSPSolve. > Thank you very much. > > Shi >> Then you may try direct sparse linear solver, >> sequential run: >> -ksp_type preonly -pc_type cholesky >> parallel run (install external packages superlu_dist or mumps): >> -ksp_type preonly -pc_type lu -mat_type superlu_dist >> or >> -ksp_type preonly -pc_type cholesky -mat_type sbaijmumps >> >> Hong >> > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > From jinzishuai at yahoo.com Fri Apr 11 16:40:10 2008 From: jinzishuai at yahoo.com (Shi Jin) Date: Fri, 11 Apr 2008 14:40:10 -0700 (PDT) Subject: Further question about PC with Jaocbi Row Sum Message-ID: <932872.39873.qm@web36207.mail.mud.yahoo.com> Thank you very much. -pc_type jacobi -ksp_type preonly does exactly what I want, even in parallel. Shi ----- Original Message ---- > From: Hong Zhang > >Suppose I have a diagonal matrix, what is the best way to invert it in PETSc? > > Do I have to install the external packages superlu_dist or mumps? > > I realized that LU or Cholesky decomposition does not work with MPIAIJ > matrices. > > I also know the best way is probably to directly call Vector operations > directly. > > However, I want to keep the same KSPSolve structure so that the same code can > be used for non-diagonal MPIAIJ matrices without changing each call to > KSPSolve. > > Thank you very much. 
> > Without changing your application code, i.e., keep the same KSPSolve > structure, > running it with the option > '-pc_type jacobi' > actually inverts the diagonal matrix, in both sequential and parallel > cases. > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From knepley at gmail.com Fri Apr 11 16:04:54 2008 From: knepley at gmail.com (Matthew Knepley) Date: Fri, 11 Apr 2008 16:04:54 -0500 Subject: Further question about PC with Jaocbi Row Sum In-Reply-To: <548693.2032.qm@web36202.mail.mud.yahoo.com> References: <548693.2032.qm@web36202.mail.mud.yahoo.com> Message-ID: On Fri, Apr 11, 2008 at 3:56 PM, Shi Jin wrote: > Thank you. > Suppose I have a diagonal matrix, what is the best way to invert it in PETSc? If you have a diagonal matrix, you just use -ksp_type preonly -pc_type jacobi Matt > Do I have to install the external packages superlu_dist or mumps? > I realized that LU or Cholesky decomposition does not work with MPIAIJ matrices. > I also know the best way is probably to directly call Vector operations directly. > However, I want to keep the same KSPSolve structure so that the same code can be used for non-diagonal MPIAIJ matrices without changing each call to KSPSolve. > Thank you very much. > > Shi > > Then you may try direct sparse linear solver, > > sequential run: > > -ksp_type preonly -pc_type cholesky > > parallel run (install external packages superlu_dist or mumps): > > -ksp_type preonly -pc_type lu -mat_type superlu_dist > > or > > -ksp_type preonly -pc_type cholesky -mat_type sbaijmumps > > > > Hong > > > > > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From recrusader at gmail.com Sat Apr 12 12:52:50 2008 From: recrusader at gmail.com (Yujie) Date: Sat, 12 Apr 2008 10:52:50 -0700 Subject: how to create sequential Vec or Mat in parallel mode. Message-ID: <7ff0ee010804121052j1d4d517foae52149ea7b79ac8@mail.gmail.com> Now, I use several processor to run my codes. However, I need to create sequential Vec and Mat. I use VecCreateSeq() to create Vec. I get error information. How to get it? thanks a lot. Regards, Yujie -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Sat Apr 12 13:16:15 2008 From: knepley at gmail.com (Matthew Knepley) Date: Sat, 12 Apr 2008 13:16:15 -0500 Subject: how to create sequential Vec or Mat in parallel mode. In-Reply-To: <7ff0ee010804121052j1d4d517foae52149ea7b79ac8@mail.gmail.com> References: <7ff0ee010804121052j1d4d517foae52149ea7b79ac8@mail.gmail.com> Message-ID: You cannot create a sequential Vec with a parallel communicator. If you truly want a VecSeq, use PETSC_COMM_SELF. Matt On Sat, Apr 12, 2008 at 12:52 PM, Yujie wrote: > Now, I use several processor to run my codes. However, I need to create > sequential Vec and Mat. I use VecCreateSeq() to create Vec. I get error > information. How to get it? thanks a lot. > > Regards, > Yujie > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener From recrusader at gmail.com Sat Apr 12 13:21:27 2008 From: recrusader at gmail.com (Yujie) Date: Sat, 12 Apr 2008 11:21:27 -0700 Subject: how to create sequential Vec or Mat in parallel mode. In-Reply-To: References: <7ff0ee010804121052j1d4d517foae52149ea7b79ac8@mail.gmail.com> Message-ID: <7ff0ee010804121121v7267bc69v421a5f5b1fe495b8@mail.gmail.com> I got it. thanks a lot:) Regards, Yujie On 4/12/08, Matthew Knepley wrote: > > You cannot create a sequential Vec with a parallel communicator. If > you truly want > a VecSeq, use PETSC_COMM_SELF. > > Matt > > > On Sat, Apr 12, 2008 at 12:52 PM, Yujie wrote: > > Now, I use several processor to run my codes. However, I need to create > > sequential Vec and Mat. I use VecCreateSeq() to create Vec. I get error > > information. How to get it? thanks a lot. > > > > Regards, > > Yujie > > > > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zonexo at gmail.com Sun Apr 13 04:12:41 2008 From: zonexo at gmail.com (Ben Tay) Date: Sun, 13 Apr 2008 17:12:41 +0800 Subject: Slow speed after changing from serial to parallel Message-ID: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> Hi, I've a serial 2D CFD code. As my grid size requirement increases, the simulation takes longer. Also, memory requirement becomes a problem. Grid size 've reached 1200x1200. Going higher is not possible due to memory problem. I tried to convert my code to a parallel one, following the examples given. I also need to restructure parts of my code to enable parallel looping. I 1st changed the PETSc solver to be parallel enabled and then I restructured parts of my code. I proceed on as longer as the answer for a simple test case is correct. I thought it's not really possible to do any speed testing since the code is not fully parallelized yet. When I finished during most of the conversion, I found that in the actual run that it is much slower, although the answer is correct. So what is the remedy now? I wonder what I should do to check what's wrong. Must I restart everything again? Btw, my grid size is 1200x1200. I believed it should be suitable for parallel run of 4 processors? Is that so? Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Sun Apr 13 12:47:41 2008 From: knepley at gmail.com (Matthew Knepley) Date: Sun, 13 Apr 2008 12:47:41 -0500 Subject: Slow speed after changing from serial to parallel In-Reply-To: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> Message-ID: 1) There is no way to have any idea what is going on in your code without -log_summary output 2) Looking at that output, look at the percentage taken by the solver KSPSolve event. I suspect it is not the biggest component, because it is very scalable. Matt On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: > > > Hi, > > I've a serial 2D CFD code. As my grid size requirement increases, the > simulation takes longer. Also, memory requirement becomes a problem. Grid > size 've reached 1200x1200. Going higher is not possible due to memory > problem. > > I tried to convert my code to a parallel one, following the examples given. > I also need to restructure parts of my code to enable parallel looping. 
I > 1st changed the PETSc solver to be parallel enabled and then I restructured > parts of my code. I proceed on as longer as the answer for a simple test > case is correct. I thought it's not really possible to do any speed testing > since the code is not fully parallelized yet. When I finished during most of > the conversion, I found that in the actual run that it is much slower, > although the answer is correct. > > So what is the remedy now? I wonder what I should do to check what's wrong. > Must I restart everything again? Btw, my grid size is 1200x1200. I believed > it should be suitable for parallel run of 4 processors? Is that so? > > Thank you. -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From zonexo at gmail.com Mon Apr 14 05:49:34 2008 From: zonexo at gmail.com (Ben Tay) Date: Mon, 14 Apr 2008 18:49:34 +0800 Subject: Slow speed after changing from serial to parallel In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> Message-ID: <480336BE.3070507@gmail.com> Thank you Matthew. Sorry to trouble you again. I tried to run it with -log_summary output and I found that there's some errors in the execution. Well, I was busy with other things and I just came back to this problem. Some of my files on the server has also been deleted. It has been a while and I remember that it worked before, only much slower. Anyway, most of the serial code has been updated and maybe it's easier to convert the new serial code instead of debugging on the old parallel code now. I believe I can still reuse part of the old parallel code. However, I hope I can approach it better this time. So supposed I need to start converting my new serial code to parallel. There's 2 eqns to be solved using PETSc, the momentum and poisson. I also need to parallelize other parts of my code. I wonder which route is the best: 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, modify other parts of my code to parallel e.g. looping, updating of values etc. Once the execution is fine and speedup is reasonable, then modify the PETSc part - poisson eqn 1st followed by the momentum eqn. 2. Reverse the above order ie modify the PETSc part - poisson eqn 1st followed by the momentum eqn. Then do other parts of my code. I'm not sure if the above 2 mtds can work or if there will be conflicts. Of course, an alternative will be: 3. Do the poisson, momentum eqns and other parts of the code separately. That is, code a standalone parallel poisson eqn and use samples values to test it. Same for the momentum and other parts of the code. When each of them is working, combine them to form the full parallel code. However, this will be much more troublesome. I hope someone can give me some recommendations. Thank you once again. Matthew Knepley wrote: > 1) There is no way to have any idea what is going on in your code > without -log_summary output > > 2) Looking at that output, look at the percentage taken by the solver > KSPSolve event. I suspect it is not the biggest component, because > it is very scalable. > > Matt > > On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: > >> Hi, >> >> I've a serial 2D CFD code. As my grid size requirement increases, the >> simulation takes longer. Also, memory requirement becomes a problem. Grid >> size 've reached 1200x1200. Going higher is not possible due to memory >> problem. 
>> >> I tried to convert my code to a parallel one, following the examples given. >> I also need to restructure parts of my code to enable parallel looping. I >> 1st changed the PETSc solver to be parallel enabled and then I restructured >> parts of my code. I proceed on as longer as the answer for a simple test >> case is correct. I thought it's not really possible to do any speed testing >> since the code is not fully parallelized yet. When I finished during most of >> the conversion, I found that in the actual run that it is much slower, >> although the answer is correct. >> >> So what is the remedy now? I wonder what I should do to check what's wrong. >> Must I restart everything again? Btw, my grid size is 1200x1200. I believed >> it should be suitable for parallel run of 4 processors? Is that so? >> >> Thank you. >> > > > > From knepley at gmail.com Mon Apr 14 08:23:48 2008 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 14 Apr 2008 08:23:48 -0500 Subject: Slow speed after changing from serial to parallel In-Reply-To: <480336BE.3070507@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> Message-ID: I am not sure why you would ever have two codes. I never do this. PETSc is designed to write one code to run in serial and parallel. The PETSc part should look identical. To test, run the code yo uhave verified in serial and output PETSc data structures (like Mat and Vec) using a binary viewer. Then run in parallel with the same code, which will output the same structures. Take the two files and write a small verification code that loads both versions and calls MatEqual and VecEqual. Matt On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote: > Thank you Matthew. Sorry to trouble you again. > > I tried to run it with -log_summary output and I found that there's some > errors in the execution. Well, I was busy with other things and I just came > back to this problem. Some of my files on the server has also been deleted. > It has been a while and I remember that it worked before, only much > slower. > > Anyway, most of the serial code has been updated and maybe it's easier to > convert the new serial code instead of debugging on the old parallel code > now. I believe I can still reuse part of the old parallel code. However, I > hope I can approach it better this time. > > So supposed I need to start converting my new serial code to parallel. > There's 2 eqns to be solved using PETSc, the momentum and poisson. I also > need to parallelize other parts of my code. I wonder which route is the > best: > > 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, modify > other parts of my code to parallel e.g. looping, updating of values etc. > Once the execution is fine and speedup is reasonable, then modify the PETSc > part - poisson eqn 1st followed by the momentum eqn. > > 2. Reverse the above order ie modify the PETSc part - poisson eqn 1st > followed by the momentum eqn. Then do other parts of my code. > > I'm not sure if the above 2 mtds can work or if there will be conflicts. Of > course, an alternative will be: > > 3. Do the poisson, momentum eqns and other parts of the code separately. > That is, code a standalone parallel poisson eqn and use samples values to > test it. Same for the momentum and other parts of the code. When each of > them is working, combine them to form the full parallel code. However, this > will be much more troublesome. > > I hope someone can give me some recommendations. 
> > Thank you once again. > > > > Matthew Knepley wrote: > > > 1) There is no way to have any idea what is going on in your code > > without -log_summary output > > > > 2) Looking at that output, look at the percentage taken by the solver > > KSPSolve event. I suspect it is not the biggest component, because > > it is very scalable. > > > > Matt > > > > On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: > > > > > > > Hi, > > > > > > I've a serial 2D CFD code. As my grid size requirement increases, the > > > simulation takes longer. Also, memory requirement becomes a problem. > Grid > > > size 've reached 1200x1200. Going higher is not possible due to memory > > > problem. > > > > > > I tried to convert my code to a parallel one, following the examples > given. > > > I also need to restructure parts of my code to enable parallel looping. > I > > > 1st changed the PETSc solver to be parallel enabled and then I > restructured > > > parts of my code. I proceed on as longer as the answer for a simple test > > > case is correct. I thought it's not really possible to do any speed > testing > > > since the code is not fully parallelized yet. When I finished during > most of > > > the conversion, I found that in the actual run that it is much slower, > > > although the answer is correct. > > > > > > So what is the remedy now? I wonder what I should do to check what's > wrong. > > > Must I restart everything again? Btw, my grid size is 1200x1200. I > believed > > > it should be suitable for parallel run of 4 processors? Is that so? > > > > > > Thank you. > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From zonexo at gmail.com Mon Apr 14 08:43:36 2008 From: zonexo at gmail.com (Ben Tay) Date: Mon, 14 Apr 2008 21:43:36 +0800 Subject: Slow speed after changing from serial to parallel In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> Message-ID: <48035F88.2080003@gmail.com> Hi Matthew, I think you've misunderstood what I meant. What I'm trying to say is initially I've got a serial code. I tried to convert to a parallel one. Then I tested it and it was pretty slow. Due to some work requirement, I need to go back to make some changes to my code. Since the parallel is not working well, I updated and changed the serial one. Well, that was a while ago and now, due to the updates and changes, the serial code is different from the old converted parallel code. Some files were also deleted and I can't seem to get it working now. So I thought I might as well convert the new serial code to parallel. But I'm not very sure what I should do 1st. Maybe I should rephrase my question in that if I just convert my poisson equation subroutine from a serial PETSc to a parallel PETSc version, will it work? Should I expect a speedup? The rest of my code is still serial. Thank you very much. Matthew Knepley wrote: > I am not sure why you would ever have two codes. I never do this. PETSc > is designed to write one code to run in serial and parallel. The PETSc part > should look identical. To test, run the code yo uhave verified in serial and > output PETSc data structures (like Mat and Vec) using a binary viewer. > Then run in parallel with the same code, which will output the same > structures. 
Take the two files and write a small verification code that > loads both versions and calls MatEqual and VecEqual. > > Matt > > On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote: > >> Thank you Matthew. Sorry to trouble you again. >> >> I tried to run it with -log_summary output and I found that there's some >> errors in the execution. Well, I was busy with other things and I just came >> back to this problem. Some of my files on the server has also been deleted. >> It has been a while and I remember that it worked before, only much >> slower. >> >> Anyway, most of the serial code has been updated and maybe it's easier to >> convert the new serial code instead of debugging on the old parallel code >> now. I believe I can still reuse part of the old parallel code. However, I >> hope I can approach it better this time. >> >> So supposed I need to start converting my new serial code to parallel. >> There's 2 eqns to be solved using PETSc, the momentum and poisson. I also >> need to parallelize other parts of my code. I wonder which route is the >> best: >> >> 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, modify >> other parts of my code to parallel e.g. looping, updating of values etc. >> Once the execution is fine and speedup is reasonable, then modify the PETSc >> part - poisson eqn 1st followed by the momentum eqn. >> >> 2. Reverse the above order ie modify the PETSc part - poisson eqn 1st >> followed by the momentum eqn. Then do other parts of my code. >> >> I'm not sure if the above 2 mtds can work or if there will be conflicts. Of >> course, an alternative will be: >> >> 3. Do the poisson, momentum eqns and other parts of the code separately. >> That is, code a standalone parallel poisson eqn and use samples values to >> test it. Same for the momentum and other parts of the code. When each of >> them is working, combine them to form the full parallel code. However, this >> will be much more troublesome. >> >> I hope someone can give me some recommendations. >> >> Thank you once again. >> >> >> >> Matthew Knepley wrote: >> >> >>> 1) There is no way to have any idea what is going on in your code >>> without -log_summary output >>> >>> 2) Looking at that output, look at the percentage taken by the solver >>> KSPSolve event. I suspect it is not the biggest component, because >>> it is very scalable. >>> >>> Matt >>> >>> On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: >>> >>> >>> >>>> Hi, >>>> >>>> I've a serial 2D CFD code. As my grid size requirement increases, the >>>> simulation takes longer. Also, memory requirement becomes a problem. >>>> >> Grid >> >>>> size 've reached 1200x1200. Going higher is not possible due to memory >>>> problem. >>>> >>>> I tried to convert my code to a parallel one, following the examples >>>> >> given. >> >>>> I also need to restructure parts of my code to enable parallel looping. >>>> >> I >> >>>> 1st changed the PETSc solver to be parallel enabled and then I >>>> >> restructured >> >>>> parts of my code. I proceed on as longer as the answer for a simple test >>>> case is correct. I thought it's not really possible to do any speed >>>> >> testing >> >>>> since the code is not fully parallelized yet. When I finished during >>>> >> most of >> >>>> the conversion, I found that in the actual run that it is much slower, >>>> although the answer is correct. >>>> >>>> So what is the remedy now? I wonder what I should do to check what's >>>> >> wrong. >> >>>> Must I restart everything again? Btw, my grid size is 1200x1200. 
I >>>> >> believed >> >>>> it should be suitable for parallel run of 4 processors? Is that so? >>>> >>>> Thank you. >>>> >>>> >>>> >>> >>> >>> >>> >> > > > > From knepley at gmail.com Mon Apr 14 08:58:20 2008 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 14 Apr 2008 08:58:20 -0500 Subject: Slow speed after changing from serial to parallel In-Reply-To: <48035F88.2080003@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> Message-ID: On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay wrote: > Hi Matthew, > > I think you've misunderstood what I meant. What I'm trying to say is > initially I've got a serial code. I tried to convert to a parallel one. Then > I tested it and it was pretty slow. Due to some work requirement, I need to > go back to make some changes to my code. Since the parallel is not working > well, I updated and changed the serial one. > > Well, that was a while ago and now, due to the updates and changes, the > serial code is different from the old converted parallel code. Some files > were also deleted and I can't seem to get it working now. So I thought I > might as well convert the new serial code to parallel. But I'm not very sure > what I should do 1st. > > Maybe I should rephrase my question in that if I just convert my poisson > equation subroutine from a serial PETSc to a parallel PETSc version, will it > work? Should I expect a speedup? The rest of my code is still serial. You should, of course, only expect speedup in the parallel parts Matt > Thank you very much. > > > > Matthew Knepley wrote: > > > I am not sure why you would ever have two codes. I never do this. PETSc > > is designed to write one code to run in serial and parallel. The PETSc > part > > should look identical. To test, run the code yo uhave verified in serial > and > > output PETSc data structures (like Mat and Vec) using a binary viewer. > > Then run in parallel with the same code, which will output the same > > structures. Take the two files and write a small verification code that > > loads both versions and calls MatEqual and VecEqual. > > > > Matt > > > > On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote: > > > > > > > Thank you Matthew. Sorry to trouble you again. > > > > > > I tried to run it with -log_summary output and I found that there's > some > > > errors in the execution. Well, I was busy with other things and I just > came > > > back to this problem. Some of my files on the server has also been > deleted. > > > It has been a while and I remember that it worked before, only much > > > slower. > > > > > > Anyway, most of the serial code has been updated and maybe it's easier > to > > > convert the new serial code instead of debugging on the old parallel > code > > > now. I believe I can still reuse part of the old parallel code. However, > I > > > hope I can approach it better this time. > > > > > > So supposed I need to start converting my new serial code to parallel. > > > There's 2 eqns to be solved using PETSc, the momentum and poisson. I > also > > > need to parallelize other parts of my code. I wonder which route is the > > > best: > > > > > > 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, > modify > > > other parts of my code to parallel e.g. looping, updating of values etc. > > > Once the execution is fine and speedup is reasonable, then modify the > PETSc > > > part - poisson eqn 1st followed by the momentum eqn. > > > > > > 2. 
Reverse the above order ie modify the PETSc part - poisson eqn 1st > > > followed by the momentum eqn. Then do other parts of my code. > > > > > > I'm not sure if the above 2 mtds can work or if there will be > conflicts. Of > > > course, an alternative will be: > > > > > > 3. Do the poisson, momentum eqns and other parts of the code > separately. > > > That is, code a standalone parallel poisson eqn and use samples values > to > > > test it. Same for the momentum and other parts of the code. When each of > > > them is working, combine them to form the full parallel code. However, > this > > > will be much more troublesome. > > > > > > I hope someone can give me some recommendations. > > > > > > Thank you once again. > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > > > > 1) There is no way to have any idea what is going on in your code > > > > without -log_summary output > > > > > > > > 2) Looking at that output, look at the percentage taken by the solver > > > > KSPSolve event. I suspect it is not the biggest component, because > > > > it is very scalable. > > > > > > > > Matt > > > > > > > > On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > I've a serial 2D CFD code. As my grid size requirement increases, > the > > > > > simulation takes longer. Also, memory requirement becomes a problem. > > > > > > > > > > > > > > > > > Grid > > > > > > > > > > > > > > > size 've reached 1200x1200. Going higher is not possible due to > memory > > > > > problem. > > > > > > > > > > I tried to convert my code to a parallel one, following the examples > > > > > > > > > > > > > > > > > given. > > > > > > > > > > > > > > > I also need to restructure parts of my code to enable parallel > looping. > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > 1st changed the PETSc solver to be parallel enabled and then I > > > > > > > > > > > > > > > > > restructured > > > > > > > > > > > > > > > parts of my code. I proceed on as longer as the answer for a simple > test > > > > > case is correct. I thought it's not really possible to do any speed > > > > > > > > > > > > > > > > > testing > > > > > > > > > > > > > > > since the code is not fully parallelized yet. When I finished during > > > > > > > > > > > > > > > > > most of > > > > > > > > > > > > > > > the conversion, I found that in the actual run that it is much > slower, > > > > > although the answer is correct. > > > > > > > > > > So what is the remedy now? I wonder what I should do to check what's > > > > > > > > > > > > > > > > > wrong. > > > > > > > > > > > > > > > Must I restart everything again? Btw, my grid size is 1200x1200. I > > > > > > > > > > > > > > > > > believed > > > > > > > > > > > > > > > it should be suitable for parallel run of 4 processors? Is that so? > > > > > > > > > > Thank you. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From pivello at gmail.com Tue Apr 15 09:22:54 2008 From: pivello at gmail.com (=?ISO-8859-1?Q?M=E1rcio_Ricardo_Pivello?=) Date: Tue, 15 Apr 2008 11:22:54 -0300 Subject: PETSc + HYPRE Message-ID: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> Hi, I want to use hypre preconditioners coupled with PETSc, but so far I have not succeeded. 
Here's what I've done: Firstly I create the preconditioner: Mat A_Par(NSubSteps) Vec Unk_Par(NSubSteps) Vec B_Load_Par(NSubSteps) KSP KspSolv ---> PC precond ****************************** Later in the code I set the preconditioner type and create the Krylov solver: ----> call PCSetType(precond,'hypre',iError) ----> call PCHYPRESetType(precond,'boomeramg',iError) ----> call KSPCreate (PETSC_COMM_WORLD, KspSolv, iError) ----> call KSPSetFromOptions (KspSolv, iError) call KSPSetOperators (KspSolv, A_Par(nstp), A_Par(nstp), SAME_NONZERO_PATTERN, iError) call KSPSolve (KspSolv, B_Load_Par(nstp), Unk_Par(nstp), iError) *************************** Then, when I run the program I put the following options in the command line: mpirun -np 2 /home/mpivello/bin/SolverGP.x -pc_type hypre -pc_hypre_type boomeramg -pc_hypre_boomeramg_sweep_all true -pc_hypre_boomeramg_grid_sweeps 1 -pc_hypre_boomeramg_strong_threshold 0.9 -pc_hypre_boomeramg_max_iter 5 -pc_hypre_boomeramg_coarsen_type modifiedRuge-Stueben -f0 dummy.tmp 2>&1 -ksp_gmres_restart 200 -ksp_max_it 3000 -ksp_rtol 1.0e-10 -ksp_atol 1.0e-15 -ksp_monitor -log_summary < /dev/null > run.parallel.log & But this proceeding is not working. What am I doing wrong? Thanks in advance M?rcio Ricardo -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Tue Apr 15 09:36:23 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 09:36:23 -0500 Subject: PETSc + HYPRE In-Reply-To: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> References: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> Message-ID: On Tue, Apr 15, 2008 at 9:22 AM, M?rcio Ricardo Pivello wrote: > Hi, I want to use hypre preconditioners coupled with PETSc, but so far I > have not succeeded. Here's what I've done: > > Firstly I create the preconditioner: > > > Mat A_Par(NSubSteps) > Vec Unk_Par(NSubSteps) > Vec B_Load_Par(NSubSteps) > KSP KspSolv > ---> PC precond > > ****************************** > > Later in the code I set the preconditioner type and create the Krylov > solver: > > ----> call PCSetType(precond,'hypre',iError) > ----> call PCHYPRESetType(precond,'boomeramg',iError) > ----> call KSPCreate (PETSC_COMM_WORLD, KspSolv, iError) > ----> call KSPSetFromOptions (KspSolv, iError) > call KSPSetOperators (KspSolv, A_Par(nstp), A_Par(nstp), > SAME_NONZERO_PATTERN, iError) > call KSPSolve (KspSolv, B_Load_Par(nstp), Unk_Par(nstp), iError) > > > *************************** > > Then, when I run the program I put the following options in the command > line: > > mpirun -np 2 /home/mpivello/bin/SolverGP.x -pc_type hypre -pc_hypre_type > boomeramg -pc_hypre_boomeramg_sweep_all true -pc_hypre_boomeramg_grid_sweeps > 1 -pc_hypre_boomeramg_strong_threshold 0.9 -pc_hypre_boomeramg_max_iter 5 > -pc_hypre_boomeramg_coarsen_type modifiedRuge-Stueben -f0 dummy.tmp 2>&1 > -ksp_gmres_restart 200 -ksp_max_it 3000 -ksp_rtol 1.0e-10 -ksp_atol 1.0e-15 > -ksp_monitor -log_summary < /dev/null > run.parallel.log & > > But this proceeding is not working. What am I doing wrong? What does "not working" mean? 1) What is actually being run? Use -ksp_view to find out (always). 2) Above you set the PC type before creating the KSP. How does the KSP know about the PC? You should retrieve the PC from the KSP using KSPGetPC() and then customize it. Better yet, do everything from the command line -pc_type hypre -pc_hypre_type boomeramg ... 3) Did you configure with HYPRE? 
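For reference, a minimal sketch of the ordering suggested in point 2) — create the KSP first, pull out its PC with KSPGetPC(), then set the hypre/boomeramg type — shown with the C interface of that PETSc generation (the Fortran calls mirror it with the trailing error argument); A, b, and x are placeholders for the assembled operator and vectors:

  #include "petscksp.h"

  PetscErrorCode SolveWithBoomerAMG(Mat A, Vec b, Vec x)
  {
    KSP            ksp;
    PC             pc;
    PetscErrorCode ierr;

    ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN); CHKERRQ(ierr);

    /* configure the PC that belongs to this KSP, rather than a separately created PC object */
    ierr = KSPGetPC(ksp, &pc); CHKERRQ(ierr);
    ierr = PCSetType(pc, "hypre"); CHKERRQ(ierr);
    ierr = PCHYPRESetType(pc, "boomeramg"); CHKERRQ(ierr);

    /* let -ksp_* and -pc_hypre_boomeramg_* command-line options override these defaults */
    ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);

    ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
    ierr = KSPDestroy(ksp); CHKERRQ(ierr);
    return 0;
  }

With this ordering the command-line options listed earlier are picked up by KSPSetFromOptions(), and -ksp_view will show whether hypre/boomeramg is actually being used.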
Matt

> Thanks in advance
>
> Márcio Ricardo

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. - Norbert Wiener

From dalcinl at gmail.com Tue Apr 15 09:52:03 2008 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Tue, 15 Apr 2008 11:52:03 -0300 Subject: PETSc + HYPRE In-Reply-To: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> References: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> Message-ID: Do not create the PC !! Create first the KSP, next do KSPGetPC, and then configure the PC

On 4/15/08, Márcio Ricardo Pivello wrote:
> Hi, I want to use hypre preconditioners coupled with PETSc, but so far I
> have not succeeded. Here's what I've done:
>
> Firstly I create the preconditioner:
>
> Mat A_Par(NSubSteps)
> Vec Unk_Par(NSubSteps)
> Vec B_Load_Par(NSubSteps)
> KSP KspSolv
> ---> PC precond
>
> ******************************
>
> Later in the code I set the preconditioner type and create the Krylov
> solver:
>
> ----> call PCSetType(precond,'hypre',iError)
> ----> call PCHYPRESetType(precond,'boomeramg',iError)
> ----> call KSPCreate (PETSC_COMM_WORLD, KspSolv, iError)
> ----> call KSPSetFromOptions (KspSolv, iError)
> call KSPSetOperators (KspSolv, A_Par(nstp), A_Par(nstp),
> SAME_NONZERO_PATTERN, iError)
> call KSPSolve (KspSolv, B_Load_Par(nstp), Unk_Par(nstp), iError)
>
> ***************************
>
> Then, when I run the program I put the following options in the command
> line:
>
> mpirun -np 2 /home/mpivello/bin/SolverGP.x -pc_type hypre -pc_hypre_type
> boomeramg -pc_hypre_boomeramg_sweep_all true -pc_hypre_boomeramg_grid_sweeps
> 1 -pc_hypre_boomeramg_strong_threshold 0.9 -pc_hypre_boomeramg_max_iter 5
> -pc_hypre_boomeramg_coarsen_type modifiedRuge-Stueben -f0 dummy.tmp 2>&1
> -ksp_gmres_restart 200 -ksp_max_it 3000 -ksp_rtol 1.0e-10 -ksp_atol 1.0e-15
> -ksp_monitor -log_summary < /dev/null > run.parallel.log &
>
> But this procedure is not working. What am I doing wrong?
>
> Thanks in advance
>
> Márcio Ricardo

--
Lisandro Dalcín
---------------
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594

From zonexo at gmail.com Tue Apr 15 10:33:20 2008 From: zonexo at gmail.com (Ben Tay) Date: Tue, 15 Apr 2008 23:33:20 +0800 Subject: Slow speed after changing from serial to parallel In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> Message-ID: <4804CAC0.6060201@gmail.com> Hi,

I have converted the poisson eqn part of the CFD code to parallel. The grid size tested is 600x720. For the momentum eqn, I used another serial linear solver (nspcg) to prevent mixing of results.
Here's the output summary: --- Event Stage 0: Main Stage MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11100100 0 10 11100100 0 217 MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0* MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 1.7e+04 89100100100100 89100100100100 317 PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 8.8e+03 9 2 0 0 51 9 2 0 0 51 42* *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 636* VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 0.0e+00 0 0100100 0 0 0100100 0 0* *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0* *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 8.8e+03 9 4 0 0 51 9 4 0 0 51 62* ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. --- Event Stage 0: Main Stage Matrix 4 4 49227380 0 Krylov Solver 2 2 17216 0 Preconditioner 2 2 256 0 Index Set 5 5 2596120 0 Vec 40 40 62243224 0 Vec Scatter 1 1 0 0 ======================================================================================================================== Average time to get PetscTime(): 4.05312e-07 Average time for MPI_Barrier(): 7.62939e-07 Average time for zero size MPI_Send(): 2.02656e-06 OptionTable: -log_summary The PETSc manual states that ratio should be close to 1. There's quite a few *(in bold)* which are >1 and MatAssemblyBegin seems to be very big. So what could be the cause? 
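A large max/min ratio on MatAssemblyBegin generally reflects one of two things: values inserted into rows owned by the other process (such entries are stashed locally and only exchanged inside MatAssemblyBegin/End), or one process simply reaching the assembly call much later than the other. A minimal way to check for the first case is sketched here; Istart and Iend are placeholder names, while A_mat and the row index II are the ones used in the insertion loop described below:

      ! Sketch only: rows [Istart, Iend) of A_mat are owned by this process
      call MatGetOwnershipRange(A_mat, Istart, Iend, ierr)

      ! Inside the insertion loop, for each global row index II about to be set:
      if (II .lt. Istart .or. II .ge. Iend) then
         ! Off-process entry: it is stashed locally and only sent to its owner
         ! during MatAssemblyBegin/MatAssemblyEnd
         print *, 'off-process insertion into row ', II
      end if

If nothing is reported, the big ratio points to the second cause, i.e. one process doing much more work than the other before the assembly call.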
I wonder if it has to do the way I insert the matrix. My steps are: (cartesian grids, i loop faster than j, fortran) For matrix A and rhs Insert left extreme cells values belonging to myid if (myid==0) then insert corner cells values insert south cells values insert internal cells values else if (myid==num_procs-1) then insert corner cells values insert north cells values insert internal cells values else insert internal cells values end if Insert right extreme cells values belonging to myid All these values are entered into a big_A(size_x*size_y,5) matrix. int_A stores the position of the values. I then do call MatZeroEntries(A_mat,ierr) do k=ksta_p+1,kend_p !for cells belonging to myid do kk=1,5 II=k-1 JJ=int_A(k,kk)-1 call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr) end do end do call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr) call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr) I wonder if the problem lies here.I used the big_A matrix because I was migrating from an old linear solver. Lastly, I was told to widen my window to 120 characters. May I know how do I do it? Thank you very much. Matthew Knepley wrote: > On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay wrote: > >> Hi Matthew, >> >> I think you've misunderstood what I meant. What I'm trying to say is >> initially I've got a serial code. I tried to convert to a parallel one. Then >> I tested it and it was pretty slow. Due to some work requirement, I need to >> go back to make some changes to my code. Since the parallel is not working >> well, I updated and changed the serial one. >> >> Well, that was a while ago and now, due to the updates and changes, the >> serial code is different from the old converted parallel code. Some files >> were also deleted and I can't seem to get it working now. So I thought I >> might as well convert the new serial code to parallel. But I'm not very sure >> what I should do 1st. >> >> Maybe I should rephrase my question in that if I just convert my poisson >> equation subroutine from a serial PETSc to a parallel PETSc version, will it >> work? Should I expect a speedup? The rest of my code is still serial. >> > > You should, of course, only expect speedup in the parallel parts > > Matt > > >> Thank you very much. >> >> >> >> Matthew Knepley wrote: >> >> >>> I am not sure why you would ever have two codes. I never do this. PETSc >>> is designed to write one code to run in serial and parallel. The PETSc >>> >> part >> >>> should look identical. To test, run the code yo uhave verified in serial >>> >> and >> >>> output PETSc data structures (like Mat and Vec) using a binary viewer. >>> Then run in parallel with the same code, which will output the same >>> structures. Take the two files and write a small verification code that >>> loads both versions and calls MatEqual and VecEqual. >>> >>> Matt >>> >>> On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote: >>> >>> >>> >>>> Thank you Matthew. Sorry to trouble you again. >>>> >>>> I tried to run it with -log_summary output and I found that there's >>>> >> some >> >>>> errors in the execution. Well, I was busy with other things and I just >>>> >> came >> >>>> back to this problem. Some of my files on the server has also been >>>> >> deleted. >> >>>> It has been a while and I remember that it worked before, only much >>>> slower. >>>> >>>> Anyway, most of the serial code has been updated and maybe it's easier >>>> >> to >> >>>> convert the new serial code instead of debugging on the old parallel >>>> >> code >> >>>> now. 
I believe I can still reuse part of the old parallel code. However, >>>> >> I >> >>>> hope I can approach it better this time. >>>> >>>> So supposed I need to start converting my new serial code to parallel. >>>> There's 2 eqns to be solved using PETSc, the momentum and poisson. I >>>> >> also >> >>>> need to parallelize other parts of my code. I wonder which route is the >>>> best: >>>> >>>> 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, >>>> >> modify >> >>>> other parts of my code to parallel e.g. looping, updating of values etc. >>>> Once the execution is fine and speedup is reasonable, then modify the >>>> >> PETSc >> >>>> part - poisson eqn 1st followed by the momentum eqn. >>>> >>>> 2. Reverse the above order ie modify the PETSc part - poisson eqn 1st >>>> followed by the momentum eqn. Then do other parts of my code. >>>> >>>> I'm not sure if the above 2 mtds can work or if there will be >>>> >> conflicts. Of >> >>>> course, an alternative will be: >>>> >>>> 3. Do the poisson, momentum eqns and other parts of the code >>>> >> separately. >> >>>> That is, code a standalone parallel poisson eqn and use samples values >>>> >> to >> >>>> test it. Same for the momentum and other parts of the code. When each of >>>> them is working, combine them to form the full parallel code. However, >>>> >> this >> >>>> will be much more troublesome. >>>> >>>> I hope someone can give me some recommendations. >>>> >>>> Thank you once again. >>>> >>>> >>>> >>>> Matthew Knepley wrote: >>>> >>>> >>>> >>>> >>>>> 1) There is no way to have any idea what is going on in your code >>>>> without -log_summary output >>>>> >>>>> 2) Looking at that output, look at the percentage taken by the solver >>>>> KSPSolve event. I suspect it is not the biggest component, because >>>>> it is very scalable. >>>>> >>>>> Matt >>>>> >>>>> On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>> Hi, >>>>>> >>>>>> I've a serial 2D CFD code. As my grid size requirement increases, >>>>>> >> the >> >>>>>> simulation takes longer. Also, memory requirement becomes a problem. >>>>>> >>>>>> >>>>>> >>>> Grid >>>> >>>> >>>> >>>>>> size 've reached 1200x1200. Going higher is not possible due to >>>>>> >> memory >> >>>>>> problem. >>>>>> >>>>>> I tried to convert my code to a parallel one, following the examples >>>>>> >>>>>> >>>>>> >>>> given. >>>> >>>> >>>> >>>>>> I also need to restructure parts of my code to enable parallel >>>>>> >> looping. >> >>>>>> >>>> I >>>> >>>> >>>> >>>>>> 1st changed the PETSc solver to be parallel enabled and then I >>>>>> >>>>>> >>>>>> >>>> restructured >>>> >>>> >>>> >>>>>> parts of my code. I proceed on as longer as the answer for a simple >>>>>> >> test >> >>>>>> case is correct. I thought it's not really possible to do any speed >>>>>> >>>>>> >>>>>> >>>> testing >>>> >>>> >>>> >>>>>> since the code is not fully parallelized yet. When I finished during >>>>>> >>>>>> >>>>>> >>>> most of >>>> >>>> >>>> >>>>>> the conversion, I found that in the actual run that it is much >>>>>> >> slower, >> >>>>>> although the answer is correct. >>>>>> >>>>>> So what is the remedy now? I wonder what I should do to check what's >>>>>> >>>>>> >>>>>> >>>> wrong. >>>> >>>> >>>> >>>>>> Must I restart everything again? Btw, my grid size is 1200x1200. I >>>>>> >>>>>> >>>>>> >>>> believed >>>> >>>> >>>> >>>>>> it should be suitable for parallel run of 4 processors? Is that so? >>>>>> >>>>>> Thank you. 
>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >>> >>> >> > > > > From knepley at gmail.com Tue Apr 15 10:46:17 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 10:46:17 -0500 Subject: Slow speed after changing from serial to parallel In-Reply-To: <4804CAC0.6060201@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> Message-ID: 1) Please never cut out parts of the summary. All the information is valuable, and most times, necessary 2) You seem to have huge load imbalance (look at VecNorm). Do you partition the system yourself. How many processes is this? 3) You seem to be setting a huge number of off-process values in the matrix (see MatAssemblyBegin). Is this true? I would reorganize this part. Matt On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay wrote: > Hi, > > I have converted the poisson eqn part of the CFD code to parallel. The grid > size tested is 600x720. For the momentum eqn, I used another serial linear > solver (nspcg) to prevent mixing of results. Here's the output summary: > > --- Event Stage 0: Main Stage > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 > 0.0e+00 10 11100100 0 10 11100100 0 217 > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 > MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 > 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0* > MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 > 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 > KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 > 1.7e+04 89100100100100 89100100100100 317 > PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 > VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 > 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 > *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 > 8.8e+03 9 2 0 0 51 9 2 0 0 51 42* > *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 > 0.0e+00 0 1 0 0 0 0 1 0 0 0 636* > VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 > VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 > 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 > VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 
0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0* > *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0* > *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 > 8.8e+03 9 4 0 0 51 9 4 0 0 51 62* > > ------------------------------------------------------------------------------------------------------------------------ > Memory usage is given in bytes: > Object Type Creations Destructions Memory Descendants' Mem. > --- Event Stage 0: Main Stage > Matrix 4 4 49227380 0 > Krylov Solver 2 2 17216 0 > Preconditioner 2 2 256 0 > Index Set 5 5 2596120 0 > Vec 40 40 62243224 0 > Vec Scatter 1 1 0 0 > ======================================================================================================================== > Average time to get PetscTime(): 4.05312e-07 Average time > for MPI_Barrier(): 7.62939e-07 > Average time for zero size MPI_Send(): 2.02656e-06 > OptionTable: -log_summary > > > The PETSc manual states that ratio should be close to 1. There's quite a > few *(in bold)* which are >1 and MatAssemblyBegin seems to be very big. So > what could be the cause? > > I wonder if it has to do the way I insert the matrix. My steps are: > (cartesian grids, i loop faster than j, fortran) > > For matrix A and rhs > > Insert left extreme cells values belonging to myid > > if (myid==0) then > > insert corner cells values > > insert south cells values > > insert internal cells values > > else if (myid==num_procs-1) then > > insert corner cells values > > insert north cells values > > insert internal cells values > > else > > insert internal cells values > > end if > > Insert right extreme cells values belonging to myid > > All these values are entered into a big_A(size_x*size_y,5) matrix. int_A > stores the position of the values. I then do > > call MatZeroEntries(A_mat,ierr) > > do k=ksta_p+1,kend_p !for cells belonging to myid > > do kk=1,5 > > II=k-1 > > JJ=int_A(k,kk)-1 > > call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr) > end do > > end do > > call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr) > > call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr) > > > I wonder if the problem lies here.I used the big_A matrix because I was > migrating from an old linear solver. Lastly, I was told to widen my window > to 120 characters. May I know how do I do it? > > > > Thank you very much. > > Matthew Knepley wrote: > > > On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay wrote: > > > > > > > Hi Matthew, > > > > > > I think you've misunderstood what I meant. What I'm trying to say is > > > initially I've got a serial code. I tried to convert to a parallel one. > Then > > > I tested it and it was pretty slow. Due to some work requirement, I need > to > > > go back to make some changes to my code. Since the parallel is not > working > > > well, I updated and changed the serial one. > > > > > > Well, that was a while ago and now, due to the updates and changes, the > > > serial code is different from the old converted parallel code. Some > files > > > were also deleted and I can't seem to get it working now. So I thought I > > > might as well convert the new serial code to parallel. But I'm not very > sure > > > what I should do 1st. > > > > > > Maybe I should rephrase my question in that if I just convert my > poisson > > > equation subroutine from a serial PETSc to a parallel PETSc version, > will it > > > work? 
Should I expect a speedup? The rest of my code is still serial. > > > > > > > > > > You should, of course, only expect speedup in the parallel parts > > > > Matt > > > > > > > > > Thank you very much. > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > > > > I am not sure why you would ever have two codes. I never do this. > PETSc > > > > is designed to write one code to run in serial and parallel. The PETSc > > > > > > > > > > > part > > > > > > > > > > should look identical. To test, run the code yo uhave verified in > serial > > > > > > > > > > > and > > > > > > > > > > output PETSc data structures (like Mat and Vec) using a binary viewer. > > > > Then run in parallel with the same code, which will output the same > > > > structures. Take the two files and write a small verification code > that > > > > loads both versions and calls MatEqual and VecEqual. > > > > > > > > Matt > > > > > > > > On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote: > > > > > > > > > > > > > > > > > > > > > Thank you Matthew. Sorry to trouble you again. > > > > > > > > > > I tried to run it with -log_summary output and I found that there's > > > > > > > > > > > > > > > > > some > > > > > > > > > > > > > > > errors in the execution. Well, I was busy with other things and I > just > > > > > > > > > > > > > > > > > came > > > > > > > > > > > > > > > back to this problem. Some of my files on the server has also been > > > > > > > > > > > > > > > > > deleted. > > > > > > > > > > > > > > > It has been a while and I remember that it worked before, only > much > > > > > slower. > > > > > > > > > > Anyway, most of the serial code has been updated and maybe it's > easier > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > convert the new serial code instead of debugging on the old parallel > > > > > > > > > > > > > > > > > code > > > > > > > > > > > > > > > now. I believe I can still reuse part of the old parallel code. > However, > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > hope I can approach it better this time. > > > > > > > > > > So supposed I need to start converting my new serial code to > parallel. > > > > > There's 2 eqns to be solved using PETSc, the momentum and poisson. I > > > > > > > > > > > > > > > > > also > > > > > > > > > > > > > > > need to parallelize other parts of my code. I wonder which route is > the > > > > > best: > > > > > > > > > > 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, > > > > > > > > > > > > > > > > > modify > > > > > > > > > > > > > > > other parts of my code to parallel e.g. looping, updating of values > etc. > > > > > Once the execution is fine and speedup is reasonable, then modify > the > > > > > > > > > > > > > > > > > PETSc > > > > > > > > > > > > > > > part - poisson eqn 1st followed by the momentum eqn. > > > > > > > > > > 2. Reverse the above order ie modify the PETSc part - poisson eqn > 1st > > > > > followed by the momentum eqn. Then do other parts of my code. > > > > > > > > > > I'm not sure if the above 2 mtds can work or if there will be > > > > > > > > > > > > > > > > > conflicts. Of > > > > > > > > > > > > > > > course, an alternative will be: > > > > > > > > > > 3. Do the poisson, momentum eqns and other parts of the code > > > > > > > > > > > > > > > > > separately. > > > > > > > > > > > > > > > That is, code a standalone parallel poisson eqn and use samples > values > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > test it. Same for the momentum and other parts of the code. 
When > each of > > > > > them is working, combine them to form the full parallel code. > However, > > > > > > > > > > > > > > > > > this > > > > > > > > > > > > > > > will be much more troublesome. > > > > > > > > > > I hope someone can give me some recommendations. > > > > > > > > > > Thank you once again. > > > > > > > > > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1) There is no way to have any idea what is going on in your code > > > > > > without -log_summary output > > > > > > > > > > > > 2) Looking at that output, look at the percentage taken by the > solver > > > > > > KSPSolve event. I suspect it is not the biggest component, > because > > > > > > it is very scalable. > > > > > > > > > > > > Matt > > > > > > > > > > > > On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > I've a serial 2D CFD code. As my grid size requirement > increases, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > > > > > > > > > > simulation takes longer. Also, memory requirement becomes a > problem. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Grid > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > size 've reached 1200x1200. Going higher is not possible due to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > memory > > > > > > > > > > > > > > > > > > > > > > > > > > > > problem. > > > > > > > > > > > > > > I tried to convert my code to a parallel one, following the > examples > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > given. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I also need to restructure parts of my code to enable parallel > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > looping. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1st changed the PETSc solver to be parallel enabled and then I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > restructured > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > parts of my code. I proceed on as longer as the answer for a > simple > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > test > > > > > > > > > > > > > > > > > > > > > > > > > > > > case is correct. I thought it's not really possible to do any > speed > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > testing > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > since the code is not fully parallelized yet. When I finished > during > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > most of > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the conversion, I found that in the actual run that it is much > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > slower, > > > > > > > > > > > > > > > > > > > > > > > > > > > > although the answer is correct. > > > > > > > > > > > > > > So what is the remedy now? I wonder what I should do to check > what's > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrong. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Must I restart everything again? Btw, my grid size is 1200x1200. 
> I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > believed > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > it should be suitable for parallel run of 4 processors? Is that > so? > > > > > > > > > > > > > > Thank you. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From zonexo at gmail.com Tue Apr 15 10:56:52 2008 From: zonexo at gmail.com (Ben Tay) Date: Tue, 15 Apr 2008 23:56:52 +0800 Subject: Slow speed after changing from serial to parallel In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> Message-ID: <4804D044.2060502@gmail.com> Oh sorry here's the whole information. I'm using 2 processors currently: ************************************************************************************************************************ *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document *** ************************************************************************************************************************ ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by g0306332 Tue Apr 15 23:03:09 2008 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b Max Max/Min Avg Total Time (sec): 1.114e+03 1.00054 1.114e+03 Objects: 5.400e+01 1.00000 5.400e+01 Flops: 1.574e+11 1.00000 1.574e+11 3.147e+11 Flops/sec: 1.414e+08 1.00054 1.413e+08 2.826e+08 MPI Messages: 8.777e+03 1.00000 8.777e+03 1.755e+04 MPI Message Lengths: 4.213e+07 1.00000 4.800e+03 8.425e+07 MPI Reductions: 8.644e+03 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 1.1136e+03 100.0% 3.1475e+11 100.0% 1.755e+04 100.0% 4.800e+03 100.0% 1.729e+04 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops/sec: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). 
%T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ ########################################################## # # # WARNING!!! # # # # This code was run without the PreLoadBegin() # # macros. To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. # ########################################################## Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11100100 0 10 11100100 0 217 MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0 MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 1.7e+04 89100100100100 89100100100100 317 PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 8.8e+03 9 2 0 0 51 9 2 0 0 51 42 VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 636 VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 0.0e+00 0 0100100 0 0 0100100 0 0 VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 
7.7 0.0e+00 0.0e+00 8.8e+03 9 4 0 0 51 9 4 0 0 51 62 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. --- Event Stage 0: Main Stage Matrix 4 4 49227380 0 Krylov Solver 2 2 17216 0 Preconditioner 2 2 256 0 Index Set 5 5 2596120 0 Vec 40 40 62243224 0 Vec Scatter 1 1 0 0 ======================================================================================================================== Average time to get PetscTime(): 4.05312e-07 Average time for MPI_Barrier(): 7.62939e-07 Average time for zero size MPI_Send(): 2.02656e-06 OptionTable: -log_summary Compiled without FORTRAN kernels Compiled with full precision matrices (default) Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 Configure run at: Tue Jan 8 22:22:08 2008 Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi-shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 ----------------------------------------- Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 Using PETSc arch: atlas3-mpi ----------------------------------------- Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. -fPIC -O ----------------------------------------- Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include - I/home/enduser/g0306332/lib/hypre/include -I/usr/local/topspin/mpi/mpich/include ------------------------------------------ Using C linker: mpicc -fPIC -O Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib -L/usr/local/topspin/mpi/mpich/lib -lmpich -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc ------------------------------------------ 1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (28major+153248minor)pagefaults 0swaps 387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (18major+158175minor)pagefaults 0swaps Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME ===== ========== ================ ======================= =================== 00000 atlas3-c05 time ./a.out -lo Done 04/15/2008 23:03:10 
00001 atlas3-c05 time ./a.out -lo Done 04/15/2008 23:03:10 I have a cartesian grid 600x720. Since there's 2 processors, it is partitioned to 600x360. I just use: call MatCreateMPIAIJ(MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k,5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr) call MatSetFromOptions(A_mat,ierr) call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr) call KSPCreate(MPI_COMM_WORLD,ksp,ierr) call VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr) total_k is actually size_x*size_y. Since it's 2d, the maximum values per row is 5. When you says setting off-process values, do you mean I insert values from 1 processor into another? I thought I insert the values into the correct processor... Thank you very much! Matthew Knepley wrote: > 1) Please never cut out parts of the summary. All the information is valuable, > and most times, necessary > > 2) You seem to have huge load imbalance (look at VecNorm). Do you partition > the system yourself. How many processes is this? > > 3) You seem to be setting a huge number of off-process values in the matrix > (see MatAssemblyBegin). Is this true? I would reorganize this part. > > Matt > > On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay wrote: > >> Hi, >> >> I have converted the poisson eqn part of the CFD code to parallel. The grid >> size tested is 600x720. For the momentum eqn, I used another serial linear >> solver (nspcg) to prevent mixing of results. Here's the output summary: >> >> --- Event Stage 0: Main Stage >> >> MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 >> 0.0e+00 10 11100100 0 10 11100100 0 217 >> MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 >> 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 >> MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 >> MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 >> 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0* >> MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 >> 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 >> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 >> 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 >> KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 >> 1.7e+04 89100100100100 89100100100100 317 >> PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 >> 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >> PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 >> 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >> PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 >> 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 >> VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 >> 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 >> *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 >> 8.8e+03 9 2 0 0 51 9 2 0 0 51 42* >> *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 >> 0.0e+00 0 1 0 0 0 0 1 0 0 0 636* >> VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 
0 0 >> VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 >> VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 >> 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 >> VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 >> 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 >> 0.0e+00 0 0100100 0 0 0100100 0 0* >> *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0* >> *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 >> 8.8e+03 9 4 0 0 51 9 4 0 0 51 62* >> >> ------------------------------------------------------------------------------------------------------------------------ >> Memory usage is given in bytes: >> Object Type Creations Destructions Memory Descendants' Mem. >> --- Event Stage 0: Main Stage >> Matrix 4 4 49227380 0 >> Krylov Solver 2 2 17216 0 >> Preconditioner 2 2 256 0 >> Index Set 5 5 2596120 0 >> Vec 40 40 62243224 0 >> Vec Scatter 1 1 0 0 >> ======================================================================================================================== >> Average time to get PetscTime(): 4.05312e-07 Average time >> for MPI_Barrier(): 7.62939e-07 >> Average time for zero size MPI_Send(): 2.02656e-06 >> OptionTable: -log_summary >> >> >> The PETSc manual states that ratio should be close to 1. There's quite a >> few *(in bold)* which are >1 and MatAssemblyBegin seems to be very big. So >> what could be the cause? >> >> I wonder if it has to do the way I insert the matrix. My steps are: >> (cartesian grids, i loop faster than j, fortran) >> >> For matrix A and rhs >> >> Insert left extreme cells values belonging to myid >> >> if (myid==0) then >> >> insert corner cells values >> >> insert south cells values >> >> insert internal cells values >> >> else if (myid==num_procs-1) then >> >> insert corner cells values >> >> insert north cells values >> >> insert internal cells values >> >> else >> >> insert internal cells values >> >> end if >> >> Insert right extreme cells values belonging to myid >> >> All these values are entered into a big_A(size_x*size_y,5) matrix. int_A >> stores the position of the values. I then do >> >> call MatZeroEntries(A_mat,ierr) >> >> do k=ksta_p+1,kend_p !for cells belonging to myid >> >> do kk=1,5 >> >> II=k-1 >> >> JJ=int_A(k,kk)-1 >> >> call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr) >> end do >> >> end do >> >> call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr) >> >> call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr) >> >> >> I wonder if the problem lies here.I used the big_A matrix because I was >> migrating from an old linear solver. Lastly, I was told to widen my window >> to 120 characters. May I know how do I do it? >> >> >> >> Thank you very much. >> >> Matthew Knepley wrote: >> >> >>> On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay wrote: >>> >>> >>> >>>> Hi Matthew, >>>> >>>> I think you've misunderstood what I meant. What I'm trying to say is >>>> initially I've got a serial code. I tried to convert to a parallel one. >>>> >> Then >> >>>> I tested it and it was pretty slow. Due to some work requirement, I need >>>> >> to >> >>>> go back to make some changes to my code. Since the parallel is not >>>> >> working >> >>>> well, I updated and changed the serial one. 
>>>> >>>> Well, that was a while ago and now, due to the updates and changes, the >>>> serial code is different from the old converted parallel code. Some >>>> >> files >> >>>> were also deleted and I can't seem to get it working now. So I thought I >>>> might as well convert the new serial code to parallel. But I'm not very >>>> >> sure >> >>>> what I should do 1st. >>>> >>>> Maybe I should rephrase my question in that if I just convert my >>>> >> poisson >> >>>> equation subroutine from a serial PETSc to a parallel PETSc version, >>>> >> will it >> >>>> work? Should I expect a speedup? The rest of my code is still serial. >>>> >>>> >>>> >>> You should, of course, only expect speedup in the parallel parts >>> >>> Matt >>> >>> >>> >>> >>>> Thank you very much. >>>> >>>> >>>> >>>> Matthew Knepley wrote: >>>> >>>> >>>> >>>> >>>>> I am not sure why you would ever have two codes. I never do this. >>>>> >> PETSc >> >>>>> is designed to write one code to run in serial and parallel. The PETSc >>>>> >>>>> >>>>> >>>> part >>>> >>>> >>>> >>>>> should look identical. To test, run the code yo uhave verified in >>>>> >> serial >> >>>>> >>>> and >>>> >>>> >>>> >>>>> output PETSc data structures (like Mat and Vec) using a binary viewer. >>>>> Then run in parallel with the same code, which will output the same >>>>> structures. Take the two files and write a small verification code >>>>> >> that >> >>>>> loads both versions and calls MatEqual and VecEqual. >>>>> >>>>> Matt >>>>> >>>>> On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote: >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>> Thank you Matthew. Sorry to trouble you again. >>>>>> >>>>>> I tried to run it with -log_summary output and I found that there's >>>>>> >>>>>> >>>>>> >>>> some >>>> >>>> >>>> >>>>>> errors in the execution. Well, I was busy with other things and I >>>>>> >> just >> >>>>>> >>>> came >>>> >>>> >>>> >>>>>> back to this problem. Some of my files on the server has also been >>>>>> >>>>>> >>>>>> >>>> deleted. >>>> >>>> >>>> >>>>>> It has been a while and I remember that it worked before, only >>>>>> >> much >> >>>>>> slower. >>>>>> >>>>>> Anyway, most of the serial code has been updated and maybe it's >>>>>> >> easier >> >>>>>> >>>> to >>>> >>>> >>>> >>>>>> convert the new serial code instead of debugging on the old parallel >>>>>> >>>>>> >>>>>> >>>> code >>>> >>>> >>>> >>>>>> now. I believe I can still reuse part of the old parallel code. >>>>>> >> However, >> >>>>>> >>>> I >>>> >>>> >>>> >>>>>> hope I can approach it better this time. >>>>>> >>>>>> So supposed I need to start converting my new serial code to >>>>>> >> parallel. >> >>>>>> There's 2 eqns to be solved using PETSc, the momentum and poisson. I >>>>>> >>>>>> >>>>>> >>>> also >>>> >>>> >>>> >>>>>> need to parallelize other parts of my code. I wonder which route is >>>>>> >> the >> >>>>>> best: >>>>>> >>>>>> 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, >>>>>> >>>>>> >>>>>> >>>> modify >>>> >>>> >>>> >>>>>> other parts of my code to parallel e.g. looping, updating of values >>>>>> >> etc. >> >>>>>> Once the execution is fine and speedup is reasonable, then modify >>>>>> >> the >> >>>>>> >>>> PETSc >>>> >>>> >>>> >>>>>> part - poisson eqn 1st followed by the momentum eqn. >>>>>> >>>>>> 2. Reverse the above order ie modify the PETSc part - poisson eqn >>>>>> >> 1st >> >>>>>> followed by the momentum eqn. Then do other parts of my code. >>>>>> >>>>>> I'm not sure if the above 2 mtds can work or if there will be >>>>>> >>>>>> >>>>>> >>>> conflicts. 
Of >>>> >>>> >>>> >>>>>> course, an alternative will be: >>>>>> >>>>>> 3. Do the poisson, momentum eqns and other parts of the code >>>>>> >>>>>> >>>>>> >>>> separately. >>>> >>>> >>>> >>>>>> That is, code a standalone parallel poisson eqn and use samples >>>>>> >> values >> >>>>>> >>>> to >>>> >>>> >>>> >>>>>> test it. Same for the momentum and other parts of the code. When >>>>>> >> each of >> >>>>>> them is working, combine them to form the full parallel code. >>>>>> >> However, >> >>>>>> >>>> this >>>> >>>> >>>> >>>>>> will be much more troublesome. >>>>>> >>>>>> I hope someone can give me some recommendations. >>>>>> >>>>>> Thank you once again. >>>>>> >>>>>> >>>>>> >>>>>> Matthew Knepley wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> 1) There is no way to have any idea what is going on in your code >>>>>>> without -log_summary output >>>>>>> >>>>>>> 2) Looking at that output, look at the percentage taken by the >>>>>>> >> solver >> >>>>>>> KSPSolve event. I suspect it is not the biggest component, >>>>>>> >> because >> >>>>>>> it is very scalable. >>>>>>> >>>>>>> Matt >>>>>>> >>>>>>> On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> I've a serial 2D CFD code. As my grid size requirement >>>>>>>> >> increases, >> >>>>>>>> >>>> the >>>> >>>> >>>> >>>>>>>> simulation takes longer. Also, memory requirement becomes a >>>>>>>> >> problem. >> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> Grid >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> size 've reached 1200x1200. Going higher is not possible due to >>>>>>>> >>>>>>>> >>>>>>>> >>>> memory >>>> >>>> >>>> >>>>>>>> problem. >>>>>>>> >>>>>>>> I tried to convert my code to a parallel one, following the >>>>>>>> >> examples >> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> given. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> I also need to restructure parts of my code to enable parallel >>>>>>>> >>>>>>>> >>>>>>>> >>>> looping. >>>> >>>> >>>> >>>>>>>> >>>>>> I >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> 1st changed the PETSc solver to be parallel enabled and then I >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> restructured >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> parts of my code. I proceed on as longer as the answer for a >>>>>>>> >> simple >> >>>>>>>> >>>> test >>>> >>>> >>>> >>>>>>>> case is correct. I thought it's not really possible to do any >>>>>>>> >> speed >> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> testing >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> since the code is not fully parallelized yet. When I finished >>>>>>>> >> during >> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> most of >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> the conversion, I found that in the actual run that it is much >>>>>>>> >>>>>>>> >>>>>>>> >>>> slower, >>>> >>>> >>>> >>>>>>>> although the answer is correct. >>>>>>>> >>>>>>>> So what is the remedy now? I wonder what I should do to check >>>>>>>> >> what's >> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> wrong. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> Must I restart everything again? Btw, my grid size is 1200x1200. >>>>>>>> >> I >> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> believed >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> it should be suitable for parallel run of 4 processors? Is that >>>>>>>> >> so? >> >>>>>>>> Thank you. 
>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >>> >>> >> > > > > From bsmith at mcs.anl.gov Tue Apr 15 11:09:10 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 15 Apr 2008 11:09:10 -0500 Subject: Slow speed after changing from serial to parallel In-Reply-To: <4804D044.2060502@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> Message-ID: It is taking 8776 iterations of GMRES! How many does it take on one process? This is a huge amount. MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e +03 0.0e+00 10 11100100 0 10 11100100 0 217 MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e +00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 One process is spending 2.9 times as long in the embarresingly parallel MatSolve then the other process; this indicates a huge imbalance in the number of nonzeros on each process. As Matt noticed, the partitioning between the two processes is terrible. Barry On Apr 15, 2008, at 10:56 AM, Ben Tay wrote: > Oh sorry here's the whole information. I'm using 2 processors > currently: > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript - > r -fCourier9' to print this document *** > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance > Summary: ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by > g0306332 Tue Apr 15 23:03:09 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST > 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.114e+03 1.00054 1.114e+03 > Objects: 5.400e+01 1.00000 5.400e+01 > Flops: 1.574e+11 1.00000 1.574e+11 3.147e+11 > Flops/sec: 1.414e+08 1.00054 1.413e+08 2.826e+08 > MPI Messages: 8.777e+03 1.00000 8.777e+03 1.755e+04 > MPI Message Lengths: 4.213e+07 1.00000 4.800e+03 8.425e+07 > MPI Reductions: 8.644e+03 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length > N --> 2N flops > and VecAXPY() for complex vectors of > length N --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- > Messages --- -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts > %Total Avg %Total counts %Total > 0: Main Stage: 1.1136e+03 100.0% 3.1475e+11 100.0% 1.755e+04 > 100.0% 4.800e+03 100.0% 1.729e+04 100.0% > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() > and PetscLogStagePop(). 
> %T - percent time in this phase %F - percent flops in > this phase > %M - percent messages in this phase %L - percent message > lengths in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time > over all processors) > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. # > ########################################################## > > > Event Count Time (sec) Flops/ > sec --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg > len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e > +03 0.0e+00 10 11100100 0 10 11100100 0 217 > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e > +00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 > MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e > +00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 > 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0 > MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e > +03 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e > +00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e > +00 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 > KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e > +03 1.7e+04 89100100100100 89100100100100 317 > PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e > +00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e > +00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e > +00 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 > VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e > +00 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 > VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e > +00 8.8e+03 9 2 0 0 51 9 2 0 0 51 42 > VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e > +00 0.0e+00 0 1 0 0 0 0 1 0 0 0 636 > VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 > VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e > +00 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 > VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e > +00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e > 
+03 0.0e+00 0 0100100 0 0 0100100 0 0 > VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e > +00 8.8e+03 9 4 0 0 51 9 4 0 0 51 62 > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' > Mem. > > --- Event Stage 0: Main Stage > > Matrix 4 4 49227380 0 > Krylov Solver 2 2 17216 0 > Preconditioner 2 2 256 0 > Index Set 5 5 2596120 0 > Vec 40 40 62243224 0 > Vec Scatter 1 1 0 0 > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > ====================================================================== > Average time to get PetscTime(): 4.05312e-07 > Average time for MPI_Barrier(): 7.62939e-07 > Average time for zero size MPI_Send(): 2.02656e-06 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > Configure options: --with-memcmp-ok --sizeof_char=1 -- > sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 -- > sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 -- > bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with- > vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/ > g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi- > shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include -- > with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with- > mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun --with-blas-lapack- > dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 > ----------------------------------------- > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed > Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 > Using PETSc arch: atlas3-mpi > ----------------------------------------- > Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. > -fPIC -O ----------------------------------------- > Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 -I/ > nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi -I/nfs/ > home/enduser/g0306332/petsc-2.3.3-p8/include - > I/home/enduser/g0306332/lib/hypre/include -I/usr/local/topspin/mpi/ > mpich/include ------------------------------------------ > Using C linker: mpicc -fPIC -O > Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: -Wl,- > rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -L/ > nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts - > lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib -L/home/enduser/ > g0306332/lib/hypre/lib -lHYPRE -Wl,-rpath,/opt/mvapich/0.9.9/gen2/ > lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/ > 9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,- > rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/ > gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/ > 9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,- > rpath,/usr/lib64 -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib -L/usr/ > local/topspin/mpi/mpich/lib -lmpich -Wl,-rpath,/opt/intel/cmkl/8.1.1/ > lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t > -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/ > lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/ > gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt -Wl,-rpath,/ > opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/ > usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat- > linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo - > lirc -lgcc_s -lirc_s -lmpichf90nc -Wl,-rpath,/opt/mvapich/0.9.9/gen2/ > lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/ > 9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,- > rpath,/usr/lib64 -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/ > fce/9.1.045/lib -lifport -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/ > gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/ > 9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,- > rpath,/usr/lib64 -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,- > rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib - > Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/ > lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib - > Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/ > lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/ > usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/ > local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/ > usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc+ > + -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/ > local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/ > usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,- > rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl > -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 - > libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/ > lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64- > redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,- > rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s - > lirc_s -ldl -lc > ------------------------------------------ > 1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (28major+153248minor)pagefaults 0swaps > 387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (18major+158175minor)pagefaults 0swaps > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > TID HOST_NAME COMMAND_LINE > STATUS 
TERMINATION_TIME > ===== ========== ================ ======================= > =================== > 00000 atlas3-c05 time ./a.out -lo Done > 04/15/2008 23:03:10 > 00001 atlas3-c05 time ./a.out -lo Done > 04/15/2008 23:03:10 > > > I have a cartesian grid 600x720. Since there's 2 processors, it is > partitioned to 600x360. I just use: > > call > MatCreateMPIAIJ > (MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k, > 5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr) > > call MatSetFromOptions(A_mat,ierr) > > call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr) > > call KSPCreate(MPI_COMM_WORLD,ksp,ierr) > > call > VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr) > > total_k is actually size_x*size_y. Since it's 2d, the maximum values > per row is 5. When you says setting off-process values, do you mean > I insert values from 1 processor into another? I thought I insert > the values into the correct processor... > > Thank you very much! > > > > Matthew Knepley wrote: >> 1) Please never cut out parts of the summary. All the information >> is valuable, >> and most times, necessary >> >> 2) You seem to have huge load imbalance (look at VecNorm). Do you >> partition >> the system yourself. How many processes is this? >> >> 3) You seem to be setting a huge number of off-process values in >> the matrix >> (see MatAssemblyBegin). Is this true? I would reorganize this >> part. >> >> Matt >> >> On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay wrote: >> >>> Hi, >>> >>> I have converted the poisson eqn part of the CFD code to parallel. >>> The grid >>> size tested is 600x720. For the momentum eqn, I used another >>> serial linear >>> solver (nspcg) to prevent mixing of results. Here's the output >>> summary: >>> >>> --- Event Stage 0: Main Stage >>> >>> MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 >>> 4.8e+03 >>> 0.0e+00 10 11100100 0 10 11100100 0 217 >>> MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 >>> MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 >>> MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e >>> +00 >>> 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0* >>> MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 >>> 2.4e+03 >>> 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 >>> 0.0e+00 >>> 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 >>> KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 >>> 4.8e+03 >>> 1.7e+04 89100100100100 89100100100100 317 >>> PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 >>> 0.0e+00 >>> 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >>> PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 >>> 0.0e+00 >>> 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >>> PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 >>> VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 >>> 0.0e+00 >>> 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 >>> *VecNorm 8777 1.0 
1.8237e+0210.2 2.13e+0810.2 0.0e+00 >>> 0.0e+00 >>> 8.8e+03 9 2 0 0 51 9 2 0 0 51 42* >>> *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 1 0 0 0 0 1 0 0 0 636* >>> VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 >>> VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 >>> VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 >>> 4.8e+03 >>> 0.0e+00 0 0100100 0 0 0100100 0 0* >>> *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0* >>> *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 >>> 0.0e+00 >>> 8.8e+03 9 4 0 0 51 9 4 0 0 51 62* >>> >>> ------------------------------------------------------------------------------------------------------------------------ >>> Memory usage is given in bytes: >>> Object Type Creations Destructions Memory >>> Descendants' Mem. >>> --- Event Stage 0: Main Stage >>> Matrix 4 4 49227380 0 >>> Krylov Solver 2 2 17216 0 >>> Preconditioner 2 2 256 0 >>> Index Set 5 5 2596120 0 >>> Vec 40 40 62243224 0 >>> Vec Scatter 1 1 0 0 >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> = >>> ==================================================================== >>> Average time to get PetscTime(): 4.05312e-07 >>> Average time >>> for MPI_Barrier(): 7.62939e-07 >>> Average time for zero size MPI_Send(): 2.02656e-06 >>> OptionTable: -log_summary >>> >>> >>> The PETSc manual states that ratio should be close to 1. There's >>> quite a >>> few *(in bold)* which are >1 and MatAssemblyBegin seems to be very >>> big. So >>> what could be the cause? >>> >>> I wonder if it has to do the way I insert the matrix. My steps are: >>> (cartesian grids, i loop faster than j, fortran) >>> >>> For matrix A and rhs >>> >>> Insert left extreme cells values belonging to myid >>> >>> if (myid==0) then >>> >>> insert corner cells values >>> >>> insert south cells values >>> >>> insert internal cells values >>> >>> else if (myid==num_procs-1) then >>> >>> insert corner cells values >>> >>> insert north cells values >>> >>> insert internal cells values >>> >>> else >>> >>> insert internal cells values >>> >>> end if >>> >>> Insert right extreme cells values belonging to myid >>> >>> All these values are entered into a big_A(size_x*size_y,5) matrix. >>> int_A >>> stores the position of the values. 
I then do >>> >>> call MatZeroEntries(A_mat,ierr) >>> >>> do k=ksta_p+1,kend_p !for cells belonging to myid >>> >>> do kk=1,5 >>> >>> II=k-1 >>> >>> JJ=int_A(k,kk)-1 >>> >>> call MatSetValues(A_mat,1,II, >>> 1,JJ,big_A(k,kk),ADD_VALUES,ierr) >>> end do >>> >>> end do >>> >>> call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr) >>> >>> call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr) >>> >>> >>> I wonder if the problem lies here.I used the big_A matrix because >>> I was >>> migrating from an old linear solver. Lastly, I was told to widen >>> my window >>> to 120 characters. May I know how do I do it? >>> >>> >>> >>> Thank you very much. >>> >>> Matthew Knepley wrote: >>> >>> >>>> On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay wrote: >>>> >>>> >>>> >>>>> Hi Matthew, >>>>> >>>>> I think you've misunderstood what I meant. What I'm trying to >>>>> say is >>>>> initially I've got a serial code. I tried to convert to a >>>>> parallel one. >>>>> >>> Then >>> >>>>> I tested it and it was pretty slow. Due to some work >>>>> requirement, I need >>>>> >>> to >>> >>>>> go back to make some changes to my code. Since the parallel is not >>>>> >>> working >>> >>>>> well, I updated and changed the serial one. >>>>> >>>>> Well, that was a while ago and now, due to the updates and >>>>> changes, the >>>>> serial code is different from the old converted parallel code. >>>>> Some >>>>> >>> files >>> >>>>> were also deleted and I can't seem to get it working now. So I >>>>> thought I >>>>> might as well convert the new serial code to parallel. But I'm >>>>> not very >>>>> >>> sure >>> >>>>> what I should do 1st. >>>>> >>>>> Maybe I should rephrase my question in that if I just convert my >>>>> >>> poisson >>> >>>>> equation subroutine from a serial PETSc to a parallel PETSc >>>>> version, >>>>> >>> will it >>> >>>>> work? Should I expect a speedup? The rest of my code is still >>>>> serial. >>>>> >>>>> >>>>> >>>> You should, of course, only expect speedup in the parallel parts >>>> >>>> Matt >>>> >>>> >>>> >>>> >>>>> Thank you very much. >>>>> >>>>> >>>>> >>>>> Matthew Knepley wrote: >>>>> >>>>> >>>>> >>>>> >>>>>> I am not sure why you would ever have two codes. I never do this. >>>>>> >>> PETSc >>> >>>>>> is designed to write one code to run in serial and parallel. >>>>>> The PETSc >>>>>> >>>>>> >>>>>> >>>>> part >>>>> >>>>> >>>>> >>>>>> should look identical. To test, run the code yo uhave verified in >>>>>> >>> serial >>> >>>>>> >>>>> and >>>>> >>>>> >>>>> >>>>>> output PETSc data structures (like Mat and Vec) using a binary >>>>>> viewer. >>>>>> Then run in parallel with the same code, which will output the >>>>>> same >>>>>> structures. Take the two files and write a small verification >>>>>> code >>>>>> >>> that >>> >>>>>> loads both versions and calls MatEqual and VecEqual. >>>>>> >>>>>> Matt >>>>>> >>>>>> On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay >>>>>> wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> Thank you Matthew. Sorry to trouble you again. >>>>>>> >>>>>>> I tried to run it with -log_summary output and I found that >>>>>>> there's >>>>>>> >>>>>>> >>>>>>> >>>>> some >>>>> >>>>> >>>>> >>>>>>> errors in the execution. Well, I was busy with other things >>>>>>> and I >>>>>>> >>> just >>> >>>>>>> >>>>> came >>>>> >>>>> >>>>> >>>>>>> back to this problem. Some of my files on the server has also >>>>>>> been >>>>>>> >>>>>>> >>>>>>> >>>>> deleted. >>>>> >>>>> >>>>> >>>>>>> It has been a while and I remember that it worked before, only >>>>>>> >>> much >>> >>>>>>> slower. 
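(Returning to the insertion loop quoted above: a small diagnostic sketch, not from the original code, that one could place just before the MatSetValues loop to count how many rows fall outside the local ownership range of A_mat. Rows outside [Istart, Iend) are stashed and shipped to the owning process during MatAssemblyBegin, which is the usual cause of a very large MatAssemblyBegin time. Istart, Iend and noff are placeholder integer/PetscInt names; everything else is as in the loop above.)

      call MatGetOwnershipRange(A_mat, Istart, Iend, ierr)
      noff = 0
      do k = ksta_p+1, kend_p
         II = k - 1
         if (II .lt. Istart .or. II .ge. Iend) then
            noff = noff + 1      ! this row is owned by another process
         end if
      end do
      write(*,*) 'rank ', myid, ': rows set off-process = ', noff

If noff comes out nonzero on any rank, the loop bounds ksta_p/kend_p no longer match the matrix layout, and the corresponding rows should be generated on the rank that owns them (or the local sizes passed to MatCreateMPIAIJ explicitly instead of PETSC_DECIDE).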
>>>>>>> >>>>>>> Anyway, most of the serial code has been updated and maybe it's >>>>>>> >>> easier >>> >>>>>>> >>>>> to >>>>> >>>>> >>>>> >>>>>>> convert the new serial code instead of debugging on the old >>>>>>> parallel >>>>>>> >>>>>>> >>>>>>> >>>>> code >>>>> >>>>> >>>>> >>>>>>> now. I believe I can still reuse part of the old parallel code. >>>>>>> >>> However, >>> >>>>>>> >>>>> I >>>>> >>>>> >>>>> >>>>>>> hope I can approach it better this time. >>>>>>> >>>>>>> So supposed I need to start converting my new serial code to >>>>>>> >>> parallel. >>> >>>>>>> There's 2 eqns to be solved using PETSc, the momentum and >>>>>>> poisson. I >>>>>>> >>>>>>> >>>>>>> >>>>> also >>>>> >>>>> >>>>> >>>>>>> need to parallelize other parts of my code. I wonder which >>>>>>> route is >>>>>>> >>> the >>> >>>>>>> best: >>>>>>> >>>>>>> 1. Don't change the PETSc part ie continue using >>>>>>> PETSC_COMM_SELF, >>>>>>> >>>>>>> >>>>>>> >>>>> modify >>>>> >>>>> >>>>> >>>>>>> other parts of my code to parallel e.g. looping, updating of >>>>>>> values >>>>>>> >>> etc. >>> >>>>>>> Once the execution is fine and speedup is reasonable, then >>>>>>> modify >>>>>>> >>> the >>> >>>>>>> >>>>> PETSc >>>>> >>>>> >>>>> >>>>>>> part - poisson eqn 1st followed by the momentum eqn. >>>>>>> >>>>>>> 2. Reverse the above order ie modify the PETSc part - poisson >>>>>>> eqn >>>>>>> >>> 1st >>> >>>>>>> followed by the momentum eqn. Then do other parts of my code. >>>>>>> >>>>>>> I'm not sure if the above 2 mtds can work or if there will be >>>>>>> >>>>>>> >>>>>>> >>>>> conflicts. Of >>>>> >>>>> >>>>> >>>>>>> course, an alternative will be: >>>>>>> >>>>>>> 3. Do the poisson, momentum eqns and other parts of the code >>>>>>> >>>>>>> >>>>>>> >>>>> separately. >>>>> >>>>> >>>>> >>>>>>> That is, code a standalone parallel poisson eqn and use samples >>>>>>> >>> values >>> >>>>>>> >>>>> to >>>>> >>>>> >>>>> >>>>>>> test it. Same for the momentum and other parts of the code. When >>>>>>> >>> each of >>> >>>>>>> them is working, combine them to form the full parallel code. >>>>>>> >>> However, >>> >>>>>>> >>>>> this >>>>> >>>>> >>>>> >>>>>>> will be much more troublesome. >>>>>>> >>>>>>> I hope someone can give me some recommendations. >>>>>>> >>>>>>> Thank you once again. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Matthew Knepley wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> 1) There is no way to have any idea what is going on in your >>>>>>>> code >>>>>>>> without -log_summary output >>>>>>>> >>>>>>>> 2) Looking at that output, look at the percentage taken by the >>>>>>>> >>> solver >>> >>>>>>>> KSPSolve event. I suspect it is not the biggest component, >>>>>>>> >>> because >>> >>>>>>>> it is very scalable. >>>>>>>> >>>>>>>> Matt >>>>>>>> >>>>>>>> On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay >>>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> I've a serial 2D CFD code. As my grid size requirement >>>>>>>>> >>> increases, >>> >>>>>>>>> >>>>> the >>>>> >>>>> >>>>> >>>>>>>>> simulation takes longer. Also, memory requirement becomes a >>>>>>>>> >>> problem. >>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> Grid >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> size 've reached 1200x1200. Going higher is not possible due >>>>>>>>> to >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>> memory >>>>> >>>>> >>>>> >>>>>>>>> problem. >>>>>>>>> >>>>>>>>> I tried to convert my code to a parallel one, following the >>>>>>>>> >>> examples >>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> given. 
>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> I also need to restructure parts of my code to enable parallel >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>> looping. >>>>> >>>>> >>>>> >>>>>>>>> >>>>>>> I >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> 1st changed the PETSc solver to be parallel enabled and then I >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> restructured >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> parts of my code. I proceed on as longer as the answer for a >>>>>>>>> >>> simple >>> >>>>>>>>> >>>>> test >>>>> >>>>> >>>>> >>>>>>>>> case is correct. I thought it's not really possible to do any >>>>>>>>> >>> speed >>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> testing >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> since the code is not fully parallelized yet. When I finished >>>>>>>>> >>> during >>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> most of >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> the conversion, I found that in the actual run that it is much >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>> slower, >>>>> >>>>> >>>>> >>>>>>>>> although the answer is correct. >>>>>>>>> >>>>>>>>> So what is the remedy now? I wonder what I should do to check >>>>>>>>> >>> what's >>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> wrong. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> Must I restart everything again? Btw, my grid size is >>>>>>>>> 1200x1200. >>>>>>>>> >>> I >>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> believed >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> it should be suitable for parallel run of 4 processors? Is >>>>>>>>> that >>>>>>>>> >>> so? >>> >>>>>>>>> Thank you. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> >>>> >>>> >>> >> >> >> >> > From zonexo at gmail.com Tue Apr 15 11:44:17 2008 From: zonexo at gmail.com (Ben Tay) Date: Wed, 16 Apr 2008 00:44:17 +0800 Subject: Slow speed after changing from serial to parallel In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> Message-ID: <4804DB61.3080906@gmail.com> Hi, Here's the summary for 1 processor. Seems like it's also using a long time... Can someone tell me when my mistakes possibly lie? Thank you very much! ************************************************************************************************************************ *** WIDEN YOUR WINDOW TO 120 CHARACTERS. 
Use 'enscript -r -fCourier9' to print this document *** ************************************************************************************************************************ ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 Wed Apr 16 00:39:22 2008 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b Max Max/Min Avg Total Time (sec): 1.088e+03 1.00000 1.088e+03 Objects: 4.300e+01 1.00000 4.300e+01 Flops: 2.658e+11 1.00000 2.658e+11 2.658e+11 Flops/sec: 2.444e+08 1.00000 2.444e+08 2.444e+08 MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Reductions: 1.460e+04 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 1.0877e+03 100.0% 2.6584e+11 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 1.460e+04 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops/sec: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). %T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ ########################################################## # # # WARNING!!! # # # # This code was run without the PreLoadBegin() # # macros. To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. # # preloading. otherwise timing numbers may be # # meaningless. 
# ########################################################## Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 7412 1.0 1.3344e+02 1.0 2.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 11 0 0 0 12 11 0 0 0 216 MatSolve 7413 1.0 2.6851e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 0.0e+00 25 11 0 0 0 25 11 0 0 0 107 MatLUFactorNum 1 1.0 4.3947e-02 1.0 8.83e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 88 MatILUFactorSym 1 1.0 3.7798e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyBegin 1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyEnd 1 1.0 2.5835e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 1 1.0 6.0391e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatZeroEntries 1 1.0 1.7377e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPGMRESOrthog 7173 1.0 5.6323e+02 1.0 3.41e+08 1.0 0.0e+00 0.0e+00 7.2e+03 52 72 0 0 49 52 72 0 0 49 341 KSPSetup 1 1.0 1.2676e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 1 1.0 1.0144e+03 1.0 2.62e+08 1.0 0.0e+00 0.0e+00 1.5e+04 93100 0 0100 93100 0 0100 262 PCSetUp 1 1.0 8.7809e-02 1.0 4.42e+07 1.0 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 44 PCApply 7413 1.0 2.6853e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 0.0e+00 25 11 0 0 0 25 11 0 0 0 107 VecMDot 7173 1.0 2.6720e+02 1.0 3.59e+08 1.0 0.0e+00 0.0e+00 7.2e+03 25 36 0 0 49 25 36 0 0 49 359 VecNorm 7413 1.0 1.7125e+01 1.0 3.74e+08 1.0 0.0e+00 0.0e+00 7.4e+03 2 2 0 0 51 2 2 0 0 51 374 VecScale 7413 1.0 9.2787e+00 1.0 3.45e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1 1 0 0 0 345 VecCopy 240 1.0 5.1628e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 241 1.0 6.4428e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAXPY 479 1.0 2.0082e+00 1.0 2.06e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 206 VecMAXPY 7413 1.0 3.1536e+02 1.0 3.24e+08 1.0 0.0e+00 0.0e+00 0.0e+00 29 38 0 0 0 29 38 0 0 0 324 VecAssemblyBegin 2 1.0 2.3127e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAssemblyEnd 2 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecNormalize 7413 1.0 2.6424e+01 1.0 3.64e+08 1.0 0.0e+00 0.0e+00 7.4e+03 2 4 0 0 51 2 4 0 0 51 364 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. 
--- Event Stage 0: Main Stage Matrix 2 2 65632332 0 Krylov Solver 1 1 17216 0 Preconditioner 1 1 168 0 Index Set 3 3 5185032 0 Vec 36 36 120987640 0 ======================================================================================================================== Average time to get PetscTime(): 3.09944e-07 OptionTable: -log_summary Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 Configure run at: Tue Jan 8 22:22:08 2008 sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 Configure run at: Tue Jan 8 22:22:08 2008 Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi-shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with-mpirun=/usr/local/topspin/mpi/mpich/bi n/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 ----------------------------------------- Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 Using PETSc arch: atlas3-mpi ----------------------------------------- Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. -fPIC -O ----------------------------------------- Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include - I/home/enduser/g0306332/lib/hypre/include -I/usr/local/topspin/mpi/mpich/include ------------------------------------------ Using C linker: mpicc -fPIC -O Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib -L/usr/local/topspin/mpi/mpich/lib -lmpich -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc ------------------------------------------ 639.52user 4.80system 18:08.23elapsed 59%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (20major+172979minor)pagefaults 0swaps Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME ===== ========== ================ ======================= =================== 00000 atlas3-c45 time ./a.out -lo Done 04/16/2008 00:39:23 Barry Smith wrote: > > It is taking 8776 iterations of GMRES! How many does it take on one > process? This is a huge > amount. 
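(For the iteration-count comparison asked about here, a minimal Fortran sketch, not taken from the code in this thread: KSPGetIterationNumber reports how many iterations the last KSPSolve took, so the same print statement can be used in the 1-process and 2-process runs. The variable its is a placeholder integer/PetscInt; running with -ksp_monitor gives the residual history as well.)

      PetscInt its

      call KSPGetIterationNumber(ksp, its, ierr)
      write(*,*) 'rank ', myid, ': KSP iterations = ', its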
> > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 > 4.8e+03 0.0e+00 10 11100100 0 10 11100100 0 217 > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 > 0.0e+00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > > One process is spending 2.9 times as long in the embarresingly > parallel MatSolve then the other process; > this indicates a huge imbalance in the number of nonzeros on each > process. As Matt noticed, the partitioning > between the two processes is terrible. > > Barry > > On Apr 15, 2008, at 10:56 AM, Ben Tay wrote: >> Oh sorry here's the whole information. I'm using 2 processors currently: >> >> ************************************************************************************************************************ >> >> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript >> -r -fCourier9' to print this document *** >> ************************************************************************************************************************ >> >> >> ---------------------------------------------- PETSc Performance >> Summary: ---------------------------------------------- >> >> ./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by >> g0306332 Tue Apr 15 23:03:09 2008 >> Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST >> 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b >> >> Max Max/Min Avg Total >> Time (sec): 1.114e+03 1.00054 1.114e+03 >> Objects: 5.400e+01 1.00000 5.400e+01 >> Flops: 1.574e+11 1.00000 1.574e+11 3.147e+11 >> Flops/sec: 1.414e+08 1.00054 1.413e+08 2.826e+08 >> MPI Messages: 8.777e+03 1.00000 8.777e+03 1.755e+04 >> MPI Message Lengths: 4.213e+07 1.00000 4.800e+03 8.425e+07 >> MPI Reductions: 8.644e+03 1.00000 >> >> Flop counting convention: 1 flop = 1 real number operation of type >> (multiply/divide/add/subtract) >> e.g., VecAXPY() for real vectors of length >> N --> 2N flops >> and VecAXPY() for complex vectors of length >> N --> 8N flops >> >> Summary of Stages: ----- Time ------ ----- Flops ----- --- >> Messages --- -- Message Lengths -- -- Reductions -- >> Avg %Total Avg %Total counts >> %Total Avg %Total counts %Total >> 0: Main Stage: 1.1136e+03 100.0% 3.1475e+11 100.0% 1.755e+04 >> 100.0% 4.800e+03 100.0% 1.729e+04 100.0% >> >> ------------------------------------------------------------------------------------------------------------------------ >> >> See the 'Profiling' chapter of the users' manual for details on >> interpreting output. >> Phase summary info: >> Count: number of times phase was executed >> Time and Flops/sec: Max - maximum over all processors >> Ratio - ratio of maximum to minimum over all >> processors >> Mess: number of messages sent >> Avg. len: average message length >> Reduct: number of global reductions >> Global: entire computation >> Stage: stages of a computation. Set stages with PetscLogStagePush() >> and PetscLogStagePop(). >> %T - percent time in this phase %F - percent flops in >> this phase >> %M - percent messages in this phase %L - percent message >> lengths in this phase >> %R - percent reductions in this phase >> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time >> over all processors) >> ------------------------------------------------------------------------------------------------------------------------ >> >> >> >> ########################################################## >> # # >> # WARNING!!! # >> # # >> # This code was run without the PreLoadBegin() # >> # macros. To get timing results we always recommend # >> # preloading. 
otherwise timing numbers may be # >> # meaningless. # >> ########################################################## >> >> >> Event Count Time (sec) >> Flops/sec --- Global --- --- Stage --- Total >> Max Ratio Max Ratio Max Ratio Mess Avg >> len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s >> ------------------------------------------------------------------------------------------------------------------------ >> >> >> --- Event Stage 0: Main Stage >> >> MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 >> 4.8e+03 0.0e+00 10 11100100 0 10 11100100 0 217 >> MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 >> 0.0e+00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 >> MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 >> MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 >> 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 >> 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0 >> MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 >> 2.4e+03 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 >> 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 >> 0.0e+00 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 >> KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 >> 4.8e+03 1.7e+04 89100100100100 89100100100100 317 >> PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 >> 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >> PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 >> 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >> PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 >> 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 >> VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 >> 0.0e+00 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 >> VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 >> 0.0e+00 8.8e+03 9 2 0 0 51 9 2 0 0 51 42 >> VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 >> 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 636 >> VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 >> VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 >> 0.0e+00 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 >> VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 >> 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 >> 4.8e+03 0.0e+00 0 0100100 0 0 0100100 0 0 >> VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 >> 0.0e+00 8.8e+03 9 4 0 0 51 9 4 0 0 51 62 >> ------------------------------------------------------------------------------------------------------------------------ >> >> >> Memory usage is given in bytes: >> >> Object Type Creations Destructions Memory Descendants' >> Mem. 
>> >> --- Event Stage 0: Main Stage >> >> Matrix 4 4 49227380 0 >> Krylov Solver 2 2 17216 0 >> Preconditioner 2 2 256 0 >> Index Set 5 5 2596120 0 >> Vec 40 40 62243224 0 >> Vec Scatter 1 1 0 0 >> ======================================================================================================================== >> >> Average time to get PetscTime(): 4.05312e-07 >> Average time for MPI_Barrier(): 7.62939e-07 >> Average time for zero size MPI_Send(): 2.02656e-06 >> OptionTable: -log_summary >> Compiled without FORTRAN kernels >> Compiled with full precision matrices (default) >> Compiled without FORTRAN kernels >> Compiled with full precision matrices (default) >> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 >> sizeof(PetscScalar) 8 >> Configure run at: Tue Jan 8 22:22:08 2008 >> Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 >> --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 >> --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 >> --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-vendor-compilers=intel >> --with-x=0 --with-hypre-dir=/home/enduser/g0306332/lib/hypre >> --with-debugging=0 --with-batch=1 --with-mpi-shared=0 >> --with-mpi-include=/usr/local/topspin/mpi/mpich/include >> --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a >> --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun >> --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 >> ----------------------------------------- >> Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 >> Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed >> Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux >> Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 >> Using PETSc arch: atlas3-mpi >> ----------------------------------------- >> Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. >> -fPIC -O ----------------------------------------- >> Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 >> -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi >> -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include - >> I/home/enduser/g0306332/lib/hypre/include >> -I/usr/local/topspin/mpi/mpich/include >> ------------------------------------------ >> Using C linker: mpicc -fPIC -O >> Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: >> -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi >> -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts >> -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc >> -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib >> -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE >> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard >> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib >> -L/usr/local/topspin/mpi/mpich/lib -lmpich >> -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t >> -L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide >> -lpthread -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib >> -ldl -lmpich -libverbs -libumad -lpthread -lrt >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 >> -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc >> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/intel/fce/9.1.045/lib >> -L/opt/intel/fce/9.1.045/lib -lifport -lifcore -lm >> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard >> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard >> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -Wl,-rpath,/usr/local/ofed/lib64 >> -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib >> -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich >> -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs >> -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib >> -L/opt/intel/cce/9.1.049/lib >> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ >> -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 >> -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc >> ------------------------------------------ >> 1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata >> 0maxresident)k >> 0inputs+0outputs (28major+153248minor)pagefaults 0swaps >> 387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata >> 0maxresident)k >> 0inputs+0outputs (18major+158175minor)pagefaults 0swaps >> Job /usr/lsf62/bin/mvapich_wrapper time 
./a.out -log_summary >> TID HOST_NAME COMMAND_LINE >> STATUS TERMINATION_TIME >> ===== ========== ================ ======================= >> =================== >> 00000 atlas3-c05 time ./a.out -lo Done >> 04/15/2008 23:03:10 >> 00001 atlas3-c05 time ./a.out -lo Done >> 04/15/2008 23:03:10 >> >> >> I have a cartesian grid 600x720. Since there's 2 processors, it is >> partitioned to 600x360. I just use: >> >> call >> MatCreateMPIAIJ(MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k,5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr) >> >> >> call MatSetFromOptions(A_mat,ierr) >> >> call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr) >> >> call KSPCreate(MPI_COMM_WORLD,ksp,ierr) >> >> call >> VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr) >> >> total_k is actually size_x*size_y. Since it's 2d, the maximum values >> per row is 5. When you says setting off-process values, do you mean I >> insert values from 1 processor into another? I thought I insert the >> values into the correct processor... >> >> Thank you very much! >> >> >> >> Matthew Knepley wrote: >>> 1) Please never cut out parts of the summary. All the information is >>> valuable, >>> and most times, necessary >>> >>> 2) You seem to have huge load imbalance (look at VecNorm). Do you >>> partition >>> the system yourself. How many processes is this? >>> >>> 3) You seem to be setting a huge number of off-process values in the >>> matrix >>> (see MatAssemblyBegin). Is this true? I would reorganize this part. >>> >>> Matt >>> >>> On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay wrote: >>> >>>> Hi, >>>> >>>> I have converted the poisson eqn part of the CFD code to parallel. >>>> The grid >>>> size tested is 600x720. For the momentum eqn, I used another serial >>>> linear >>>> solver (nspcg) to prevent mixing of results. 
Here's the output >>>> summary: >>>> >>>> --- Event Stage 0: Main Stage >>>> >>>> MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 >>>> 4.8e+03 >>>> 0.0e+00 10 11100100 0 10 11100100 0 217 >>>> MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 >>>> MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 >>>> MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0* >>>> MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 >>>> 2.4e+03 >>>> 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 >>>> 0.0e+00 >>>> 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 >>>> KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 >>>> 4.8e+03 >>>> 1.7e+04 89100100100100 89100100100100 317 >>>> PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 >>>> 0.0e+00 >>>> 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >>>> PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 >>>> 0.0e+00 >>>> 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 >>>> PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 >>>> VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 >>>> 0.0e+00 >>>> 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 >>>> *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 >>>> 0.0e+00 >>>> 8.8e+03 9 2 0 0 51 9 2 0 0 51 42* >>>> *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 1 0 0 0 0 1 0 0 0 636* >>>> VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>>> VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 >>>> VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 >>>> VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>>> *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 >>>> 4.8e+03 >>>> 0.0e+00 0 0100100 0 0 0100100 0 0* >>>> *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 >>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0* >>>> *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 >>>> 0.0e+00 >>>> 8.8e+03 9 4 0 0 51 9 4 0 0 51 62* >>>> >>>> ------------------------------------------------------------------------------------------------------------------------ >>>> >>>> Memory usage is given in bytes: >>>> Object Type Creations Destructions Memory >>>> Descendants' Mem. 
>>>> --- Event Stage 0: Main Stage >>>> Matrix 4 4 49227380 0 >>>> Krylov Solver 2 2 17216 0 >>>> Preconditioner 2 2 256 0 >>>> Index Set 5 5 2596120 0 >>>> Vec 40 40 62243224 0 >>>> Vec Scatter 1 1 0 0 >>>> ======================================================================================================================== >>>> >>>> Average time to get PetscTime(): 4.05312e-07 >>>> Average time >>>> for MPI_Barrier(): 7.62939e-07 >>>> Average time for zero size MPI_Send(): 2.02656e-06 >>>> OptionTable: -log_summary >>>> >>>> >>>> The PETSc manual states that ratio should be close to 1. There's >>>> quite a >>>> few *(in bold)* which are >1 and MatAssemblyBegin seems to be very >>>> big. So >>>> what could be the cause? >>>> >>>> I wonder if it has to do the way I insert the matrix. My steps are: >>>> (cartesian grids, i loop faster than j, fortran) >>>> >>>> For matrix A and rhs >>>> >>>> Insert left extreme cells values belonging to myid >>>> >>>> if (myid==0) then >>>> >>>> insert corner cells values >>>> >>>> insert south cells values >>>> >>>> insert internal cells values >>>> >>>> else if (myid==num_procs-1) then >>>> >>>> insert corner cells values >>>> >>>> insert north cells values >>>> >>>> insert internal cells values >>>> >>>> else >>>> >>>> insert internal cells values >>>> >>>> end if >>>> >>>> Insert right extreme cells values belonging to myid >>>> >>>> All these values are entered into a big_A(size_x*size_y,5) matrix. >>>> int_A >>>> stores the position of the values. I then do >>>> >>>> call MatZeroEntries(A_mat,ierr) >>>> >>>> do k=ksta_p+1,kend_p !for cells belonging to myid >>>> >>>> do kk=1,5 >>>> >>>> II=k-1 >>>> >>>> JJ=int_A(k,kk)-1 >>>> >>>> call >>>> MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr) >>>> end do >>>> >>>> end do >>>> >>>> call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr) >>>> >>>> call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr) >>>> >>>> >>>> I wonder if the problem lies here.I used the big_A matrix because I >>>> was >>>> migrating from an old linear solver. Lastly, I was told to widen my >>>> window >>>> to 120 characters. May I know how do I do it? >>>> >>>> >>>> >>>> Thank you very much. >>>> >>>> Matthew Knepley wrote: >>>> >>>> >>>>> On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay wrote: >>>>> >>>>> >>>>> >>>>>> Hi Matthew, >>>>>> >>>>>> I think you've misunderstood what I meant. What I'm trying to say is >>>>>> initially I've got a serial code. I tried to convert to a >>>>>> parallel one. >>>>>> >>>> Then >>>> >>>>>> I tested it and it was pretty slow. Due to some work requirement, >>>>>> I need >>>>>> >>>> to >>>> >>>>>> go back to make some changes to my code. Since the parallel is not >>>>>> >>>> working >>>> >>>>>> well, I updated and changed the serial one. >>>>>> >>>>>> Well, that was a while ago and now, due to the updates and >>>>>> changes, the >>>>>> serial code is different from the old converted parallel code. Some >>>>>> >>>> files >>>> >>>>>> were also deleted and I can't seem to get it working now. So I >>>>>> thought I >>>>>> might as well convert the new serial code to parallel. But I'm >>>>>> not very >>>>>> >>>> sure >>>> >>>>>> what I should do 1st. >>>>>> >>>>>> Maybe I should rephrase my question in that if I just convert my >>>>>> >>>> poisson >>>> >>>>>> equation subroutine from a serial PETSc to a parallel PETSc version, >>>>>> >>>> will it >>>> >>>>>> work? Should I expect a speedup? The rest of my code is still >>>>>> serial. 
>>>>>> >>>>>> >>>>>> >>>>> You should, of course, only expect speedup in the parallel parts >>>>> >>>>> Matt >>>>> >>>>> >>>>> >>>>> >>>>>> Thank you very much. >>>>>> >>>>>> >>>>>> >>>>>> Matthew Knepley wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> I am not sure why you would ever have two codes. I never do this. >>>>>>> >>>> PETSc >>>> >>>>>>> is designed to write one code to run in serial and parallel. The >>>>>>> PETSc >>>>>>> >>>>>>> >>>>>>> >>>>>> part >>>>>> >>>>>> >>>>>> >>>>>>> should look identical. To test, run the code yo uhave verified in >>>>>>> >>>> serial >>>> >>>>>>> >>>>>> and >>>>>> >>>>>> >>>>>> >>>>>>> output PETSc data structures (like Mat and Vec) using a binary >>>>>>> viewer. >>>>>>> Then run in parallel with the same code, which will output the same >>>>>>> structures. Take the two files and write a small verification code >>>>>>> >>>> that >>>> >>>>>>> loads both versions and calls MatEqual and VecEqual. >>>>>>> >>>>>>> Matt >>>>>>> >>>>>>> On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Thank you Matthew. Sorry to trouble you again. >>>>>>>> >>>>>>>> I tried to run it with -log_summary output and I found that >>>>>>>> there's >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> some >>>>>> >>>>>> >>>>>> >>>>>>>> errors in the execution. Well, I was busy with other things and I >>>>>>>> >>>> just >>>> >>>>>>>> >>>>>> came >>>>>> >>>>>> >>>>>> >>>>>>>> back to this problem. Some of my files on the server has also been >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> deleted. >>>>>> >>>>>> >>>>>> >>>>>>>> It has been a while and I remember that it worked before, only >>>>>>>> >>>> much >>>> >>>>>>>> slower. >>>>>>>> >>>>>>>> Anyway, most of the serial code has been updated and maybe it's >>>>>>>> >>>> easier >>>> >>>>>>>> >>>>>> to >>>>>> >>>>>> >>>>>> >>>>>>>> convert the new serial code instead of debugging on the old >>>>>>>> parallel >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> code >>>>>> >>>>>> >>>>>> >>>>>>>> now. I believe I can still reuse part of the old parallel code. >>>>>>>> >>>> However, >>>> >>>>>>>> >>>>>> I >>>>>> >>>>>> >>>>>> >>>>>>>> hope I can approach it better this time. >>>>>>>> >>>>>>>> So supposed I need to start converting my new serial code to >>>>>>>> >>>> parallel. >>>> >>>>>>>> There's 2 eqns to be solved using PETSc, the momentum and >>>>>>>> poisson. I >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> also >>>>>> >>>>>> >>>>>> >>>>>>>> need to parallelize other parts of my code. I wonder which >>>>>>>> route is >>>>>>>> >>>> the >>>> >>>>>>>> best: >>>>>>>> >>>>>>>> 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> modify >>>>>> >>>>>> >>>>>> >>>>>>>> other parts of my code to parallel e.g. looping, updating of >>>>>>>> values >>>>>>>> >>>> etc. >>>> >>>>>>>> Once the execution is fine and speedup is reasonable, then modify >>>>>>>> >>>> the >>>> >>>>>>>> >>>>>> PETSc >>>>>> >>>>>> >>>>>> >>>>>>>> part - poisson eqn 1st followed by the momentum eqn. >>>>>>>> >>>>>>>> 2. Reverse the above order ie modify the PETSc part - poisson eqn >>>>>>>> >>>> 1st >>>> >>>>>>>> followed by the momentum eqn. Then do other parts of my code. >>>>>>>> >>>>>>>> I'm not sure if the above 2 mtds can work or if there will be >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> conflicts. Of >>>>>> >>>>>> >>>>>> >>>>>>>> course, an alternative will be: >>>>>>>> >>>>>>>> 3. Do the poisson, momentum eqns and other parts of the code >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> separately. 
>>>>>> >>>>>> >>>>>> >>>>>>>> That is, code a standalone parallel poisson eqn and use samples >>>>>>>> >>>> values >>>> >>>>>>>> >>>>>> to >>>>>> >>>>>> >>>>>> >>>>>>>> test it. Same for the momentum and other parts of the code. When >>>>>>>> >>>> each of >>>> >>>>>>>> them is working, combine them to form the full parallel code. >>>>>>>> >>>> However, >>>> >>>>>>>> >>>>>> this >>>>>> >>>>>> >>>>>> >>>>>>>> will be much more troublesome. >>>>>>>> >>>>>>>> I hope someone can give me some recommendations. >>>>>>>> >>>>>>>> Thank you once again. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Matthew Knepley wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> 1) There is no way to have any idea what is going on in your code >>>>>>>>> without -log_summary output >>>>>>>>> >>>>>>>>> 2) Looking at that output, look at the percentage taken by the >>>>>>>>> >>>> solver >>>> >>>>>>>>> KSPSolve event. I suspect it is not the biggest component, >>>>>>>>> >>>> because >>>> >>>>>>>>> it is very scalable. >>>>>>>>> >>>>>>>>> Matt >>>>>>>>> >>>>>>>>> On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I've a serial 2D CFD code. As my grid size requirement >>>>>>>>>> >>>> increases, >>>> >>>>>>>>>> >>>>>> the >>>>>> >>>>>> >>>>>> >>>>>>>>>> simulation takes longer. Also, memory requirement becomes a >>>>>>>>>> >>>> problem. >>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> Grid >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> size 've reached 1200x1200. Going higher is not possible due to >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>> memory >>>>>> >>>>>> >>>>>> >>>>>>>>>> problem. >>>>>>>>>> >>>>>>>>>> I tried to convert my code to a parallel one, following the >>>>>>>>>> >>>> examples >>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> given. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> I also need to restructure parts of my code to enable parallel >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>> looping. >>>>>> >>>>>> >>>>>> >>>>>>>>>> >>>>>>>> I >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> 1st changed the PETSc solver to be parallel enabled and then I >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> restructured >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> parts of my code. I proceed on as longer as the answer for a >>>>>>>>>> >>>> simple >>>> >>>>>>>>>> >>>>>> test >>>>>> >>>>>> >>>>>> >>>>>>>>>> case is correct. I thought it's not really possible to do any >>>>>>>>>> >>>> speed >>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> testing >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> since the code is not fully parallelized yet. When I finished >>>>>>>>>> >>>> during >>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> most of >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> the conversion, I found that in the actual run that it is much >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>> slower, >>>>>> >>>>>> >>>>>> >>>>>>>>>> although the answer is correct. >>>>>>>>>> >>>>>>>>>> So what is the remedy now? I wonder what I should do to check >>>>>>>>>> >>>> what's >>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> wrong. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> Must I restart everything again? Btw, my grid size is 1200x1200. >>>>>>>>>> >>>> I >>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> believed >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> it should be suitable for parallel run of 4 processors? Is that >>>>>>>>>> >>>> so? 
>>>> >>>>>>>>>> Thank you. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >>> >>> >> > > From knepley at gmail.com Tue Apr 15 12:33:46 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 12:33:46 -0500 Subject: Slow speed after changing from serial to parallel In-Reply-To: <4804DB61.3080906@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> Message-ID: The convergence here is jsut horrendous. Have you tried using LU to check your implementation? All the time is in the solve right now. I would first try a direct method (at least on a small problem) and then try to understand the convergence behavior. MUMPS can actually scale very well for big problems. Matt On Tue, Apr 15, 2008 at 11:44 AM, Ben Tay wrote: > Hi, > > Here's the summary for 1 processor. Seems like it's also using a long > time... Can someone tell me when my mistakes possibly lie? Thank you very > much! > > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 Wed > Apr 16 00:39:22 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.088e+03 1.00000 1.088e+03 > Objects: 4.300e+01 1.00000 4.300e+01 > Flops: 2.658e+11 1.00000 2.658e+11 2.658e+11 > Flops/sec: 2.444e+08 1.00000 2.444e+08 2.444e+08 > MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Reductions: 1.460e+04 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N --> > 2N flops > and VecAXPY() for complex vectors of length N --> > 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 1.0877e+03 100.0% 2.6584e+11 100.0% 0.000e+00 0.0% > 0.000e+00 0.0% 1.460e+04 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). 
> %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths in > this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. # > # preloading. otherwise timing numbers may be # > # meaningless. # > ########################################################## > > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 7412 1.0 1.3344e+02 1.0 2.16e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 12 11 0 0 0 12 11 0 0 0 216 > MatSolve 7413 1.0 2.6851e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 107 > MatLUFactorNum 1 1.0 4.3947e-02 1.0 8.83e+07 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 88 > MatILUFactorSym 1 1.0 3.7798e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 2.5835e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 6.0391e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatZeroEntries 1 1.0 1.7377e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPGMRESOrthog 7173 1.0 5.6323e+02 1.0 3.41e+08 1.0 0.0e+00 0.0e+00 > 7.2e+03 52 72 0 0 49 52 72 0 0 49 341 > KSPSetup 1 1.0 1.2676e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 1.0144e+03 1.0 2.62e+08 1.0 0.0e+00 0.0e+00 > 1.5e+04 93100 0 0100 93100 0 0100 262 > PCSetUp 1 1.0 8.7809e-02 1.0 4.42e+07 1.0 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 44 > PCApply 7413 1.0 2.6853e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 107 > VecMDot 7173 1.0 2.6720e+02 1.0 3.59e+08 1.0 0.0e+00 0.0e+00 > 7.2e+03 25 36 0 0 49 25 36 0 0 49 359 > VecNorm 7413 1.0 1.7125e+01 1.0 3.74e+08 1.0 0.0e+00 0.0e+00 > 7.4e+03 2 2 0 0 51 2 2 0 0 51 374 > VecScale 7413 1.0 9.2787e+00 1.0 3.45e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 1 0 0 0 1 1 0 0 0 345 > VecCopy 240 1.0 5.1628e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 241 1.0 6.4428e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 479 1.0 2.0082e+00 1.0 2.06e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 206 > VecMAXPY 7413 1.0 3.1536e+02 1.0 3.24e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 29 38 0 0 0 29 38 0 0 0 324 > VecAssemblyBegin 2 1.0 2.3127e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAssemblyEnd 2 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecNormalize 7413 1.0 2.6424e+01 1.0 3.64e+08 1.0 0.0e+00 0.0e+00 > 7.4e+03 2 4 0 0 51 2 4 0 0 51 364 > > 
------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. > > --- Event Stage 0: Main Stage > > Matrix 2 2 65632332 0 > Krylov Solver 1 1 17216 0 > Preconditioner 1 1 168 0 > Index Set 3 3 5185032 0 > Vec 36 36 120987640 0 > > ======================================================================================================================== > Average time to get PetscTime(): 3.09944e-07 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 Configure > options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 > --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 > --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 > --with-vendor-compilers=intel --with-x=0 > --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 > --with-batch=1 --with-mpi-shared=0 > --with-mpi-include=/usr/local/topspin/mpi/mpich/include > --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a > --with-mpirun=/usr/local/topspin/mpi/mpich/bi > n/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t > --with-shared=0 ----------------------------------------- > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 > 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 > Using PETSc arch: atlas3-mpi > ----------------------------------------- > Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. -fPIC > -O ----------------------------------------- > Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include - > I/home/enduser/g0306332/lib/hypre/include > -I/usr/local/topspin/mpi/mpich/include > ------------------------------------------ > Using C linker: mpicc -fPIC -O > Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: > -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi > -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib > -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib > -L/usr/local/topspin/mpi/mpich/lib -lmpich > -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t > -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 > -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt > -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport > -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl > -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs > -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc > ------------------------------------------ > 639.52user 4.80system 18:08.23elapsed 59%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (20major+172979minor)pagefaults 0swaps > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > TID HOST_NAME COMMAND_LINE STATUS > TERMINATION_TIME > ===== ========== ================ ======================= > =================== > 00000 atlas3-c45 time ./a.out -lo Done 04/16/2008 > 00:39:23 > > > Barry Smith 
wrote: > > > > > It is taking 8776 iterations of GMRES! How many does it take on one > process? This is a huge > > amount. > > > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 > 0.0e+00 10 11100100 0 10 11100100 0 217 > > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > > > > One process is spending 2.9 times as long in the embarresingly parallel > MatSolve then the other process; > > this indicates a huge imbalance in the number of nonzeros on each process. > As Matt noticed, the partitioning > > between the two processes is terrible. > > > > Barry > > > > On Apr 15, 2008, at 10:56 AM, Ben Tay wrote: > > > > > Oh sorry here's the whole information. I'm using 2 processors currently: > > > > > > > ************************************************************************************************************************ > > > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r > -fCourier9' to print this document *** > > > > ************************************************************************************************************************ > > > > > > ---------------------------------------------- PETSc Performance > Summary: ---------------------------------------------- > > > > > > ./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by g0306332 > Tue Apr 15 23:03:09 2008 > > > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 > HG revision: 414581156e67e55c761739b0deb119f7590d0f4b > > > > > > Max Max/Min Avg Total > > > Time (sec): 1.114e+03 1.00054 1.114e+03 > > > Objects: 5.400e+01 1.00000 5.400e+01 > > > Flops: 1.574e+11 1.00000 1.574e+11 3.147e+11 > > > Flops/sec: 1.414e+08 1.00054 1.413e+08 2.826e+08 > > > MPI Messages: 8.777e+03 1.00000 8.777e+03 1.755e+04 > > > MPI Message Lengths: 4.213e+07 1.00000 4.800e+03 8.425e+07 > > > MPI Reductions: 8.644e+03 1.00000 > > > > > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > > > e.g., VecAXPY() for real vectors of length N > --> 2N flops > > > and VecAXPY() for complex vectors of length N > --> 8N flops > > > > > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages > --- -- Message Lengths -- -- Reductions -- > > > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > > > 0: Main Stage: 1.1136e+03 100.0% 3.1475e+11 100.0% 1.755e+04 > 100.0% 4.800e+03 100.0% 1.729e+04 100.0% > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > > > Phase summary info: > > > Count: number of times phase was executed > > > Time and Flops/sec: Max - maximum over all processors > > > Ratio - ratio of maximum to minimum over all > processors > > > Mess: number of messages sent > > > Avg. len: average message length > > > Reduct: number of global reductions > > > Global: entire computation > > > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). 
> > > %T - percent time in this phase %F - percent flops in this > phase > > > %M - percent messages in this phase %L - percent message lengths > in this phase > > > %R - percent reductions in this phase > > > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time > over all processors) > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > > > > ########################################################## > > > # # > > > # WARNING!!! # > > > # # > > > # This code was run without the PreLoadBegin() # > > > # macros. To get timing results we always recommend # > > > # preloading. otherwise timing numbers may be # > > > # meaningless. # > > > ########################################################## > > > > > > > > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > > > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > --- Event Stage 0: Main Stage > > > > > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 > 0.0e+00 10 11100100 0 10 11100100 0 217 > > > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > > > MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 > > > MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 > 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0 > > > MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 > 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 > > > KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 > 1.7e+04 89100100100100 89100100100100 317 > > > PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 > > > VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 > 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 > > > VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 > 8.8e+03 9 2 0 0 51 9 2 0 0 51 42 > > > VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 > 0.0e+00 0 1 0 0 0 0 1 0 0 0 636 > > > VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > > VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 > > > VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 > 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 > > > VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > 
> > VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0 > > > VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > > VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 > 8.8e+03 9 4 0 0 51 9 4 0 0 51 62 > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > Memory usage is given in bytes: > > > > > > Object Type Creations Destructions Memory Descendants' > Mem. > > > > > > --- Event Stage 0: Main Stage > > > > > > Matrix 4 4 49227380 0 > > > Krylov Solver 2 2 17216 0 > > > Preconditioner 2 2 256 0 > > > Index Set 5 5 2596120 0 > > > Vec 40 40 62243224 0 > > > Vec Scatter 1 1 0 0 > > > > ======================================================================================================================== > > > Average time to get PetscTime(): 4.05312e-07 > > > Average time for MPI_Barrier(): 7.62939e-07 > > > Average time for zero size MPI_Send(): 2.02656e-06 > > > OptionTable: -log_summary > > > Compiled without FORTRAN kernels > > > Compiled with full precision matrices (default) > > > Compiled without FORTRAN kernels Compiled > with full precision matrices (default) > > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > > > Configure run at: Tue Jan 8 22:22:08 2008 > > > Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 > --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 > --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 > --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 > --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 > --with-batch=1 --with-mpi-shared=0 > --with-mpi-include=/usr/local/topspin/mpi/mpich/include > --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a > --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun > --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 > > > ----------------------------------------- > > > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 > > > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul > 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > > > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 > > > Using PETSc arch: atlas3-mpi > > > ----------------------------------------- > > > Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. > -fPIC -O ----------------------------------------- > > > Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include - > > > I/home/enduser/g0306332/lib/hypre/include > -I/usr/local/topspin/mpi/mpich/include > ------------------------------------------ > > > Using C linker: mpicc -fPIC -O > > > Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: > -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi > -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib > -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib > -L/usr/local/topspin/mpi/mpich/lib -lmpich > -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t > -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 > -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt > -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport > -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl > -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs > -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc > > > ------------------------------------------ > > > 1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata > 0maxresident)k > > > 0inputs+0outputs (28major+153248minor)pagefaults 0swaps > > > 387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata > 0maxresident)k > > > 0inputs+0outputs (18major+158175minor)pagefaults 0swaps > > > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > > TID HOST_NAME COMMAND_LINE STATUS > TERMINATION_TIME 
> > > ===== ========== ================ ======================= > =================== > > > 00000 atlas3-c05 time ./a.out -lo Done 04/15/2008 > 23:03:10 > > > 00001 atlas3-c05 time ./a.out -lo Done 04/15/2008 > 23:03:10 > > > > > > > > > I have a cartesian grid 600x720. Since there's 2 processors, it is > partitioned to 600x360. I just use: > > > > > > call > MatCreateMPIAIJ(MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k,5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr) > > > > > > call MatSetFromOptions(A_mat,ierr) > > > > > > call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr) > > > > > > call KSPCreate(MPI_COMM_WORLD,ksp,ierr) > > > > > > call > VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr) > > > > > > total_k is actually size_x*size_y. Since it's 2d, the maximum values per > row is 5. When you says setting off-process values, do you mean I insert > values from 1 processor into another? I thought I insert the values into the > correct processor... > > > > > > Thank you very much! > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > 1) Please never cut out parts of the summary. All the information is > valuable, > > > > and most times, necessary > > > > > > > > 2) You seem to have huge load imbalance (look at VecNorm). Do you > partition > > > > the system yourself. How many processes is this? > > > > > > > > 3) You seem to be setting a huge number of off-process values in the > matrix > > > > (see MatAssemblyBegin). Is this true? I would reorganize this part. > > > > > > > > Matt > > > > > > > > On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay wrote: > > > > > > > > > > > > > Hi, > > > > > > > > > > I have converted the poisson eqn part of the CFD code to parallel. > The grid > > > > > size tested is 600x720. For the momentum eqn, I used another serial > linear > > > > > solver (nspcg) to prevent mixing of results. 
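A quick way to check Matt's third point for an assembly loop like the one quoted above is to test each global row index against the ownership range before inserting: any entry whose row is owned by another process is stashed locally and shipped to its owner inside MatAssemblyBegin, which is one common cause of a very large MatAssemblyBegin time (the other being one process simply reaching the assembly call much later than the other, the imbalance Barry points out). A minimal Fortran sketch, reusing the A_mat, ksta_p/kend_p, int_A and big_A names from the quoted code; Istart, Iend and n_off are new placeholder variables, and declarations and PETSc include files are omitted as in the quoted fragments:

      call MatGetOwnershipRange(A_mat, Istart, Iend, ierr)
      n_off = 0
      do k = ksta_p+1, kend_p
         II = k - 1                      ! global row index, 0-based
         if (II .lt. Istart .or. II .ge. Iend) then
            ! this row is owned by another process: MatSetValues will
            ! stash the entry and send it during MatAssemblyBegin
            n_off = n_off + 1
         end if
         do kk = 1, 5
            JJ = int_A(k,kk) - 1         ! global column index, 0-based
            call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr)
         end do
      end do
      print *, 'entries set in off-process rows on this rank: ', n_off
      call MatAssemblyBegin(A_mat, MAT_FINAL_ASSEMBLY, ierr)
      call MatAssemblyEnd(A_mat, MAT_FINAL_ASSEMBLY, ierr)

If n_off comes out zero on every rank, the stash is empty and the large MatAssemblyBegin entry in the log is waiting time rather than communication, i.e. the two processes are not generating their values in comparable time.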
Here's the output > summary: > > > > > > > > > > --- Event Stage 0: Main Stage > > > > > > > > > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 > 4.8e+03 > > > > > 0.0e+00 10 11100100 0 10 11100100 0 217 > > > > > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > > > > > MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 > > > > > MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 > 0.0e+00 > > > > > 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0* > > > > > MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 > 2.4e+03 > > > > > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 > 0.0e+00 > > > > > 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 > > > > > KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 > 4.8e+03 > > > > > 1.7e+04 89100100100100 89100100100100 317 > > > > > PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 > 0.0e+00 > > > > > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > > > PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 > 0.0e+00 > > > > > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > > > PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 > > > > > VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 > 0.0e+00 > > > > > 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 > > > > > *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 > 0.0e+00 > > > > > 8.8e+03 9 2 0 0 51 9 2 0 0 51 42* > > > > > *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 1 0 0 0 0 1 0 0 0 636* > > > > > VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > > > > VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 > > > > > VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 > > > > > VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 > 4.8e+03 > > > > > 0.0e+00 0 0100100 0 0 0100100 0 0* > > > > > *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0* > > > > > *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 > 0.0e+00 > > > > > 8.8e+03 9 4 0 0 51 9 4 0 0 51 62* > > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > Memory usage is given in bytes: > > > > > Object Type Creations Destructions Memory > Descendants' Mem. 
> > > > > --- Event Stage 0: Main Stage > > > > > Matrix 4 4 49227380 0 > > > > > Krylov Solver 2 2 17216 0 > > > > > Preconditioner 2 2 256 0 > > > > > Index Set 5 5 2596120 0 > > > > > Vec 40 40 62243224 0 > > > > > Vec Scatter 1 1 0 0 > > > > > > ======================================================================================================================== > > > > > Average time to get PetscTime(): 4.05312e-07 > Average time > > > > > for MPI_Barrier(): 7.62939e-07 > > > > > Average time for zero size MPI_Send(): 2.02656e-06 > > > > > OptionTable: -log_summary > > > > > > > > > > > > > > > The PETSc manual states that ratio should be close to 1. There's > quite a > > > > > few *(in bold)* which are >1 and MatAssemblyBegin seems to be very > big. So > > > > > what could be the cause? > > > > > > > > > > I wonder if it has to do the way I insert the matrix. My steps are: > > > > > (cartesian grids, i loop faster than j, fortran) > > > > > > > > > > For matrix A and rhs > > > > > > > > > > Insert left extreme cells values belonging to myid > > > > > > > > > > if (myid==0) then > > > > > > > > > > insert corner cells values > > > > > > > > > > insert south cells values > > > > > > > > > > insert internal cells values > > > > > > > > > > else if (myid==num_procs-1) then > > > > > > > > > > insert corner cells values > > > > > > > > > > insert north cells values > > > > > > > > > > insert internal cells values > > > > > > > > > > else > > > > > > > > > > insert internal cells values > > > > > > > > > > end if > > > > > > > > > > Insert right extreme cells values belonging to myid > > > > > > > > > > All these values are entered into a big_A(size_x*size_y,5) matrix. > int_A > > > > > stores the position of the values. I then do > > > > > > > > > > call MatZeroEntries(A_mat,ierr) > > > > > > > > > > do k=ksta_p+1,kend_p !for cells belonging to myid > > > > > > > > > > do kk=1,5 > > > > > > > > > > II=k-1 > > > > > > > > > > JJ=int_A(k,kk)-1 > > > > > > > > > > call > MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr) > > > > > end do > > > > > > > > > > end do > > > > > > > > > > call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr) > > > > > > > > > > call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr) > > > > > > > > > > > > > > > I wonder if the problem lies here.I used the big_A matrix because I > was > > > > > migrating from an old linear solver. Lastly, I was told to widen my > window > > > > > to 120 characters. May I know how do I do it? > > > > > > > > > > > > > > > > > > > > Thank you very much. > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > > > > > > > > > > > > On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Matthew, > > > > > > > > > > > > > > I think you've misunderstood what I meant. What I'm trying to > say is > > > > > > > initially I've got a serial code. I tried to convert to a > parallel one. > > > > > > > > > > > > > > > > > > > > > > > > > Then > > > > > > > > > > > > > > > > > > > > > > > I tested it and it was pretty slow. Due to some work > requirement, I need > > > > > > > > > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > > > > > > > > go back to make some changes to my code. Since the parallel is > not > > > > > > > > > > > > > > > > > > > > > > > > > working > > > > > > > > > > > > > > > > > > > > > > > well, I updated and changed the serial one. 
> > > > > > > Well, that was a while ago and now, due to the updates and changes, the
> > > > > > > serial code is different from the old converted parallel code. Some files
> > > > > > > were also deleted and I can't seem to get it working now. So I thought I
> > > > > > > might as well convert the new serial code to parallel. But I'm not very
> > > > > > > sure what I should do 1st.
> > > > > > >
> > > > > > > Maybe I should rephrase my question in that if I just convert my poisson
> > > > > > > equation subroutine from a serial PETSc to a parallel PETSc version, will
> > > > > > > it work? Should I expect a speedup? The rest of my code is still serial.
> > > > > >
> > > > > > You should, of course, only expect speedup in the parallel parts
> > > > > >
> > > > > >    Matt
> > > > > >
> > > > > > > Thank you very much.
> > > > > > >
> > > > > > > Matthew Knepley wrote:
> > > > > > > > I am not sure why you would ever have two codes. I never do this. PETSc
> > > > > > > > is designed to write one code to run in serial and parallel. The PETSc
> > > > > > > > part should look identical. To test, run the code you have verified in
> > > > > > > > serial and output PETSc data structures (like Mat and Vec) using a
> > > > > > > > binary viewer. Then run in parallel with the same code, which will
> > > > > > > > output the same structures. Take the two files and write a small
> > > > > > > > verification code that loads both versions and calls MatEqual and
> > > > > > > > VecEqual.
> > > > > > > >
> > > > > > > >    Matt
> > > > > > > >
> > > > > > > > On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay wrote:
> > > > > > > > > Thank you Matthew. Sorry to trouble you again.
> > > > > > > > >
> > > > > > > > > I tried to run it with -log_summary output and I found that there's
> > > > > > > > > some errors in the execution. Well, I was busy with other things and
> > > > > > > > > I just came back to this problem. Some of my files on the server have
> > > > > > > > > also been deleted. It has been a while and I remember that it worked
> > > > > > > > > before, only much slower.
> > > > > > > > >
> > > > > > > > > Anyway, most of the serial code has been updated and maybe it's
> > > > > > > > > easier to convert the new serial code instead of debugging on the old
> > > > > > > > > parallel code now. I believe I can still reuse part of the old
> > > > > > > > > parallel code. However, I hope I can approach it better this time.
> > > > > > > > >
> > > > > > > > > So suppose I need to start converting my new serial code to parallel.
> > > > > > > > > There's 2 eqns to be solved using PETSc, the momentum and poisson. I
> > > > > > > > > also need to parallelize other parts of my code. I wonder which route
> > > > > > > > > is the best:
> > > > > > > > >
> > > > > > > > > 1. Don't change the PETSc part ie continue using PETSC_COMM_SELF,
> > > > > > > > > modify other parts of my code to parallel e.g. looping, updating of
> > > > > > > > > values etc. Once the execution is fine and speedup is reasonable,
> > > > > > > > > then modify the PETSc part - poisson eqn 1st followed by the momentum
> > > > > > > > > eqn.
> > > > > > > > >
> > > > > > > > > 2. Reverse the above order ie modify the PETSc part - poisson eqn 1st
> > > > > > > > > followed by the momentum eqn. Then do other parts of my code.
> > > > > > > > >
> > > > > > > > > I'm not sure if the above 2 methods can work or if there will be
> > > > > > > > > conflicts. Of course, an alternative will be:
> > > > > > > > >
> > > > > > > > > 3. Do the poisson, momentum eqns and other parts of the code
> > > > > > > > > separately. That is, code a standalone parallel poisson eqn and use
> > > > > > > > > sample values to test it. Same for the momentum and other parts of
> > > > > > > > > the code. When each of them is working, combine them to form the full
> > > > > > > > > parallel code. However, this will be much more troublesome.
> > > > > > > > >
> > > > > > > > > I hope someone can give me some recommendations.
> > > > > > > > >
> > > > > > > > > Thank you once again.
> > > > > > > > >
> > > > > > > > > Matthew Knepley wrote:
> > > > > > > > > > 1) There is no way to have any idea what is going on in your code
> > > > > > > > > > without -log_summary output
> > > > > > > > > >
> > > > > > > > > > 2) Looking at that output, look at the percentage taken by the
> > > > > > > > > > solver KSPSolve event. I suspect it is not the biggest component,
> > > > > > > > > > because it is very scalable.
> > > > > > > > > >
> > > > > > > > > >    Matt
> > > > > > > > > >
> > > > > > > > > > On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay wrote:
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > I've a serial 2D CFD code. As my grid size requirement increases,
> > > > > > > > > > > the simulation takes longer. Also, memory requirement becomes a
> > > > > > > > > > > problem. Grid size has reached 1200x1200. Going higher is not
> > > > > > > > > > > possible due to memory problem.
> > > > > > > > > > >
> > > > > > > > > > > I tried to convert my code to a parallel one, following the
> > > > > > > > > > > examples given. I also need to restructure parts of my code to
> > > > > > > > > > > enable parallel looping. I 1st changed the PETSc solver to be
> > > > > > > > > > > parallel enabled and then I restructured parts of my code. I
> > > > > > > > > > > proceed on as long as the answer for a simple test case is
> > > > > > > > > > > correct. I thought it's not really possible to do any speed
> > > > > > > > > > > testing since the code is not fully parallelized yet. When I
> > > > > > > > > > > finished most of the conversion, I found that in the actual run
> > > > > > > > > > > it is much slower, although the answer is correct.
> > > > > > > > > > >
> > > > > > > > > > > So what is the remedy now? I wonder what I should do to check
> > > > > > > > > > > what's wrong. Must I restart everything again?
Btw, my grid size is > 1200x1200. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > believed > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > it should be suitable for parallel run of 4 processors? > Is that > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > so? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From pivello at gmail.com Tue Apr 15 13:46:49 2008 From: pivello at gmail.com (=?ISO-8859-1?Q?M=E1rcio_Ricardo_Pivello?=) Date: Tue, 15 Apr 2008 15:46:49 -0300 Subject: PETSc + HYPRE In-Reply-To: References: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> Message-ID: <7d6158b80804151146w67635185v75088d41724a6fb3@mail.gmail.com> Hy, Matthew, thanks for your help. Firstly, I'm solving a 3D Incompressible Navier Stokes solver based on FEM, with fluid-structure interaction. In this case, I'm simulating the blood flow inside an aneurysm in an abdominal aorta artery. By not working I mean the error does not decrease with time. Our team is just starting using HYPRE, in fact this is the very first case we run with it. Again, thanks for your help. M?rcio Ricardo. -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Tue Apr 15 13:51:26 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 13:51:26 -0500 Subject: PETSc + HYPRE In-Reply-To: <7d6158b80804151146w67635185v75088d41724a6fb3@mail.gmail.com> References: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> <7d6158b80804151146w67635185v75088d41724a6fb3@mail.gmail.com> Message-ID: On Tue, Apr 15, 2008 at 1:46 PM, M?rcio Ricardo Pivello wrote: > Hy, Matthew, thanks for your help. > > Firstly, I'm solving a 3D Incompressible Navier Stokes solver based on FEM, > with fluid-structure interaction. In this case, I'm simulating the blood > flow inside an aneurysm in an abdominal aorta artery. > By not working I mean the error does not decrease with time. Our team is In this case, in addition to my last mail, you want to look at -ksp_monitor -ksp_converged_reason to see what happened in the solver. Matt > just starting using HYPRE, in fact this is the very first case we run with > it. > > > Again, thanks for your help. > > > M?rcio Ricardo. 
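For reference, a minimal sketch of the KSP/PC setup and runtime monitoring
being discussed in this thread, written against the PETSc 2.3.x-era C API.
The routine name and the assumption that the matrix A and vectors b, x are
already assembled are illustrative, not taken from Márcio's code.

/* Sketch only: configure a KSP to use hypre BoomerAMG through its PC
   and leave room for -ksp_view / -ksp_monitor / -ksp_converged_reason. */
#include "petscksp.h"

PetscErrorCode solve_with_boomeramg(Mat A, Vec b, Vec x)
{
  KSP            ksp;
  PC             pc;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);CHKERRQ(ierr);

  /* Create the KSP first, then extract and configure its PC. */
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCHYPRE);CHKERRQ(ierr);
  ierr = PCHYPRESetType(pc, "boomeramg");CHKERRQ(ierr);

  /* Pick up -ksp_view, -ksp_monitor, -ksp_converged_reason, etc. */
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = KSPDestroy(ksp);CHKERRQ(ierr); /* 2.3.x signature takes the KSP itself */
  PetscFunctionReturn(0);
}

The same configuration can also be selected entirely from the command line
with -pc_type hypre -pc_hypre_type boomeramg -ksp_view, which confirms that
hypre is actually the preconditioner being used.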
> > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From dalcinl at gmail.com Tue Apr 15 18:43:22 2008 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Tue, 15 Apr 2008 20:43:22 -0300 Subject: PETSc + HYPRE In-Reply-To: <7d6158b80804151146w67635185v75088d41724a6fb3@mail.gmail.com> References: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> <7d6158b80804151146w67635185v75088d41724a6fb3@mail.gmail.com> Message-ID: Sorry for my insistence, but... Did you see my previous mail? The code you wrote is not OK. You have to first create the KSP, next extract the PC with KSPGetPC, and then configure the PC to use HYPRE+BoomerAMG To be sure you are actually being using hypre, add -ksp_view to command line. On 4/15/08, M?rcio Ricardo Pivello wrote: > Hy, Matthew, thanks for your help. > > Firstly, I'm solving a 3D Incompressible Navier Stokes solver based on FEM, > with fluid-structure interaction. In this case, I'm simulating the blood > flow inside an aneurysm in an abdominal aorta artery. > By not working I mean the error does not decrease with time. Our team is > just starting using HYPRE, in fact this is the very first case we run with > it. > > > Again, thanks for your help. > > > M?rcio Ricardo. > > > -- Lisandro Dalc?n --------------- Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) PTLC - G?emes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594 From rlmackie862 at gmail.com Tue Apr 15 19:19:14 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Tue, 15 Apr 2008 17:19:14 -0700 Subject: general question on speed using quad core Xeons Message-ID: <48054602.9040200@gmail.com> I'm running my PETSc code on a cluster of quad core Xeon's connected by Infiniband. I hadn't much worried about the performance, because everything seemed to be working quite well, but today I was actually comparing performance (wall clock time) for the same problem, but on different combinations of CPUS. I find that my PETSc code is quite scalable until I start to use multiple cores/cpu. For example, the run time doesn't improve by going from 1 core/cpu to 4 cores/cpu, and I find this to be very strange, especially since looking at top or Ganglia, all 4 cpus on each node are running at 100% almost all of the time. I would have thought if the cpus were going all out, that I would still be getting much more scalable results. We are using mvapich-0.9.9 with infiniband. So, I don't know if this is a cluster/Xeon issue, or something else. Anybody with experience on this? Thanks, Randy M. From knepley at gmail.com Tue Apr 15 19:34:08 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 19:34:08 -0500 Subject: general question on speed using quad core Xeons In-Reply-To: <48054602.9040200@gmail.com> References: <48054602.9040200@gmail.com> Message-ID: On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie wrote: > I'm running my PETSc code on a cluster of quad core Xeon's connected > by Infiniband. I hadn't much worried about the performance, because > everything seemed to be working quite well, but today I was actually > comparing performance (wall clock time) for the same problem, but on > different combinations of CPUS. 
> > I find that my PETSc code is quite scalable until I start to use > multiple cores/cpu. > > For example, the run time doesn't improve by going from 1 core/cpu > to 4 cores/cpu, and I find this to be very strange, especially since > looking at top or Ganglia, all 4 cpus on each node are running at 100% > almost > all of the time. I would have thought if the cpus were going all out, > that I would still be getting much more scalable results. Those a really coarse measures. There is absolutely no way that all cores are going 100%. Its easy to show by hand. Take the peak flop rate and this gives you the bandwidth needed to sustain that computation (if everything is perfect, like axpy). You will find that the chip bandwidth is far below this. A nice analysis is in http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf > We are using mvapich-0.9.9 with infiniband. So, I don't know if > this is a cluster/Xeon issue, or something else. This is actually mathematics! How satisfying. The only way to improve this is to change the data structure (e.g. use blocks) or change the algorithm (e.g. use spectral elements and unassembled structures) Matt > Anybody with experience on this? > > Thanks, Randy M. > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From rlmackie862 at gmail.com Tue Apr 15 19:41:09 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Tue, 15 Apr 2008 17:41:09 -0700 Subject: general question on speed using quad core Xeons In-Reply-To: References: <48054602.9040200@gmail.com> Message-ID: <48054B25.5030702@gmail.com> Then what's the point of having 4 and 8 cores per cpu for parallel computations then? I mean, I think I've done all I can to make my code as efficient as possible. I'm not quite sure I understand your comment about using blocks or unassembled structures. Randy Matthew Knepley wrote: > On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie wrote: >> I'm running my PETSc code on a cluster of quad core Xeon's connected >> by Infiniband. I hadn't much worried about the performance, because >> everything seemed to be working quite well, but today I was actually >> comparing performance (wall clock time) for the same problem, but on >> different combinations of CPUS. >> >> I find that my PETSc code is quite scalable until I start to use >> multiple cores/cpu. >> >> For example, the run time doesn't improve by going from 1 core/cpu >> to 4 cores/cpu, and I find this to be very strange, especially since >> looking at top or Ganglia, all 4 cpus on each node are running at 100% >> almost >> all of the time. I would have thought if the cpus were going all out, >> that I would still be getting much more scalable results. > > Those a really coarse measures. There is absolutely no way that all cores > are going 100%. Its easy to show by hand. Take the peak flop rate and > this gives you the bandwidth needed to sustain that computation (if > everything is perfect, like axpy). You will find that the chip bandwidth > is far below this. A nice analysis is in > > http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf > >> We are using mvapich-0.9.9 with infiniband. So, I don't know if >> this is a cluster/Xeon issue, or something else. > > This is actually mathematics! How satisfying. The only way to improve > this is to change the data structure (e.g. use blocks) or change the > algorithm (e.g. 
use spectral elements and unassembled structures) > > Matt > >> Anybody with experience on this? >> >> Thanks, Randy M. >> >> > > > From knepley at gmail.com Tue Apr 15 19:46:17 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 19:46:17 -0500 Subject: general question on speed using quad core Xeons In-Reply-To: <48054B25.5030702@gmail.com> References: <48054602.9040200@gmail.com> <48054B25.5030702@gmail.com> Message-ID: On Tue, Apr 15, 2008 at 7:41 PM, Randall Mackie wrote: > Then what's the point of having 4 and 8 cores per cpu for parallel > computations then? I mean, I think I've done all I can to make > my code as efficient as possible. I really advise reading the paper. It explicitly treats the case of blocking, and uses a simple model to demonstrate all the points I made. With a single, scalar sparse matrix, there is definitely no point at all of having multiple cores. However, this will speed up things like finite element integration. So, for instance, making this integration dominate your cost (like spectral element codes do) will show nice speedup. Ulrich Ruede has a great talk about this on his website. Matt > I'm not quite sure I understand your comment about using blocks > or unassembled structures. > > > Randy > > > > > Matthew Knepley wrote: > > > On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie > wrote: > > > > > I'm running my PETSc code on a cluster of quad core Xeon's connected > > > by Infiniband. I hadn't much worried about the performance, because > > > everything seemed to be working quite well, but today I was actually > > > comparing performance (wall clock time) for the same problem, but on > > > different combinations of CPUS. > > > > > > I find that my PETSc code is quite scalable until I start to use > > > multiple cores/cpu. > > > > > > For example, the run time doesn't improve by going from 1 core/cpu > > > to 4 cores/cpu, and I find this to be very strange, especially since > > > looking at top or Ganglia, all 4 cpus on each node are running at 100% > > > almost > > > all of the time. I would have thought if the cpus were going all out, > > > that I would still be getting much more scalable results. > > > > > > > Those a really coarse measures. There is absolutely no way that all cores > > are going 100%. Its easy to show by hand. Take the peak flop rate and > > this gives you the bandwidth needed to sustain that computation (if > > everything is perfect, like axpy). You will find that the chip bandwidth > > is far below this. A nice analysis is in > > > > http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf > > > > > > > We are using mvapich-0.9.9 with infiniband. So, I don't know if > > > this is a cluster/Xeon issue, or something else. > > > > > > > This is actually mathematics! How satisfying. The only way to improve > > this is to change the data structure (e.g. use blocks) or change the > > algorithm (e.g. use spectral elements and unassembled structures) > > > > Matt > > > > > > > Anybody with experience on this? > > > > > > Thanks, Randy M. > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
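To make the bandwidth argument above concrete, here is a rough
back-of-the-envelope estimate in C. The peak flop rate and memory bandwidth
below are illustrative 2008-era numbers, not measurements of this cluster.

/* Rough estimate of the memory bandwidth a perfect AXPY (y = a*x + y)
   would need to keep one core busy, versus what a socket can deliver. */
#include <stdio.h>

int main(void)
{
  double peak_gflops    = 10.0; /* assumed peak of one 2008-era Xeon core      */
  double bytes_per_flop = 12.0; /* AXPY: 2 flops move x[i], y[i] in, y[i] out,
                                   i.e. 24 bytes of traffic per 2 flops        */
  double socket_bw_gbs  = 10.0; /* assumed front-side-bus bandwidth, shared by
                                   all four cores on the socket                */

  double needed_gbs = peak_gflops * bytes_per_flop; /* per core */

  printf("bandwidth needed per core : %6.1f GB/s\n", needed_gbs);
  printf("bandwidth available/socket: %6.1f GB/s\n", socket_bw_gbs);
  printf("so memory-bound kernels run at a few percent of peak, and adding\n"
         "cores on the same socket adds no bandwidth.\n");
  return 0;
}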
-- Norbert Wiener From zonexo at gmail.com Tue Apr 15 19:52:19 2008 From: zonexo at gmail.com (Ben Tay) Date: Wed, 16 Apr 2008 08:52:19 +0800 Subject: Slow speed after changing from serial to parallel In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> Message-ID: <48054DC3.8080005@gmail.com> Hi, I was initially using LU and Hypre to solve my serial code. I switched to the default GMRES when I converted the parallel code. I've now redo the test using KSPBCGS and also Hypre BommerAMG. Seems like MatAssemblyBegin, VecAYPX, VecScatterEnd (in bold) are the problems. What should I be checking? Here's the results for 1 and 2 processor for each solver. Thank you so much! *1 processor KSPBCGS * ************************************************************************************************************************ *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document *** ************************************************************************************************************************ ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 Wed Apr 16 08:32:21 2008 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b Max Max/Min Avg Total Time (sec): 8.176e+01 1.00000 8.176e+01 Objects: 2.700e+01 1.00000 2.700e+01 Flops: 1.893e+10 1.00000 1.893e+10 1.893e+10 Flops/sec: 2.315e+08 1.00000 2.315e+08 2.315e+08 MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Reductions: 3.743e+03 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 8.1756e+01 100.0% 1.8925e+10 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 3.743e+03 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops/sec: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). %T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ ########################################################## # # # WARNING!!! # # # # This code was run without the PreLoadBegin() # # macros. 
To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. # ########################################################## Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 1498 1.0 1.6548e+01 1.0 3.55e+08 1.0 0.0e+00 0.0e+00 0.0e+00 20 31 0 0 0 20 31 0 0 0 355 MatSolve 1500 1.0 3.2228e+01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 0.0e+00 39 31 0 0 0 39 31 0 0 0 183 MatLUFactorNum 2 1.0 2.0642e-01 1.0 1.02e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 102 MatILUFactorSym 2 1.0 2.0250e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyBegin 2 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyEnd 2 1.0 1.7963e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 2 1.0 3.8147e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 2 1.0 2.6301e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatZeroEntries 2 1.0 1.0190e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSetup 2 1.0 2.8230e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 2 1.0 6.7238e+01 1.0 2.81e+08 1.0 0.0e+00 0.0e+00 3.7e+03 82100 0 0100 82100 0 0100 281 PCSetUp 2 1.0 4.3527e-01 1.0 4.85e+07 1.0 0.0e+00 0.0e+00 6.0e+00 1 0 0 0 0 1 0 0 0 0 48 PCApply 1500 1.0 3.2232e+01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 0.0e+00 39 31 0 0 0 39 31 0 0 0 183 VecDot 2984 1.0 5.3279e+00 1.0 4.84e+08 1.0 0.0e+00 0.0e+00 3.0e+03 7 14 0 0 80 7 14 0 0 80 484 VecNorm 754 1.0 1.1453e+00 1.0 5.74e+08 1.0 0.0e+00 0.0e+00 7.5e+02 1 3 0 0 20 1 3 0 0 20 574 VecCopy 2 1.0 3.2830e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 3 1.0 3.9389e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAXPY 2244 1.0 4.8304e+00 1.0 4.02e+08 1.0 0.0e+00 0.0e+00 0.0e+00 6 10 0 0 0 6 10 0 0 0 402 VecAYPX 752 1.0 1.5623e+00 1.0 4.19e+08 1.0 0.0e+00 0.0e+00 0.0e+00 2 3 0 0 0 2 3 0 0 0 419 VecWAXPY 1492 1.0 5.0827e+00 1.0 2.54e+08 1.0 0.0e+00 0.0e+00 0.0e+00 6 7 0 0 0 6 7 0 0 0 254 VecAssemblyBegin 2 1.0 2.6703e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAssemblyEnd 2 1.0 5.2452e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. 
--- Event Stage 0: Main Stage Matrix 4 4 300369852 0 Krylov Solver 2 2 8 0 Preconditioner 2 2 336 0 Index Set 6 6 15554064 0 Vec 13 13 44937496 0 ======================================================================================================================== Average time to get PetscTime(): 3.09944e-07 OptionTable: -log_summary Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 Configure run at: Tue Jan 8 22:22:08 2008 Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi-shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 ----------------------------------------- Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 *2 processors KSPBCGS * ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./a.out on a atlas3-mp named atlas3-c25 with 2 processors, by g0306332 Wed Apr 16 08:37:25 2008 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b Max Max/Min Avg Total Time (sec): 3.795e+02 1.00000 3.795e+02 Objects: 3.800e+01 1.00000 3.800e+01 Flops: 8.592e+09 1.00000 8.592e+09 1.718e+10 Flops/sec: 2.264e+07 1.00000 2.264e+07 4.528e+07 MPI Messages: 1.335e+03 1.00000 1.335e+03 2.670e+03 MPI Message Lengths: 6.406e+06 1.00000 4.798e+03 1.281e+07 MPI Reductions: 1.678e+03 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 3.7950e+02 100.0% 1.7185e+10 100.0% 2.670e+03 100.0% 4.798e+03 100.0% 3.357e+03 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops/sec: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). %T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ ########################################################## # # # WARNING!!! 
# # # # This code was run without the PreLoadBegin() # # macros. To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. # ########################################################## Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 1340 1.0 7.4356e+01 1.6 5.87e+07 1.6 2.7e+03 4.8e+03 0.0e+00 16 31100100 0 16 31100100 0 72 MatSolve 1342 1.0 4.3794e+01 1.2 7.08e+07 1.2 0.0e+00 0.0e+00 0.0e+00 11 31 0 0 0 11 31 0 0 0 123 MatLUFactorNum 2 1.0 2.5116e-01 1.0 7.68e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 153 MatILUFactorSym 2 1.0 2.3831e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 *MatAssemblyBegin 2 1.0 7.9380e-0116482.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0* MatAssemblyEnd 2 1.0 2.4782e-01 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 2 1.0 5.0068e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 2 1.0 1.8508e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatZeroEntries 2 1.0 8.6530e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSetup 3 1.0 1.9901e-01 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 2 1.0 3.3575e+02 1.0 2.56e+07 1.0 2.7e+03 4.8e+03 3.3e+03 88100100100100 88100100100100 51 PCSetUp 3 1.0 5.0751e-01 1.0 3.79e+07 1.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 76 PCSetUpOnBlocks 1 1.0 4.4248e-02 1.0 4.39e+07 1.0 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 88 PCApply 1342 1.0 4.9832e+01 1.2 6.56e+07 1.2 0.0e+00 0.0e+00 0.0e+00 12 31 0 0 0 12 31 0 0 0 108 VecDot 2668 1.0 2.0710e+02 1.2 6.70e+06 1.2 0.0e+00 0.0e+00 2.7e+03 50 13 0 0 79 50 13 0 0 79 11 VecNorm 675 1.0 2.9565e+01 3.3 3.33e+07 3.3 0.0e+00 0.0e+00 6.7e+02 5 3 0 0 20 5 3 0 0 20 20 VecCopy 2 1.0 2.4400e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 1338 1.0 5.9052e+00 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 VecAXPY 2007 1.0 2.2173e+01 2.6 1.03e+08 2.6 0.0e+00 0.0e+00 0.0e+00 4 10 0 0 0 4 10 0 0 0 79 *VecAYPX 673 1.0 2.8062e+00 4.0 4.29e+08 4.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 213* VecWAXPY 1334 1.0 4.8052e+00 2.4 2.84e+08 2.4 0.0e+00 0.0e+00 0.0e+00 1 7 0 0 0 1 7 0 0 0 240 VecAssemblyBegin 2 1.0 1.4091e-04 3.1 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAssemblyEnd 2 1.0 5.0068e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 *VecScatterBegin 1334 1.0 1.1666e-01 5.9 0.00e+00 0.0 2.7e+03 4.8e+03 0.0e+00 0 0100100 0 0 0100100 0 0* VecScatterEnd 1334 1.0 5.2569e+01 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 0 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. 
--- Event Stage 0: Main Stage Matrix 6 6 283964900 0 Krylov Solver 3 3 8 0 Preconditioner 3 3 424 0 Index Set 8 8 12965152 0 Vec 17 17 34577080 0 Vec Scatter 1 1 0 0 ======================================================================================================================== Average time to get PetscTime(): 8.10623e-07 Average time for MPI_Barrier(): 5.72205e-07 Average time for zero size MPI_Send(): 1.90735e-06 OptionTable: -log_summary Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 Configure run at: Tue Jan 8 22:22:08 2008 @ @ *1 processor Hypre * ************************************************************************************************************************ *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document *** ************************************************************************************************************************ ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 Wed Apr 16 08:45:38 2008 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b Max Max/Min Avg Total Time (sec): 2.059e+01 1.00000 2.059e+01 Objects: 3.400e+01 1.00000 3.400e+01 Flops: 3.151e+08 1.00000 3.151e+08 3.151e+08 Flops/sec: 1.530e+07 1.00000 1.530e+07 1.530e+07 MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Reductions: 2.400e+01 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 2.0590e+01 100.0% 3.1512e+08 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 2.400e+01 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops/sec: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). %T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ ########################################################## # # # WARNING!!! # # # # This code was run without the PreLoadBegin() # # # # This code was run without the PreLoadBegin() # # macros. To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. 
# ########################################################## Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 12 1.0 2.6237e-01 1.0 4.24e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 35 0 0 0 1 35 0 0 0 424 MatSolve 7 1.0 4.5932e-01 1.0 2.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00 2 33 0 0 0 2 33 0 0 0 223 MatLUFactorNum 1 1.0 1.2635e-01 1.0 1.36e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 5 0 0 0 1 5 0 0 0 136 MatILUFactorSym 1 1.0 1.3007e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 1 0 0 0 4 1 0 0 0 4 0 MatConvert 1 1.0 4.1277e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0 MatAssemblyBegin 2 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyEnd 2 1.0 1.3946e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 MatGetRow 432000 1.0 8.4685e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 2 1.0 3.0994e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 1 1.0 1.6376e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 8 0 0 0 0 8 0 MatZeroEntries 2 1.0 8.2422e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPGMRESOrthog 6 1.0 1.0955e-01 1.0 3.31e+08 1.0 0.0e+00 0.0e+00 6.0e+00 1 12 0 0 25 1 12 0 0 25 331 KSPSetup 2 1.0 2.5418e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 2 1.0 5.9363e+00 1.0 5.31e+07 1.0 0.0e+00 0.0e+00 1.8e+01 29100 0 0 75 29100 0 0 75 53 PCSetUp 2 1.0 1.5691e+00 1.0 1.10e+07 1.0 0.0e+00 0.0e+00 5.0e+00 8 5 0 0 21 8 5 0 0 21 11 PCApply 14 1.0 3.7548e+00 1.0 2.73e+07 1.0 0.0e+00 0.0e+00 0.0e+00 18 33 0 0 0 18 33 0 0 0 27 VecMDot 6 1.0 7.7139e-02 1.0 2.35e+08 1.0 0.0e+00 0.0e+00 6.0e+00 0 6 0 0 25 0 6 0 0 25 235 VecNorm 14 1.0 9.9192e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 7.0e+00 0 6 0 0 29 0 6 0 0 29 183 VecScale 7 1.0 5.4052e-03 1.0 5.59e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 559 VecCopy 1 1.0 2.0301e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 9 1.0 1.1883e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAXPY 7 1.0 2.8702e-02 1.0 3.91e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 391 VecAYPX 6 1.0 2.8528e-02 1.0 3.63e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 363 VecMAXPY 7 1.0 4.1699e-02 1.0 5.59e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 7 0 0 0 0 7 0 0 0 559 VecAssemblyBegin 2 1.0 2.3842e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 25 0 0 0 0 25 0 VecAssemblyEnd 2 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecNormalize 7 1.0 1.3958e-02 1.0 6.50e+08 1.0 0.0e+00 0.0e+00 7.0e+00 0 3 0 0 29 0 3 0 0 29 650 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. 
--- Event Stage 0: Main Stage Matrix 3 3 267569524 0 Krylov Solver 2 2 17224 0 Preconditioner 2 2 440 0 Index Set 3 3 10369032 0 Vec 24 24 82961752 0 ======================================================================================================================== Average time to get PetscTime(): 1.90735e-07 OptionTable: -log_summary Compiled without FORTRAN kernels *2 processors Hypre* ************************************************************************************************************************ *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document *** ************************************************************************************************************************ ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./a.out on a atlas3-mp named atlas3-c48 with 2 processors, by g0306332 Wed Apr 16 08:46:56 2008 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b Max Max/Min Avg Total Time (sec): 9.614e+01 1.02903 9.478e+01 Objects: 4.100e+01 1.00000 4.100e+01 Flops: 2.778e+08 1.00000 2.778e+08 5.555e+08 Flops/sec: 2.973e+06 1.02903 2.931e+06 5.862e+06 MPI Messages: 7.000e+00 1.00000 7.000e+00 1.400e+01 MPI Message Lengths: 3.120e+04 1.00000 4.457e+03 6.240e+04 MPI Reductions: 1.650e+01 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 9.4784e+01 100.0% 5.5553e+08 100.0% 1.400e+01 100.0% 4.457e+03 100.0% 3.300e+01 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops/sec: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). %T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ ########################################################## # # # WARNING!!! # # # # This code was run without the PreLoadBegin() # # macros. To get timing results we always recommend # # preloading. otherwise timing numbers may be # # meaningless. 
# ########################################################## Event Count Time (sec) Flops/sec --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s --- Event Stage 0: Main Stage MatMult 12 1.0 4.5412e-01 2.0 4.34e+08 2.0 1.2e+01 4.8e+03 0.0e+00 0 36 86 92 0 0 36 86 92 0 438 MatSolve 7 1.0 5.0386e-01 1.1 2.28e+08 1.1 0.0e+00 0.0e+00 0.0e+00 1 37 0 0 0 1 37 0 0 0 407 MatLUFactorNum 1 1.0 9.5120e-01 1.6 2.98e+07 1.6 0.0e+00 0.0e+00 0.0e+00 1 6 0 0 0 1 6 0 0 0 36 MatILUFactorSym 1 1.0 1.1285e+01 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 9 0 0 0 3 9 0 0 0 3 0 MatConvert 1 1.0 6.2023e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 *MatAssemblyBegin 2 1.0 3.1003e+01246.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 16 0 0 0 6 16 0 0 0 6 0* MatAssemblyEnd 2 1.0 2.2413e+00 1.9 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00 2 0 14 8 21 2 0 14 8 21 0 MatGetRow 216000 1.0 9.2643e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 3 1.0 5.9605e-06 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 1 1.0 2.4464e-01 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 6 0 0 0 0 6 0 MatZeroEntries 2 1.0 6.1072e+00 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 5 0 0 0 0 5 0 0 0 0 0 KSPGMRESOrthog 6 1.0 4.4529e-02 1.3 5.26e+08 1.3 0.0e+00 0.0e+00 6.0e+00 0 7 0 0 18 0 7 0 0 18 815 KSPSetup 2 1.0 1.8315e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0 KSPSolve 2 1.0 3.0572e+01 1.1 9.64e+06 1.1 1.2e+01 4.8e+03 1.8e+01 31100 86 92 55 31100 86 92 55 18 PCSetUp 2 1.0 2.0424e+01 1.3 1.07e+06 1.3 0.0e+00 0.0e+00 5.0e+00 19 6 0 0 15 19 6 0 0 15 2 PCApply 14 1.0 2.9443e+00 1.0 3.56e+07 1.0 0.0e+00 0.0e+00 0.0e+00 3 37 0 0 0 3 37 0 0 0 70 VecMDot 6 1.0 2.7561e-02 1.6 5.15e+08 1.6 0.0e+00 0.0e+00 6.0e+00 0 3 0 0 18 0 3 0 0 18 658 *VecNorm 14 1.0 1.4223e+00 5.1 5.45e+07 5.1 0.0e+00 0.0e+00 7.0e+00 1 5 0 0 21 1 5 0 0 21 21* VecScale 7 1.0 1.8604e-02 1.0 8.25e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 163 VecCopy 1 1.0 3.0069e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 9 1.0 3.2693e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAXPY 7 1.0 3.0581e-02 1.1 3.98e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 706 *VecAYPX 6 1.0 4.4344e+00147.6 3.45e+08147.6 0.0e+00 0.0e+00 0.0e+00 2 4 0 0 0 2 4 0 0 0 5* VecMAXPY 7 1.0 2.1892e-02 1.0 5.34e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 1066 VecAssemblyBegin 2 1.0 9.2602e-0412.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 18 0 0 0 0 18 0 VecAssemblyEnd 2 1.0 7.8678e-06 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecScatterBegin 6 1.0 9.3222e-05 1.1 0.00e+00 0.0 1.2e+01 4.8e+03 0.0e+00 0 0 86 92 0 0 0 86 92 0 0 *VecScatterEnd 6 1.0 1.9959e-011404.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0* VecNormalize 7 1.0 2.3088e-02 1.0 1.98e+08 1.0 0.0e+00 0.0e+00 7.0e+00 0 2 0 0 21 0 2 0 0 21 393 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. 
--- Event Stage 0: Main Stage Matrix 5 5 267571932 0 Krylov Solver 2 2 17224 0 Preconditioner 2 2 440 0 Index Set 5 5 10372120 0 Vec 26 26 53592184 0 Vec Scatter 1 1 0 0 ======================================================================================================================== Average time to get PetscTime(): 2.14577e-07 Average time for MPI_Barrier(): 8.10623e-07 Average time for zero size MPI_Send(): 1.43051e-06 OptionTable: -log_summary Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 Configure run at: Tue Jan 8 22:22:08 2008 Matthew Knepley wrote: > The convergence here is jsut horrendous. Have you tried using LU to check > your implementation? All the time is in the solve right now. I would first > try a direct method (at least on a small problem) and then try to understand > the convergence behavior. MUMPS can actually scale very well for big problems. > > Matt > > >>>>> >>>> >>> >>> >> > > > > From rlmackie862 at gmail.com Tue Apr 15 21:03:15 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Tue, 15 Apr 2008 19:03:15 -0700 Subject: general question on speed using quad core Xeons In-Reply-To: References: <48054602.9040200@gmail.com> <48054B25.5030702@gmail.com> Message-ID: <48055E63.5070606@gmail.com> Okay, but if I'm stuck with a big 3D finite difference code, written in PETSc using Distributed Arrays, with 3 dof per node, then you're saying there is really nothing I can do, except using blocking, to improve things on quad core cpus? They talk about blocking using BAIJ format, and so is this the same thing as creating MPIBAIJ matrices in PETSc? And is creating MPIBAIJ matrices in PETSc going to make a substantial difference in the speed? I'm sorry if I'm being dense, I'm just trying to understand if there is some simple way I can utilize those extra cores on each cpu easily, and since I'm not a computer scientist, some of these concepts are difficult. Thanks, Randy Matthew Knepley wrote: > On Tue, Apr 15, 2008 at 7:41 PM, Randall Mackie wrote: >> Then what's the point of having 4 and 8 cores per cpu for parallel >> computations then? I mean, I think I've done all I can to make >> my code as efficient as possible. > > I really advise reading the paper. It explicitly treats the case of > blocking, and uses > a simple model to demonstrate all the points I made. > > With a single, scalar sparse matrix, there is definitely no point at > all of having > multiple cores. However, this will speed up things like finite element > integration. > So, for instance, making this integration dominate your cost (like > spectral element > codes do) will show nice speedup. Ulrich Ruede has a great talk about this on > his website. > > Matt > >> I'm not quite sure I understand your comment about using blocks >> or unassembled structures. >> >> >> Randy >> >> >> >> >> Matthew Knepley wrote: >> >>> On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie >> wrote: >>>> I'm running my PETSc code on a cluster of quad core Xeon's connected >>>> by Infiniband. I hadn't much worried about the performance, because >>>> everything seemed to be working quite well, but today I was actually >>>> comparing performance (wall clock time) for the same problem, but on >>>> different combinations of CPUS. >>>> >>>> I find that my PETSc code is quite scalable until I start to use >>>> multiple cores/cpu. 
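A minimal sketch of the MPIBAIJ question raised here, assuming a 3D
distributed array with 3 dof per node as Randall describes, and using
2.3.x-era calls. The grid sizes, stencil choice, and routine name are
made up for illustration; they are not taken from his code.

/* Sketch only: ask PETSc for a blocked (BAIJ) matrix from a DA with
   3 dof per node instead of a scalar AIJ matrix. */
#include "petscmat.h"
#include "petscda.h"

PetscErrorCode create_da_baij(MPI_Comm comm, Mat *J)
{
  DA             da;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = DACreate3d(comm, DA_NONPERIODIC, DA_STENCIL_STAR,
                    100, 100, 100,                           /* global grid  */
                    PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                    3,                                       /* dof per node */
                    1,                                       /* stencil width*/
                    PETSC_NULL, PETSC_NULL, PETSC_NULL, &da);CHKERRQ(ierr);

  /* The block size 3 comes from the DA's dof; entries can then be set with
     MatSetValuesBlocked() so each node's 3x3 coupling is stored as one block. */
  ierr = DAGetMatrix(da, MATMPIBAIJ, J);CHKERRQ(ierr);

  ierr = DADestroy(da);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

How much the blocked format helps still depends on the kernels: it mainly
improves reuse in MatMult and MatSolve, which is the point Matt makes above.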
>>>> >>>> For example, the run time doesn't improve by going from 1 core/cpu >>>> to 4 cores/cpu, and I find this to be very strange, especially since >>>> looking at top or Ganglia, all 4 cpus on each node are running at 100% >>>> almost >>>> all of the time. I would have thought if the cpus were going all out, >>>> that I would still be getting much more scalable results. >>>> >>> Those a really coarse measures. There is absolutely no way that all cores >>> are going 100%. Its easy to show by hand. Take the peak flop rate and >>> this gives you the bandwidth needed to sustain that computation (if >>> everything is perfect, like axpy). You will find that the chip bandwidth >>> is far below this. A nice analysis is in >>> >>> http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf >>> >>> >>>> We are using mvapich-0.9.9 with infiniband. So, I don't know if >>>> this is a cluster/Xeon issue, or something else. >>>> >>> This is actually mathematics! How satisfying. The only way to improve >>> this is to change the data structure (e.g. use blocks) or change the >>> algorithm (e.g. use spectral elements and unassembled structures) >>> >>> Matt >>> >>> >>>> Anybody with experience on this? >>>> >>>> Thanks, Randy M. >>>> >>>> >>>> >>> >>> >>> >> > > > From zonexo at gmail.com Tue Apr 15 21:08:45 2008 From: zonexo at gmail.com (Ben Tay) Date: Wed, 16 Apr 2008 10:08:45 +0800 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> Message-ID: <48055FAD.3000105@gmail.com> An HTML attachment was scrubbed... URL: From knepley at gmail.com Tue Apr 15 21:20:02 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 21:20:02 -0500 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <48055FAD.3000105@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> Message-ID: On Tue, Apr 15, 2008 at 9:08 PM, Ben Tay wrote: > > Hi, > > I just tested the ex2f.F example, changing m and n to 600. Here's the > result for 1, 2 and 4 processors. Interestingly, MatAssemblyBegin, > MatGetOrdering and KSPSetup have ratios >>1. The time taken seems to be > faster as the processor increases, although speedup is not 1:1. I thought > that this example should scale well, shouldn't it? Is there something wrong > with my installation then? 1) Notice that the events that are unbalanced take 0.01% of the time. Not important. 2) The speedup really stinks. Even though this is a small problem. Are you sure that you are actually running on two processors with separate memory pipes and not on 1 dual core? Matt > Thank you. > > 1 processor: > > Norm of error 0.3371E+01 iterations 1153 > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. 
Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c58 with 1 processor, by g0306332 Wed > Apr 16 10:03:12 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.222e+02 1.00000 1.222e+02 > Objects: 4.400e+01 1.00000 4.400e+01 > Flops: 3.547e+10 1.00000 3.547e+10 3.547e+10 > Flops/sec: 2.903e+08 1.00000 2.903e+08 2.903e+08 > MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Reductions: 2.349e+03 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N > --> 2N flops > and VecAXPY() for complex vectors of length N > --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 1.2216e+02 100.0% 3.5466e+10 100.0% 0.000e+00 0.0% > 0.000e+00 0.0% 2.349e+03 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths > in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. 
# > ########################################################## > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 13 11 0 0 0 13 11 0 0 0 239 > MatSolve 1192 1.0 3.1017e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 124 > MatLUFactorNum 1 1.0 3.6166e-02 1.0 8.94e+07 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 89 > MatILUFactorSym 1 1.0 1.9690e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 2.6258e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 5.4259e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 1153 1.0 3.2664e+01 1.0 3.92e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 27 36 0 0 49 27 36 0 0 49 392 > VecNorm 1193 1.0 2.0344e+00 1.0 4.22e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 2 2 0 0 51 2 2 0 0 51 422 > VecScale 1192 1.0 6.9107e-01 1.0 6.21e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 1 0 0 0 1 1 0 0 0 621 > VecCopy 39 1.0 3.4571e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 41 1.0 1.1397e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 78 1.0 6.9354e-01 1.0 8.10e+07 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 81 > VecMAXPY 1192 1.0 3.7492e+01 1.0 3.63e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 31 38 0 0 0 31 38 0 0 0 363 > VecNormalize 1192 1.0 2.7284e+00 1.0 4.72e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 2 4 0 0 51 2 4 0 0 51 472 > KSPGMRESOrthog 1153 1.0 6.7939e+01 1.0 3.76e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 56 72 0 0 49 56 72 0 0 49 376 > KSPSetup 1 1.0 1.1651e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00 > 2.3e+03100100 0 0100 100100 0 0100 292 > PCSetUp 1 1.0 2.3852e-01 1.0 1.36e+07 1.0 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 14 > PCApply 1192 1.0 3.1021e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 124 > > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. 
> > --- Event Stage 0: Main Stage > > Matrix 2 2 54691212 0 > Index Set 3 3 4321032 0 > Vec 37 37 103708408 0 > Krylov Solver 1 1 17216 0 > Preconditioner 1 1 168 0 > > ======================================================================================================================== > Average time to get PetscTime(): 1.90735e-07 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 > --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 > --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 > --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 > --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 > --with-batch=1 --with-mpi-shared=0 > --with-mpi-include=/usr/local/topspin/mpi/mpich/include > --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a > --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun > --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 > ----------------------------------------- > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 > 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 > Using PETSc arch: atlas3-mpi > ----------------------------------------- > 85.53user 1.22system 2:02.65elapsed 70%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (16major+46429minor)pagefaults 0swaps > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > > 2 processors: > > Norm of error 0.3231E+01 iterations 1177 > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. 
Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c58 with 2 processors, by g0306332 Wed > Apr 16 09:48:37 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.034e+02 1.00000 1.034e+02 > Objects: 5.500e+01 1.00000 5.500e+01 > Flops: 1.812e+10 1.00000 1.812e+10 3.625e+10 > Flops/sec: 1.752e+08 1.00000 1.752e+08 3.504e+08 > MPI Messages: 1.218e+03 1.00000 1.218e+03 2.436e+03 > MPI Message Lengths: 5.844e+06 1.00000 4.798e+03 1.169e+07 > MPI Reductions: 1.204e+03 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N > --> 2N flops > and VecAXPY() for complex vectors of length N > --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 1.0344e+02 100.0% 3.6250e+10 100.0% 2.436e+03 100.0% > 4.798e+03 100.0% 2.407e+03 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths > in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. 
# > ########################################################## > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03 > 0.0e+00 11 11100100 0 11 11100100 0 315 > MatSolve 1217 1.0 2.1088e+01 1.2 1.10e+08 1.2 0.0e+00 0.0e+00 > 0.0e+00 19 11 0 0 0 19 11 0 0 0 187 > MatLUFactorNum 1 1.0 8.2862e-02 2.9 5.58e+07 2.9 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 39 > MatILUFactorSym 1 1.0 3.3310e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 1.5567e-011854.8 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 1.0352e-01 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 3.0994e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 5.0953e-0210.7 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 1177 1.0 4.0427e+01 1.1 1.85e+08 1.1 0.0e+00 0.0e+00 > 1.2e+03 37 36 0 0 49 37 36 0 0 49 323 > VecNorm 1218 1.0 1.5475e+01 1.9 5.25e+07 1.9 0.0e+00 0.0e+00 > 1.2e+03 12 2 0 0 51 12 2 0 0 51 57 > VecScale 1217 1.0 5.7866e-01 1.0 3.97e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 1 0 0 0 1 1 0 0 0 757 > VecCopy 40 1.0 6.6697e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 1259 1.0 1.5276e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecAXPY 80 1.0 2.1163e-01 2.4 3.21e+08 2.4 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 272 > VecMAXPY 1217 1.0 2.2980e+01 1.4 4.28e+08 1.4 0.0e+00 0.0e+00 > 0.0e+00 19 38 0 0 0 19 38 0 0 0 606 > VecScatterBegin 1217 1.0 3.6620e-02 1.4 0.00e+00 0.0 2.4e+03 4.8e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0 > VecScatterEnd 1217 1.0 8.1980e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecNormalize 1217 1.0 1.6030e+01 1.8 7.36e+07 1.8 0.0e+00 0.0e+00 > 1.2e+03 12 4 0 0 51 12 4 0 0 51 82 > KSPGMRESOrthog 1177 1.0 5.7248e+01 1.0 2.35e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 55 72 0 0 49 55 72 0 0 49 457 > KSPSetup 2 1.0 1.0363e-0110.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03 > 2.4e+03 99100100100100 99100100100100 352 > PCSetUp 2 1.0 1.5685e-01 2.3 2.40e+07 2.3 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 21 > PCSetUpOnBlocks 1 1.0 1.5668e-01 2.3 2.41e+07 2.3 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 21 > PCApply 1217 1.0 2.2625e+01 1.2 1.02e+08 1.2 0.0e+00 0.0e+00 > 0.0e+00 20 11 0 0 0 20 11 0 0 0 174 > > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. 
> > --- Event Stage 0: Main Stage > > Matrix 4 4 34540820 0 > Index Set 5 5 2164120 0 > Vec 41 41 53315992 0 > Vec Scatter 1 1 0 0 > Krylov Solver 2 2 17216 0 > Preconditioner 2 2 256 0 > > ======================================================================================================================== > Average time to get PetscTime(): 1.90735e-07 > Average time for MPI_Barrier(): 8.10623e-07 > Average time for zero size MPI_Send(): 2.98023e-06 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > > 42.64user 0.28system 1:08.08elapsed 63%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (18major+28609minor)pagefaults 0swaps > 1:08.08elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k > 0inputs+0outputs (18major+23666minor)pagefaults 0swaps > > > 4 processors: > > Norm of error 0.3090E+01 iterations 937 > 63.17user 0.05system 1:04.31elapsed 98%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (16major+13520minor)pagefaults 0swaps > 53.13user 0.06system 1:04.31elapsed 82%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (15major+13414minor)pagefaults 0swaps > 58.55user 0.23system 1:04.31elapsed 91%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (17major+18383minor)pagefaults 0swaps > 20.36user 0.67system 1:04.33elapsed 32%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (14major+18392minor)pagefaults 0swaps > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > > > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c45 with 4 processors, by g0306332 Wed > Apr 16 09:55:16 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 6.374e+01 1.00001 6.374e+01 > Objects: 5.500e+01 1.00000 5.500e+01 > Flops: 7.209e+09 1.00016 7.208e+09 2.883e+10 > Flops/sec: 1.131e+08 1.00017 1.131e+08 4.524e+08 > MPI Messages: 1.940e+03 2.00000 1.455e+03 5.820e+03 > MPI Message Lengths: 9.307e+06 2.00000 4.798e+03 2.792e+07 > MPI Reductions: 4.798e+02 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N > --> 2N flops > and VecAXPY() for complex vectors of length N > --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 6.3737e+01 100.0% 2.8832e+10 100.0% 5.820e+03 100.0% > 4.798e+03 100.0% 1.919e+03 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. 
> Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths > in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. # > ########################################################## > > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03 > 0.0e+00 8 11100100 0 8 11100100 0 321 > MatSolve 969 1.0 1.4244e+01 3.3 1.79e+08 3.3 0.0e+00 0.0e+00 > 0.0e+00 11 11 0 0 0 11 11 0 0 0 220 > MatLUFactorNum 1 1.0 5.2070e-02 6.2 9.63e+07 6.2 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 62 > MatILUFactorSym 1 1.0 1.7911e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 2.1741e-01164.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 3.5663e-02 1.0 0.00e+00 0.0 6.0e+00 2.4e+03 > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 2.1458e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 1.2779e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 937 1.0 3.5634e+01 2.1 1.52e+08 2.1 0.0e+00 0.0e+00 > 9.4e+02 48 36 0 0 49 48 36 0 0 49 292 > VecNorm 970 1.0 1.4387e+01 2.9 3.55e+07 2.9 0.0e+00 0.0e+00 > 9.7e+02 18 2 0 0 51 18 2 0 0 51 49 > VecScale 969 1.0 1.5714e-01 2.1 1.14e+09 2.1 0.0e+00 0.0e+00 > 0.0e+00 0 1 0 0 0 0 1 0 0 0 2220 > VecCopy 32 1.0 1.8988e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 1003 1.0 1.1690e+00 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecAXPY 64 1.0 2.1091e-02 1.1 6.07e+08 1.1 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 2185 > VecMAXPY 969 1.0 1.4823e+01 3.4 6.26e+08 3.4 0.0e+00 0.0e+00 > 0.0e+00 11 38 0 0 0 11 38 0 0 0 747 > VecScatterBegin 969 1.0 2.3238e-02 2.1 0.00e+00 0.0 5.8e+03 4.8e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0 > VecScatterEnd 969 1.0 1.4613e+0083.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecNormalize 969 1.0 1.4468e+01 2.8 5.15e+07 2.8 0.0e+00 0.0e+00 > 9.7e+02 18 4 0 0 50 18 4 0 0 50 72 > KSPGMRESOrthog 937 1.0 3.9924e+01 1.3 1.68e+08 1.3 0.0e+00 0.0e+00 > 9.4e+02 59 72 0 0 49 59 72 0 0 49 521 > KSPSetup 2 1.0 2.6190e-02 8.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03 > 1.9e+03 
98100100100 99 98100100100 99 461 > PCSetUp 2 1.0 7.1320e-02 4.1 4.59e+07 4.1 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 45 > PCSetUpOnBlocks 1 1.0 7.1230e-02 4.1 4.62e+07 4.1 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 45 > PCApply 969 1.0 1.5379e+01 3.3 1.66e+08 3.3 0.0e+00 0.0e+00 > 0.0e+00 12 11 0 0 0 12 11 0 0 0 203 > > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. > > --- Event Stage 0: Main Stage > > Matrix 4 4 17264420 0 > Index Set 5 5 1084120 0 > Vec 41 41 26675992 0 > Vec Scatter 1 1 0 0 > Krylov Solver 2 2 17216 0 > Preconditioner 2 2 256 0 > > ======================================================================================================================== > Average time to get PetscTime(): 1.90735e-07 > Average time for MPI_Barrier(): 6.00815e-06 > Average time for zero size MPI_Send(): 5.42402e-05 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > > > > Matthew Knepley wrote: > The convergence here is jsut horrendous. Have you tried using LU to check > your implementation? All the time is in the solve right now. I would first > try a direct method (at least on a small problem) and then try to understand > the convergence behavior. MUMPS can actually scale very well for big > problems. > > Matt > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From knepley at gmail.com Tue Apr 15 21:34:33 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 21:34:33 -0500 Subject: general question on speed using quad core Xeons In-Reply-To: <48055E63.5070606@gmail.com> References: <48054602.9040200@gmail.com> <48054B25.5030702@gmail.com> <48055E63.5070606@gmail.com> Message-ID: On Tue, Apr 15, 2008 at 9:03 PM, Randall Mackie wrote: > Okay, but if I'm stuck with a big 3D finite difference code, written in > PETSc > using Distributed Arrays, with 3 dof per node, then you're saying there is > really nothing I can do, except using blocking, to improve things on quad > core cpus? They talk about blocking using BAIJ format, and so is this the Yes, just about. > same thing as creating MPIBAIJ matrices in PETSc? And is creating MPIBAIJ Yes. > matrices in PETSc going to make a substantial difference in the speed? That is the hope. You can just give MPIBAIJ as the argument to DAGetMatrix(). > I'm sorry if I'm being dense, I'm just trying to understand if there is some > simple way I can utilize those extra cores on each cpu easily, and since > I'm not a computer scientist, some of these concepts are difficult. I really believe extra cores are currently a con for scientific computing. There are real mathematical barriers to their effective use. Matt > Thanks, Randy > Matthew Knepley wrote: > > > On Tue, Apr 15, 2008 at 7:41 PM, Randall Mackie > wrote: > > > > > Then what's the point of having 4 and 8 cores per cpu for parallel > > > computations then? I mean, I think I've done all I can to make > > > my code as efficient as possible. > > > > > > > I really advise reading the paper. 
It explicitly treats the case of > > blocking, and uses > > a simple model to demonstrate all the points I made. > > > > With a single, scalar sparse matrix, there is definitely no point at > > all of having > > multiple cores. However, this will speed up things like finite element > > integration. > > So, for instance, making this integration dominate your cost (like > > spectral element > > codes do) will show nice speedup. Ulrich Ruede has a great talk about this > on > > his website. > > > > Matt > > > > > > > I'm not quite sure I understand your comment about using blocks > > > or unassembled structures. > > > > > > > > > Randy > > > > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie > > > > > > > > wrote: > > > > > > > > > > > > I'm running my PETSc code on a cluster of quad core Xeon's connected > > > > > by Infiniband. I hadn't much worried about the performance, because > > > > > everything seemed to be working quite well, but today I was > actually > > > > > comparing performance (wall clock time) for the same problem, but > on > > > > > different combinations of CPUS. > > > > > > > > > > I find that my PETSc code is quite scalable until I start to use > > > > > multiple cores/cpu. > > > > > > > > > > For example, the run time doesn't improve by going from 1 core/cpu > > > > > to 4 cores/cpu, and I find this to be very strange, especially > since > > > > > looking at top or Ganglia, all 4 cpus on each node are running at > 100% > > > > > almost > > > > > all of the time. I would have thought if the cpus were going all > out, > > > > > that I would still be getting much more scalable results. > > > > > > > > > > > > > > Those a really coarse measures. There is absolutely no way that all > cores > > > > are going 100%. Its easy to show by hand. Take the peak flop rate and > > > > this gives you the bandwidth needed to sustain that computation (if > > > > everything is perfect, like axpy). You will find that the chip > bandwidth > > > > is far below this. A nice analysis is in > > > > > > > > http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf > > > > > > > > > > > > > > > > > We are using mvapich-0.9.9 with infiniband. So, I don't know if > > > > > this is a cluster/Xeon issue, or something else. > > > > > > > > > > > > > > This is actually mathematics! How satisfying. The only way to improve > > > > this is to change the data structure (e.g. use blocks) or change the > > > > algorithm (e.g. use spectral elements and unassembled structures) > > > > > > > > Matt > > > > > > > > > > > > > > > > > Anybody with experience on this? > > > > > > > > > > Thanks, Randy M. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From zonexo at gmail.com Tue Apr 15 22:01:28 2008 From: zonexo at gmail.com (Ben Tay) Date: Wed, 16 Apr 2008 11:01:28 +0800 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> Message-ID: <48056C08.6030903@gmail.com> An HTML attachment was scrubbed... 
URL: From knepley at gmail.com Tue Apr 15 22:08:02 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 15 Apr 2008 22:08:02 -0500 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <48056C08.6030903@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> Message-ID: On Tue, Apr 15, 2008 at 10:01 PM, Ben Tay wrote: > > Hi Matthew, > > You mention that the unbalanced events take 0.01% of the time and speedup > is terrible. Where did you get this information? Are you referring to Global 1) Look at the time of the events you point out (1.0e-2s) and the total time or time for KSPSolve(1.0e2) 2) Look at the time for KSPSolve on 1 and 2 procs > %T? As for the speedup, do you look at the time reported by the "time" > command ie 63.17user 0.05system 1:04.31elapsed 98%CPU (0avgtext+0avgdata > 0maxresident)? > > I think you may be right. My school uses : > > The Supercomputing & Visualisation Unit, Computer Centre is pleased to > announce the addition of a new cluster of Linux-based compute servers, > consisting of a total of 64 servers (60 dual-core and 4 quad-core systems). > Each of the compute nodes in the cluster is equipped with the following > configurations: > > No of Nodes Processors Qty per node Total cores per node Memory per node > 4 Quad-Core Intel Xeon X5355 2 8 16 GB > 60 Dual-Core Intel Xeon 5160 2 4 8 GB > When I run on 2 processors, it states I'm running on 2*atlas3-c45. So does > it mean I running on shared memory bandwidth? So does it mean if I run on 4 > processors, is it equivalent to using 2 memory pipes? > > I also got a reply from my school's engineer: > > For queue mcore_parallel, LSF will assign the compute nodes automatically. > To most of applications, running with 2*atlas3-c45 and 2*atlas3-c50 may be > faster. However, it is not sure if 2*atlas3-c45 means to run the job within > one CPU on dual core, or with two CPUs on two separate cores. This is not > controllable. > > So what can I do on my side to ensure speedup? I hope I do not have to > switch from PETSc to other solvers. Switching solvers will do you no good at all. The easiest thing to do is get these guys to improve the scheduler. Every half decent scheduler can assure that you get separate processors. There is no excuse for forcing you into dual cores. Matt > Thanks lot! > > > > Matthew Knepley wrote: > On Tue, Apr 15, 2008 at 9:08 PM, Ben Tay wrote: > > > Hi, > > I just tested the ex2f.F example, changing m and n to 600. Here's the > result for 1, 2 and 4 processors. Interestingly, MatAssemblyBegin, > MatGetOrdering and KSPSetup have ratios >>1. The time taken seems to be > faster as the processor increases, although speedup is not 1:1. I thought > that this example should scale well, shouldn't it? Is there something wrong > with my installation then? > > 1) Notice that the events that are unbalanced take 0.01% of the time. > Not important. > > 2) The speedup really stinks. Even though this is a small problem. Are > you sure that > you are actually running on two processors with separate memory > pipes and not > on 1 dual core? > > Matt > > > > Thank you. > > 1 processor: > > Norm of error 0.3371E+01 iterations 1153 > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. 
Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c58 with 1 processor, by g0306332 Wed > Apr 16 10:03:12 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.222e+02 1.00000 1.222e+02 > Objects: 4.400e+01 1.00000 4.400e+01 > Flops: 3.547e+10 1.00000 3.547e+10 3.547e+10 > Flops/sec: 2.903e+08 1.00000 2.903e+08 2.903e+08 > MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Reductions: 2.349e+03 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N > --> 2N flops > and VecAXPY() for complex vectors of length N > --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 1.2216e+02 100.0% 3.5466e+10 100.0% 0.000e+00 0.0% > 0.000e+00 0.0% 2.349e+03 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths > in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. 
# > ########################################################## > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 13 11 0 0 0 13 11 0 0 0 239 > MatSolve 1192 1.0 3.1017e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 124 > MatLUFactorNum 1 1.0 3.6166e-02 1.0 8.94e+07 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 89 > MatILUFactorSym 1 1.0 1.9690e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 2.6258e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 5.4259e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 1153 1.0 3.2664e+01 1.0 3.92e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 27 36 0 0 49 27 36 0 0 49 392 > VecNorm 1193 1.0 2.0344e+00 1.0 4.22e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 2 2 0 0 51 2 2 0 0 51 422 > VecScale 1192 1.0 6.9107e-01 1.0 6.21e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 1 0 0 0 1 1 0 0 0 621 > VecCopy 39 1.0 3.4571e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 41 1.0 1.1397e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 78 1.0 6.9354e-01 1.0 8.10e+07 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 81 > VecMAXPY 1192 1.0 3.7492e+01 1.0 3.63e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 31 38 0 0 0 31 38 0 0 0 363 > VecNormalize 1192 1.0 2.7284e+00 1.0 4.72e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 2 4 0 0 51 2 4 0 0 51 472 > KSPGMRESOrthog 1153 1.0 6.7939e+01 1.0 3.76e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 56 72 0 0 49 56 72 0 0 49 376 > KSPSetup 1 1.0 1.1651e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00 > 2.3e+03100100 0 0100 100100 0 0100 292 > PCSetUp 1 1.0 2.3852e-01 1.0 1.36e+07 1.0 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 14 > PCApply 1192 1.0 3.1021e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 124 > > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. 
> > --- Event Stage 0: Main Stage > > Matrix 2 2 54691212 0 > Index Set 3 3 4321032 0 > Vec 37 37 103708408 0 > Krylov Solver 1 1 17216 0 > Preconditioner 1 1 168 0 > > ======================================================================================================================== > Average time to get PetscTime(): 1.90735e-07 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 > --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 > --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 > --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 > --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 > --with-batch=1 --with-mpi-shared=0 > --with-mpi-include=/usr/local/topspin/mpi/mpich/include > --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a > --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun > --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 > ----------------------------------------- > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 > 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 > Using PETSc arch: atlas3-mpi > ----------------------------------------- > 85.53user 1.22system 2:02.65elapsed 70%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (16major+46429minor)pagefaults 0swaps > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > > 2 processors: > > Norm of error 0.3231E+01 iterations 1177 > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. 
Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c58 with 2 processors, by g0306332 Wed > Apr 16 09:48:37 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.034e+02 1.00000 1.034e+02 > Objects: 5.500e+01 1.00000 5.500e+01 > Flops: 1.812e+10 1.00000 1.812e+10 3.625e+10 > Flops/sec: 1.752e+08 1.00000 1.752e+08 3.504e+08 > MPI Messages: 1.218e+03 1.00000 1.218e+03 2.436e+03 > MPI Message Lengths: 5.844e+06 1.00000 4.798e+03 1.169e+07 > MPI Reductions: 1.204e+03 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N > --> 2N flops > and VecAXPY() for complex vectors of length N > --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 1.0344e+02 100.0% 3.6250e+10 100.0% 2.436e+03 100.0% > 4.798e+03 100.0% 2.407e+03 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths > in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. 
# > ########################################################## > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03 > 0.0e+00 11 11100100 0 11 11100100 0 315 > MatSolve 1217 1.0 2.1088e+01 1.2 1.10e+08 1.2 0.0e+00 0.0e+00 > 0.0e+00 19 11 0 0 0 19 11 0 0 0 187 > MatLUFactorNum 1 1.0 8.2862e-02 2.9 5.58e+07 2.9 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 39 > MatILUFactorSym 1 1.0 3.3310e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 1.5567e-011854.8 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 1.0352e-01 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 3.0994e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 5.0953e-0210.7 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 1177 1.0 4.0427e+01 1.1 1.85e+08 1.1 0.0e+00 0.0e+00 > 1.2e+03 37 36 0 0 49 37 36 0 0 49 323 > VecNorm 1218 1.0 1.5475e+01 1.9 5.25e+07 1.9 0.0e+00 0.0e+00 > 1.2e+03 12 2 0 0 51 12 2 0 0 51 57 > VecScale 1217 1.0 5.7866e-01 1.0 3.97e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 1 0 0 0 1 1 0 0 0 757 > VecCopy 40 1.0 6.6697e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 1259 1.0 1.5276e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecAXPY 80 1.0 2.1163e-01 2.4 3.21e+08 2.4 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 272 > VecMAXPY 1217 1.0 2.2980e+01 1.4 4.28e+08 1.4 0.0e+00 0.0e+00 > 0.0e+00 19 38 0 0 0 19 38 0 0 0 606 > VecScatterBegin 1217 1.0 3.6620e-02 1.4 0.00e+00 0.0 2.4e+03 4.8e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0 > VecScatterEnd 1217 1.0 8.1980e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecNormalize 1217 1.0 1.6030e+01 1.8 7.36e+07 1.8 0.0e+00 0.0e+00 > 1.2e+03 12 4 0 0 51 12 4 0 0 51 82 > KSPGMRESOrthog 1177 1.0 5.7248e+01 1.0 2.35e+08 1.0 0.0e+00 0.0e+00 > 1.2e+03 55 72 0 0 49 55 72 0 0 49 457 > KSPSetup 2 1.0 1.0363e-0110.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03 > 2.4e+03 99100100100100 99100100100100 352 > PCSetUp 2 1.0 1.5685e-01 2.3 2.40e+07 2.3 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 21 > PCSetUpOnBlocks 1 1.0 1.5668e-01 2.3 2.41e+07 2.3 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 21 > PCApply 1217 1.0 2.2625e+01 1.2 1.02e+08 1.2 0.0e+00 0.0e+00 > 0.0e+00 20 11 0 0 0 20 11 0 0 0 174 > > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. 
> > --- Event Stage 0: Main Stage > > Matrix 4 4 34540820 0 > Index Set 5 5 2164120 0 > Vec 41 41 53315992 0 > Vec Scatter 1 1 0 0 > Krylov Solver 2 2 17216 0 > Preconditioner 2 2 256 0 > > ======================================================================================================================== > Average time to get PetscTime(): 1.90735e-07 > Average time for MPI_Barrier(): 8.10623e-07 > Average time for zero size MPI_Send(): 2.98023e-06 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > > 42.64user 0.28system 1:08.08elapsed 63%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (18major+28609minor)pagefaults 0swaps > 1:08.08elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k > 0inputs+0outputs (18major+23666minor)pagefaults 0swaps > > > 4 processors: > > Norm of error 0.3090E+01 iterations 937 > 63.17user 0.05system 1:04.31elapsed 98%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (16major+13520minor)pagefaults 0swaps > 53.13user 0.06system 1:04.31elapsed 82%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (15major+13414minor)pagefaults 0swaps > 58.55user 0.23system 1:04.31elapsed 91%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (17major+18383minor)pagefaults 0swaps > 20.36user 0.67system 1:04.33elapsed 32%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (14major+18392minor)pagefaults 0swaps > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > > > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c45 with 4 processors, by g0306332 Wed > Apr 16 09:55:16 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 6.374e+01 1.00001 6.374e+01 > Objects: 5.500e+01 1.00000 5.500e+01 > Flops: 7.209e+09 1.00016 7.208e+09 2.883e+10 > Flops/sec: 1.131e+08 1.00017 1.131e+08 4.524e+08 > MPI Messages: 1.940e+03 2.00000 1.455e+03 5.820e+03 > MPI Message Lengths: 9.307e+06 2.00000 4.798e+03 2.792e+07 > MPI Reductions: 4.798e+02 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N > --> 2N flops > and VecAXPY() for complex vectors of length N > --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 6.3737e+01 100.0% 2.8832e+10 100.0% 5.820e+03 100.0% > 4.798e+03 100.0% 1.919e+03 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. 
> Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths > in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. # > ########################################################## > > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03 > 0.0e+00 8 11100100 0 8 11100100 0 321 > MatSolve 969 1.0 1.4244e+01 3.3 1.79e+08 3.3 0.0e+00 0.0e+00 > 0.0e+00 11 11 0 0 0 11 11 0 0 0 220 > MatLUFactorNum 1 1.0 5.2070e-02 6.2 9.63e+07 6.2 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 62 > MatILUFactorSym 1 1.0 1.7911e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 2.1741e-01164.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 3.5663e-02 1.0 0.00e+00 0.0 6.0e+00 2.4e+03 > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 2.1458e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 1.2779e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecMDot 937 1.0 3.5634e+01 2.1 1.52e+08 2.1 0.0e+00 0.0e+00 > 9.4e+02 48 36 0 0 49 48 36 0 0 49 292 > VecNorm 970 1.0 1.4387e+01 2.9 3.55e+07 2.9 0.0e+00 0.0e+00 > 9.7e+02 18 2 0 0 51 18 2 0 0 51 49 > VecScale 969 1.0 1.5714e-01 2.1 1.14e+09 2.1 0.0e+00 0.0e+00 > 0.0e+00 0 1 0 0 0 0 1 0 0 0 2220 > VecCopy 32 1.0 1.8988e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 1003 1.0 1.1690e+00 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecAXPY 64 1.0 2.1091e-02 1.1 6.07e+08 1.1 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 2185 > VecMAXPY 969 1.0 1.4823e+01 3.4 6.26e+08 3.4 0.0e+00 0.0e+00 > 0.0e+00 11 38 0 0 0 11 38 0 0 0 747 > VecScatterBegin 969 1.0 2.3238e-02 2.1 0.00e+00 0.0 5.8e+03 4.8e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0 > VecScatterEnd 969 1.0 1.4613e+0083.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecNormalize 969 1.0 1.4468e+01 2.8 5.15e+07 2.8 0.0e+00 0.0e+00 > 9.7e+02 18 4 0 0 50 18 4 0 0 50 72 > KSPGMRESOrthog 937 1.0 3.9924e+01 1.3 1.68e+08 1.3 0.0e+00 0.0e+00 > 9.4e+02 59 72 0 0 49 59 72 0 0 49 521 > KSPSetup 2 1.0 2.6190e-02 8.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03 > 1.9e+03 
98100100100 99 98100100100 99 461 > PCSetUp 2 1.0 7.1320e-02 4.1 4.59e+07 4.1 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 45 > PCSetUpOnBlocks 1 1.0 7.1230e-02 4.1 4.62e+07 4.1 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 45 > PCApply 969 1.0 1.5379e+01 3.3 1.66e+08 3.3 0.0e+00 0.0e+00 > 0.0e+00 12 11 0 0 0 12 11 0 0 0 203 > > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. > > --- Event Stage 0: Main Stage > > Matrix 4 4 17264420 0 > Index Set 5 5 1084120 0 > Vec 41 41 26675992 0 > Vec Scatter 1 1 0 0 > Krylov Solver 2 2 17216 0 > Preconditioner 2 2 256 0 > > ======================================================================================================================== > Average time to get PetscTime(): 1.90735e-07 > Average time for MPI_Barrier(): 6.00815e-06 > Average time for zero size MPI_Send(): 5.42402e-05 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > > > > Matthew Knepley wrote: > The convergence here is jsut horrendous. Have you tried using LU to check > your implementation? All the time is in the solve right now. I would first > try a direct method (at least on a small problem) and then try to understand > the convergence behavior. MUMPS can actually scale very well for big > problems. > > Matt > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From balay at mcs.anl.gov Tue Apr 15 22:45:25 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Tue, 15 Apr 2008 22:45:25 -0500 (CDT) Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <48056C08.6030903@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> Message-ID: On Wed, 16 Apr 2008, Ben Tay wrote: > I think you may be right. My school uses : > ? No of Nodes Processors Qty per node Total cores per node Memory per node ? > ? 4 Quad-Core Intel Xeon X5355 2 8 16 GB ? > ? 
60 Dual-Core Intel Xeon 5160 2 4 8 GB I've attempted to run the same ex2f on a 2x quad-core Intel Xeon X5355 machine [with gcc/ latest mpich2 with --with-device=ch3:nemesis:newtcp] - and I get the following: << Logs for my run are attached >> asterix:/home/balay/download-pine>grep MatMult * ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 397 ex2f-600-2p.log:MatMult 1217 1.0 6.2256e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 632 ex2f-600-4p.log:MatMult 969 1.0 4.3311e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 15 11100100 0 15 11100100 0 724 ex2f-600-8p.log:MatMult 1318 1.0 5.6966e+00 1.0 5.33e+08 1.0 1.8e+04 4.8e+03 0.0e+00 16 11100100 0 16 11100100 0 749 asterix:/home/balay/download-pine>grep KSPSolve * ex2f-600-1p.log:KSPSolve 1 1.0 6.9165e+01 1.0 3.55e+10 1.0 0.0e+00 0.0e+00 2.3e+03100100 0 0100 100100 0 0100 513 ex2f-600-2p.log:KSPSolve 1 1.0 4.4005e+01 1.0 1.81e+10 1.0 2.4e+03 4.8e+03 2.4e+03100100100100100 100100100100100 824 ex2f-600-4p.log:KSPSolve 1 1.0 2.8139e+01 1.0 7.21e+09 1.0 5.8e+03 4.8e+03 1.9e+03100100100100 99 100100100100 99 1024 ex2f-600-8p.log:KSPSolve 1 1.0 3.6260e+01 1.0 4.90e+09 1.0 1.8e+04 4.8e+03 2.6e+03100100100100100 100100100100100 1081 asterix:/home/balay/download-pine> You get the following [with intel compilers?]: asterix:/home/balay/download-pine/x>grep MatMult * log.1:MatMult???????????? 1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 11? 0? 0? 0? 13 11? 0? 0? 0?? 239 log.2:MatMult???????????? 1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03 0.0e+00 11 11100100? 0? 11 11100100? 0?? 315 log.4:MatMult????????????? 969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03 0.0e+00? 8 11100100? 0?? 8 11100100? 0?? 321 asterix:/home/balay/download-pine/x>grep KSPSolve * log.1:KSPSolve?????????????? 1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00 2.3e+03100100? 0? 0100 100100? 0? 0100?? 292 log.2:KSPSolve?????????????? 1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03 2.4e+03 99100100100100? 99100100100100?? 352 log.4:KSPSolve?????????????? 1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03 1.9e+03 98100100100 99? 98100100100 99?? 461 asterix:/home/balay/download-pine/x> What exact CPU was this run on? A couple of comments: - my runs for MatMult have 1.0 ratio for 2,4,8 proc runs, while yours have 1.2, 3.6 for 2,4 proc runs [so higher load imbalance on your machine] - The peaks are also lower - not sure why. 397 for 1p-MatMult for me - vs 239 for you - Speedups I see for MatMult are: np me you 2 1.59 1.32 4 1.82 1.34 8 1.88 -------------------------- The primary issue is - expecting speedup of 4, from 4-cores and 8 from 8-cores. As Matt indicated perhaps in "Subject: general question on speed using quad core Xeons" thread, for sparse linear algebra - the performance is limited by memory bandwidth - not CPU So one have to look at the hardware memory architecture of the machine if you expect scalability. The 2x quad-core has a memory architecture that gives 11GB/s if one CPU-socket is used, but 22GB/s when both CPUs-sockets are used [irrespective of the number of cores in each CPU socket]. One inference is - max of 2 speedup can be obtained from such machine [due to 2 memory bank architecture]. So if you have 2 such machines [i.e 4 memory banks] - then you can expect a theoretical max speedup of 4. We are generally used to evaluating performance/cpu [or core]. Here the scalability numbers suck. 
However if you do performance/number-of-memory-banks - then things look better. Its just that we are used to always expecting scalability per node and assume it translates to scalability per core. [however the scalability per node - was more about scalability per memory bank - before multicore cpus took over] There is also another measure - performance/dollar spent. Generally the extra cores are practically free - so here this measure also holds up ok. Satish -------------- next part -------------- ************************************************************************************************************************ *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document *** ************************************************************************************************************************ ---------------------------------------------- PETSc Performance Summary: ---------------------------------------------- ./ex2f on a linux-tes named intel-loaner1 with 1 processor, by balay Tue Apr 15 22:02:38 2008 Using Petsc Development Version 2.3.3, Patch 12, unknown HG revision: unknown Max Max/Min Avg Total Time (sec): 6.936e+01 1.00000 6.936e+01 Objects: 4.400e+01 1.00000 4.400e+01 Flops: 3.547e+10 1.00000 3.547e+10 3.547e+10 Flops/sec: 5.113e+08 1.00000 5.113e+08 5.113e+08 MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 MPI Reductions: 2.349e+03 1.00000 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract) e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions -- Avg %Total Avg %Total counts %Total Avg %Total counts %Total 0: Main Stage: 6.9359e+01 100.0% 3.5466e+10 100.0% 0.000e+00 0.0% 0.000e+00 0.0% 2.349e+03 100.0% ------------------------------------------------------------------------------------------------------------------------ See the 'Profiling' chapter of the users' manual for details on interpreting output. Phase summary info: Count: number of times phase was executed Time and Flops: Max - maximum over all processors Ratio - ratio of maximum to minimum over all processors Mess: number of messages sent Avg. len: average message length Reduct: number of global reductions Global: entire computation Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop(). 
%T - percent time in this phase %F - percent flops in this phase %M - percent messages in this phase %L - percent message lengths in this phase %R - percent reductions in this phase Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors) ------------------------------------------------------------------------------------------------------------------------ Event Count Time (sec) Flops --- Global --- --- Stage --- Total Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s ------------------------------------------------------------------------------------------------------------------------ --- Event Stage 0: Main Stage MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 397 MatSolve 1192 1.0 1.8658e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 11 0 0 0 27 11 0 0 0 207 MatLUFactorNum 1 1.0 4.1455e-02 1.0 3.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 78 MatILUFactorSym 1 1.0 2.9251e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyBegin 1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatAssemblyEnd 1 1.0 3.1618e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 MatGetOrdering 1 1.0 5.1751e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecMDot 1153 1.0 1.6326e+01 1.0 1.28e+10 1.0 0.0e+00 0.0e+00 1.2e+03 24 36 0 0 49 24 36 0 0 49 783 VecNorm 1193 1.0 5.0365e+00 1.0 8.59e+08 1.0 0.0e+00 0.0e+00 1.2e+03 7 2 0 0 51 7 2 0 0 51 171 VecScale 1192 1.0 5.4950e-01 1.0 4.29e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1 1 0 0 0 781 VecCopy 39 1.0 6.6555e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecSet 41 1.0 3.4185e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 VecAXPY 78 1.0 1.2492e-01 1.0 5.62e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 450 VecMAXPY 1192 1.0 1.8493e+01 1.0 1.36e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27 38 0 0 0 27 38 0 0 0 736 VecNormalize 1192 1.0 5.5843e+00 1.0 1.29e+09 1.0 0.0e+00 0.0e+00 1.2e+03 8 4 0 0 51 8 4 0 0 51 231 KSPGMRESOrthog 1153 1.0 3.3669e+01 1.0 2.56e+10 1.0 0.0e+00 0.0e+00 1.2e+03 49 72 0 0 49 49 72 0 0 49 760 KSPSetup 1 1.0 1.1875e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 KSPSolve 1 1.0 6.9165e+01 1.0 3.55e+10 1.0 0.0e+00 0.0e+00 2.3e+03100100 0 0100 100100 0 0100 513 PCSetUp 1 1.0 7.5919e-02 1.0 3.23e+06 1.0 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 43 PCApply 1192 1.0 1.8661e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 11 0 0 0 27 11 0 0 0 207 ------------------------------------------------------------------------------------------------------------------------ Memory usage is given in bytes: Object Type Creations Destructions Memory Descendants' Mem. 
--- Event Stage 0: Main Stage Matrix 2 2 54695580 0 Vec 37 37 106606176 0 Krylov Solver 1 1 18016 0 Preconditioner 1 1 720 0 Index Set 3 3 4321464 0 ======================================================================================================================== Average time to get PetscTime(): 9.53674e-08 OptionTable: -log_summary ex2f-600-1p.log OptionTable: -m 600 OptionTable: -n 600 Compiled without FORTRAN kernels Compiled with full precision matrices (default) sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 Configure run at: Tue Apr 15 21:39:17 2008 Configure options: --with-mpi-dir=/home/balay/mpich2-svn --with-debugging=0 --download-f-blas-lapack=1 PETSC_ARCH=linux-test --with-shared=0 ----------------------------------------- Libraries compiled on Tue Apr 15 21:45:29 CDT 2008 on intel-loaner1 Machine characteristics: Linux intel-loaner1 2.6.20-16-generic #2 SMP Tue Feb 12 02:11:24 UTC 2008 x86_64 GNU/Linux Using PETSc directory: /home/balay/petsc-dev Using PETSc arch: linux-test ----------------------------------------- Using C compiler: /home/balay/mpich2-svn/bin/mpicc -fPIC -O Using Fortran compiler: /home/balay/mpich2-svn/bin/mpif90 -I. -fPIC -O ----------------------------------------- Using include paths: -I/home/balay/petsc-dev -I/home/balay/petsc-dev/linux-test/include -I/home/balay/petsc-dev/include -I/home/balay/mpich2-svn/include -I. -I/home/balay/mpich2-svn/src/include -I/home/balay/mpich2-svn/src/binding/f90 ------------------------------------------ Using C linker: /home/balay/mpich2-svn/bin/mpicc -fPIC -O Using Fortran linker: /home/balay/mpich2-svn/bin/mpif90 -I. -fPIC -O Using libraries: -Wl,-rpath,/home/balay/petsc-dev/linux-test/lib -L/home/balay/petsc-dev/linux-test/lib -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -Wl,-rpath,/home/balay/petsc-dev/linux-test/lib -L/home/balay/petsc-dev/linux-test/lib -lflapack -Wl,-rpath,/home/balay/petsc-dev/linux-test/lib -L/home/balay/petsc-dev/linux-test/lib -lfblas -lm -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -Wl,-rpath,/lib/../lib64 -L/lib/../lib64 -Wl,-rpath,/usr/lib/../lib64 -L/usr/lib/../lib64 -ldl -lgcc_s -lgfortranbegin -lgfortran -lm -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -Wl,-rpath,/lib/../lib64 -Wl,-rpath,/usr/lib/../lib64 -lm -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2/../../../../lib64 -Wl,-rpath,/lib/../lib64 -L/lib/../lib64 -Wl,-rpath,/usr/lib/../lib64 -L/usr/lib/../lib64 -ldl -lgcc_s -ldl ------------------------------------------ -------------- next part -------------- A non-text attachment was scrubbed... Name: ex2f-600-2p.log Type: application/octet-stream Size: 9562 bytes Desc: URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ex2f-600-4p.log Type: application/octet-stream Size: 9563 bytes Desc: URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ex2f-600-8p.log Type: application/octet-stream Size: 9562 bytes Desc: URL: From zonexo at gmail.com Tue Apr 15 23:35:24 2008 From: zonexo at gmail.com (Ben Tay) Date: Wed, 16 Apr 2008 12:35:24 +0800 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> Message-ID: <4805820C.8030803@gmail.com> Hi Satish, thank you very much for helping me run the ex2f.F code. I think I've a clearer picture now. I believe I'm running on Dual-Core Intel Xeon 5160. The quad core is only on atlas3-01 to 04 and there's only 4 of them. I guess that the lower peak is because I'm using Xeon 5160, while you are using Xeon X5355. You mention about the speedups for MatMult and compare between KSPSolve. Are these the only things we have to look at? Because I see that some other event such as VecMAXPY also takes up a sizable % of the time. To get an accurate speedup, do I just compare the time taken by KSPSolve between different no. of processors or do I have to look at other events such as MatMult as well? In summary, due to load imbalance, my speedup is quite bad. So maybe I'll just send your results to my school's engineer and see if they could do anything. For my part, I guess I'll just 've to wait? Thank alot! Satish Balay wrote: > On Wed, 16 Apr 2008, Ben Tay wrote: > > >> I think you may be right. My school uses : >> > > >> No of Nodes Processors Qty per node Total cores per node Memory per node >> 4 Quad-Core Intel Xeon X5355 2 8 16 GB >> 60 Dual-Core Intel Xeon 5160 2 4 8 GB >> > > > I've attempted to run the same ex2f on a 2x quad-core Intel Xeon X5355 > machine [with gcc/ latest mpich2 with --with-device=ch3:nemesis:newtcp] - and I get the following: > > << Logs for my run are attached >> > > asterix:/home/balay/download-pine>grep MatMult * > ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 397 > ex2f-600-2p.log:MatMult 1217 1.0 6.2256e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 632 > ex2f-600-4p.log:MatMult 969 1.0 4.3311e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 15 11100100 0 15 11100100 0 724 > ex2f-600-8p.log:MatMult 1318 1.0 5.6966e+00 1.0 5.33e+08 1.0 1.8e+04 4.8e+03 0.0e+00 16 11100100 0 16 11100100 0 749 > asterix:/home/balay/download-pine>grep KSPSolve * > ex2f-600-1p.log:KSPSolve 1 1.0 6.9165e+01 1.0 3.55e+10 1.0 0.0e+00 0.0e+00 2.3e+03100100 0 0100 100100 0 0100 513 > ex2f-600-2p.log:KSPSolve 1 1.0 4.4005e+01 1.0 1.81e+10 1.0 2.4e+03 4.8e+03 2.4e+03100100100100100 100100100100100 824 > ex2f-600-4p.log:KSPSolve 1 1.0 2.8139e+01 1.0 7.21e+09 1.0 5.8e+03 4.8e+03 1.9e+03100100100100 99 100100100100 99 1024 > ex2f-600-8p.log:KSPSolve 1 1.0 3.6260e+01 1.0 4.90e+09 1.0 1.8e+04 4.8e+03 2.6e+03100100100100100 100100100100100 1081 > asterix:/home/balay/download-pine> > > > You get the following [with intel compilers?]: > > asterix:/home/balay/download-pine/x>grep MatMult * > log.1:MatMult 1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 11 0 0 0 13 11 0 0 0 239 > log.2:MatMult 1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03 0.0e+00 11 11100100 0 11 11100100 0 315 > log.4:MatMult 969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03 0.0e+00 8 11100100 0 8 11100100 0 321 > asterix:/home/balay/download-pine/x>grep KSPSolve 
* > log.1:KSPSolve 1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00 2.3e+03100100 0 0100 100100 0 0100 292 > log.2:KSPSolve 1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03 2.4e+03 99100100100100 99100100100100 352 > log.4:KSPSolve 1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03 1.9e+03 98100100100 99 98100100100 99 461 > asterix:/home/balay/download-pine/x> > > What exact CPU was this run on? > > A couple of comments: > - my runs for MatMult have 1.0 ratio for 2,4,8 proc runs, while yours have 1.2, 3.6 for 2,4 proc runs [so higher > load imbalance on your machine] > - The peaks are also lower - not sure why. 397 for 1p-MatMult for me - vs 239 for you > - Speedups I see for MatMult are: > > np me you > > 2 1.59 1.32 > 4 1.82 1.34 > 8 1.88 > > -------------------------- > > The primary issue is - expecting speedup of 4, from 4-cores and 8 from 8-cores. > > As Matt indicated perhaps in "Subject: general question on speed using quad core Xeons" thread, > for sparse linear algebra - the performance is limited by memory bandwidth - not CPU > > So one have to look at the hardware memory architecture of the machine > if you expect scalability. > > The 2x quad-core has a memory architecture that gives 11GB/s if one > CPU-socket is used, but 22GB/s when both CPUs-sockets are used > [irrespective of the number of cores in each CPU socket]. One > inference is - max of 2 speedup can be obtained from such machine [due > to 2 memory bank architecture]. > > So if you have 2 such machines [i.e 4 memory banks] - then you can > expect a theoretical max speedup of 4. > > We are generally used to evaluating performance/cpu [or core]. Here > the scalability numbers suck. > > However if you do performance/number-of-memory-banks - then things look better. > > Its just that we are used to always expecting scalability per node and > assume it translates to scalability per core. [however the scalability > per node - was more about scalability per memory bank - before > multicore cpus took over] > > > There is also another measure - performance/dollar spent. Generally > the extra cores are practically free - so here this measure also holds > up ok. > > Satish From balay at mcs.anl.gov Wed Apr 16 00:25:45 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 16 Apr 2008 00:25:45 -0500 (CDT) Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <4805820C.8030803@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> Message-ID: On Wed, 16 Apr 2008, Ben Tay wrote: > Hi Satish, thank you very much for helping me run the ex2f.F code. > > I think I've a clearer picture now. I believe I'm running on Dual-Core Intel > Xeon 5160. The quad core is only on atlas3-01 to 04 and there's only 4 of > them. I guess that the lower peak is because I'm using Xeon 5160, while you > are using Xeon X5355. I'm still a bit puzzled. 
I just ran the same binary on a 2 dualcore xeon 5130 machine [which should be similar to your 5160 machine] and get the following: [balay at n001 ~]$ grep MatMult log* log.1:MatMult 1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 364 log.2:MatMult 1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 615 log.4:MatMult 969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 656 [balay at n001 ~]$ > You mention about the speedups for MatMult and compare between KSPSolve. Are > these the only things we have to look at? Because I see that some other event > such as VecMAXPY also takes up a sizable % of the time. To get an accurate > speedup, do I just compare the time taken by KSPSolve between different no. of > processors or do I have to look at other events such as MatMult as well? Sometimes we look at individual components like MatMult() VecMAXPY() to understand whats hapenning in each stage - and at KSPSolve() to look at the agregate performance for the whole solve [which includes MatMult VecMAXPY etc..]. Perhaps I should have also looked at VecMDot() aswell - at 48% of runtime - its the biggest contributor to KSPSolve() for your run. Its easy to get lost in the details of log_summary. Looking for anamolies is one thing. Plotting scalability charts for the solver is something else.. > In summary, due to load imbalance, my speedup is quite bad. So maybe I'll just > send your results to my school's engineer and see if they could do anything. > For my part, I guess I'll just 've to wait? Yes - load imbalance at MatMult level is bad. On 4 proc run you have ratio = 3.6 . This implies - there is one of the mpi-tasks is 3.6 times slower than the other task [so all speedup is lost here] You could try the latest mpich2 [1.0.7] - just for this SMP experiment, and see if it makes a difference. I've built mpich2 with [default gcc/gfortran and]: ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker There could be something else going on on this machine thats messing up load-balance for basic petsc example.. Satish From bsmith at mcs.anl.gov Wed Apr 16 07:14:37 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 16 Apr 2008 07:14:37 -0500 Subject: general question on speed using quad core Xeons In-Reply-To: <48054602.9040200@gmail.com> References: <48054602.9040200@gmail.com> Message-ID: <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> Randy, Please see http://www-unix.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers Essentially what has happened is that chip hardware designers (Intel, IBM, AMD) hit a wall on how high they can make their clock speed. They then needed some other way to try to increase the "performance" of their chips; since they could continue to make smaller circuits they came up on putting multiple cores on a single chip, then they can "double" or "quad" the claimed performance very easily. Unfortunately the whole multicore "solution" is really half-assed since it is difficult to effectively use all the cores, especially since the memory bandwidth did not improve as fast. Now when a company comes out with a half-assed product, do they say, "this is a half-assed product"? Did Microsoft say Vista was "half-assed". No, they emphasis the positive parts of their product and hide the limitations. This has been true since Grog made his first stone wheel in front of this cave. So Intel mislead everyone on how great multi-cores are. 
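(A quick way to see this bandwidth wall on a particular node is a minimal STREAM-triad-style loop. The sketch below is plain C, independent of PETSc and of anything in this thread; the array size and the gettimeofday timer are arbitrary choices. Compile it with optimization, run one copy per core at the same time, and add up the reported rates; the point where the total stops growing is the memory-bandwidth limit being described here.)

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  static double wtime(void)
  {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
  }

  int main(void)
  {
    const long n = 20000000;        /* 3 arrays of 160 MB each: big enough to defeat the caches */
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));
    double t;
    long   i;

    if (!a || !b || !c) return 1;
    for (i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    t = wtime();
    for (i = 0; i < n; i++) a[i] = b[i] + 3.0 * c[i];   /* STREAM "triad" */
    t = wtime() - t;

    /* about 3 doubles per iteration (read b, read c, write a) move through memory */
    printf("triad: %.2f GB/s (check %g)\n", 3.0 * n * sizeof(double) / t / 1.0e9, a[n / 2]);
    free(a); free(b); free(c);
    return 0;
  }

(One copy of this loop per core, launched together, is usually enough to see the aggregate rate flatten out well before all cores are busy on front-side-bus Xeons.)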
When you buy earlier dual or quad products you are NOT gettting a parallel system (even though it has 2 cores) because the memory is NOT parallel. Things are getting a bit better, Intel now has systems with higher memory bandwidth. The thing you have to look for is MEMORY BANDWDITH PER CORE, the higher that is the better performance you get. Note this doesn't have anything to do with PETSc, any sparse solver has the exact same issues. Barry On Apr 15, 2008, at 7:19 PM, Randall Mackie wrote: > I'm running my PETSc code on a cluster of quad core Xeon's connected > by Infiniband. I hadn't much worried about the performance, because > everything seemed to be working quite well, but today I was actually > comparing performance (wall clock time) for the same problem, but on > different combinations of CPUS. > > I find that my PETSc code is quite scalable until I start to use > multiple cores/cpu. > > For example, the run time doesn't improve by going from 1 core/cpu > to 4 cores/cpu, and I find this to be very strange, especially since > looking at top or Ganglia, all 4 cpus on each node are running at > 100% almost > all of the time. I would have thought if the cpus were going all out, > that I would still be getting much more scalable results. > > We are using mvapich-0.9.9 with infiniband. So, I don't know if > this is a cluster/Xeon issue, or something else. > > Anybody with experience on this? > > Thanks, Randy M. > From pivello at gmail.com Wed Apr 16 06:51:05 2008 From: pivello at gmail.com (=?ISO-8859-1?Q?M=E1rcio_Ricardo_Pivello?=) Date: Wed, 16 Apr 2008 08:51:05 -0300 Subject: PETSc + HYPRE In-Reply-To: References: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> <7d6158b80804151146w67635185v75088d41724a6fb3@mail.gmail.com> Message-ID: <7d6158b80804160451w2b964b32m6401e655e0ec6a4@mail.gmail.com> Hi, Lisandro. I must apologize for not answering. I read your email and changed my code, but then I went through a different path. I'm trying to call the preconditioner from the command line, without any mention to it in the source code. It should take a couple of hours to get some results and then, if it doesn't work, I'll change the code. Thank you very much. M?rcio Ricardo On Tue, Apr 15, 2008 at 8:43 PM, Lisandro Dalcin wrote: > Sorry for my insistence, but... Did you see my previous mail? The code > you wrote is not OK. You have to first create the KSP, next extract > the PC with KSPGetPC, and then configure the PC to use HYPRE+BoomerAMG > > To be sure you are actually being using hypre, add -ksp_view to command > line. > > > On 4/15/08, M?rcio Ricardo Pivello wrote: > > Hy, Matthew, thanks for your help. > > > > Firstly, I'm solving a 3D Incompressible Navier Stokes solver based on > FEM, > > with fluid-structure interaction. In this case, I'm simulating the blood > > flow inside an aneurysm in an abdominal aorta artery. > > By not working I mean the error does not decrease with time. Our team > is > > just starting using HYPRE, in fact this is the very first case we run > with > > it. > > > > > > Again, thanks for your help. > > > > > > M?rcio Ricardo. > > > > > > > > > -- > Lisandro Dalc?n > --------------- > Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) > Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > Tel/Fax: +54-(0)342-451.1594 > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From zonexo at gmail.com Wed Apr 16 08:44:15 2008 From: zonexo at gmail.com (Ben Tay) Date: Wed, 16 Apr 2008 21:44:15 +0800 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> Message-ID: <480602AF.5060802@gmail.com> Hi, Am I right to say that despite all the hype about multi-core processors, they can't speed up solving of linear eqns? It's not possible to get a 2x speedup when using 2 cores. And is this true for all types of linear equation solver besides PETSc? What about parallel direct solvers (e.g. MUMPS) or those which uses openmp instead of mpich? Well, I just can't help feeling disappointed if that's the case... Also, with a smart enough LSF scheduler, I will be assured of getting separate processors ie 1 core from each different processor instead of 2-4 cores from just 1 processor. In that case, if I use 1 core from processor A and 1 core from processor B, I should be able to get a decent speedup of more than 1, is that so? This option is also better than using 2 or even 4 cores from the same processor. Thank you very much. Satish Balay wrote: > On Wed, 16 Apr 2008, Ben Tay wrote: > > >> Hi Satish, thank you very much for helping me run the ex2f.F code. >> >> I think I've a clearer picture now. I believe I'm running on Dual-Core Intel >> Xeon 5160. The quad core is only on atlas3-01 to 04 and there's only 4 of >> them. I guess that the lower peak is because I'm using Xeon 5160, while you >> are using Xeon X5355. >> > > I'm still a bit puzzled. I just ran the same binary on a 2 dualcore > xeon 5130 machine [which should be similar to your 5160 machine] and > get the following: > > [balay at n001 ~]$ grep MatMult log* > log.1:MatMult 1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 364 > log.2:MatMult 1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 615 > log.4:MatMult 969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 656 > [balay at n001 ~]$ > > >> You mention about the speedups for MatMult and compare between KSPSolve. Are >> these the only things we have to look at? Because I see that some other event >> such as VecMAXPY also takes up a sizable % of the time. To get an accurate >> speedup, do I just compare the time taken by KSPSolve between different no. of >> processors or do I have to look at other events such as MatMult as well? >> > > Sometimes we look at individual components like MatMult() VecMAXPY() > to understand whats hapenning in each stage - and at KSPSolve() to > look at the agregate performance for the whole solve [which includes > MatMult VecMAXPY etc..]. Perhaps I should have also looked at > VecMDot() aswell - at 48% of runtime - its the biggest contributor to > KSPSolve() for your run. > > Its easy to get lost in the details of log_summary. Looking for > anamolies is one thing. Plotting scalability charts for the solver is > something else.. > > >> In summary, due to load imbalance, my speedup is quite bad. So maybe I'll just >> send your results to my school's engineer and see if they could do anything. >> For my part, I guess I'll just 've to wait? >> > > Yes - load imbalance at MatMult level is bad. 
On 4 proc run you have > ratio = 3.6 . This implies - there is one of the mpi-tasks is 3.6 > times slower than the other task [so all speedup is lost here] > > You could try the latest mpich2 [1.0.7] - just for this SMP > experiment, and see if it makes a difference. I've built mpich2 with > [default gcc/gfortran and]: > > ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker > > There could be something else going on on this machine thats messing > up load-balance for basic petsc example.. > > Satish > > > From knepley at gmail.com Wed Apr 16 08:48:37 2008 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 16 Apr 2008 08:48:37 -0500 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <480602AF.5060802@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> <480602AF.5060802@gmail.com> Message-ID: On Wed, Apr 16, 2008 at 8:44 AM, Ben Tay wrote: > Hi, > > Am I right to say that despite all the hype about multi-core processors, > they can't speed up solving of linear eqns? It's not possible to get a 2x > speedup when using 2 cores. And is this true for all types of linear > equation solver besides PETSc? What about parallel direct solvers (e.g. > MUMPS) or those which uses openmp instead of mpich? Well, I just can't help > feeling disappointed if that's the case... Notice that Satish got much much better scaling than you did on our box here. I think something is really wrong either with the installation of MPI on that box or something hardware-wise. Matt > Also, with a smart enough LSF scheduler, I will be assured of getting > separate processors ie 1 core from each different processor instead of 2-4 > cores from just 1 processor. In that case, if I use 1 core from processor A > and 1 core from processor B, I should be able to get a decent speedup of > more than 1, is that so? This option is also better than using 2 or even 4 > cores from the same processor. > > Thank you very much. > > Satish Balay wrote: > > > On Wed, 16 Apr 2008, Ben Tay wrote: > > > > > > > > > Hi Satish, thank you very much for helping me run the ex2f.F code. > > > > > > I think I've a clearer picture now. I believe I'm running on Dual-Core > Intel > > > Xeon 5160. The quad core is only on atlas3-01 to 04 and there's only 4 > of > > > them. I guess that the lower peak is because I'm using Xeon 5160, while > you > > > are using Xeon X5355. > > > > > > > > > > I'm still a bit puzzled. I just ran the same binary on a 2 dualcore > > xeon 5130 machine [which should be similar to your 5160 machine] and > > get the following: > > > > [balay at n001 ~]$ grep MatMult log* > > log.1:MatMult 1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e+00 > 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 364 > > log.2:MatMult 1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e+03 > 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 615 > > log.4:MatMult 969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e+03 > 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 656 > > [balay at n001 ~]$ > > > > > > > You mention about the speedups for MatMult and compare between KSPSolve. > Are > > > these the only things we have to look at? Because I see that some other > event > > > such as VecMAXPY also takes up a sizable % of the time. To get an > accurate > > > speedup, do I just compare the time taken by KSPSolve between different > no. 
of > > > processors or do I have to look at other events such as MatMult as well? > > > > > > > > > > Sometimes we look at individual components like MatMult() VecMAXPY() > > to understand whats hapenning in each stage - and at KSPSolve() to > > look at the agregate performance for the whole solve [which includes > > MatMult VecMAXPY etc..]. Perhaps I should have also looked at > > VecMDot() aswell - at 48% of runtime - its the biggest contributor to > > KSPSolve() for your run. > > > > Its easy to get lost in the details of log_summary. Looking for > > anamolies is one thing. Plotting scalability charts for the solver is > > something else.. > > > > > > > > > In summary, due to load imbalance, my speedup is quite bad. So maybe > I'll just > > > send your results to my school's engineer and see if they could do > anything. > > > For my part, I guess I'll just 've to wait? > > > > > > > > > > Yes - load imbalance at MatMult level is bad. On 4 proc run you have > > ratio = 3.6 . This implies - there is one of the mpi-tasks is 3.6 > > times slower than the other task [so all speedup is lost here] > > > > You could try the latest mpich2 [1.0.7] - just for this SMP > > experiment, and see if it makes a difference. I've built mpich2 with > > [default gcc/gfortran and]: > > > > ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker > > > > There could be something else going on on this machine thats messing > > up load-balance for basic petsc example.. > > > > Satish > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From rlmackie862 at gmail.com Wed Apr 16 09:13:26 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Wed, 16 Apr 2008 07:13:26 -0700 Subject: general question on speed using quad core Xeons In-Reply-To: <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> Message-ID: <48060986.8050102@gmail.com> Thanks Barry - very informative, and gave me a chuckle :-) Randy Barry Smith wrote: > > Randy, > > Please see > http://www-unix.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers > > Essentially what has happened is that chip hardware designers > (Intel, IBM, AMD) hit a wall > on how high they can make their clock speed. They then needed some other > way to try to > increase the "performance" of their chips; since they could continue to > make smaller circuits > they came up on putting multiple cores on a single chip, then they can > "double" or "quad" the > claimed performance very easily. Unfortunately the whole multicore > "solution" is really > half-assed since it is difficult to effectively use all the cores, > especially since the memory > bandwidth did not improve as fast. > > Now when a company comes out with a half-assed product, do they say, > "this is a half-assed product"? > Did Microsoft say Vista was "half-assed". No, they emphasis the positive > parts of their product and > hide the limitations. This has been true since Grog made his first > stone wheel in front of this cave. > So Intel mislead everyone on how great multi-cores are. > > When you buy earlier dual or quad products you are NOT gettting a > parallel system (even > though it has 2 cores) because the memory is NOT parallel. > > Things are getting a bit better, Intel now has systems with higher > memory bandwidth. 
> The thing you have to look for is MEMORY BANDWDITH PER CORE, the higher > that is the > better performance you get. > > Note this doesn't have anything to do with PETSc, any sparse solver has > the exact same > issues. > > Barry > > > > On Apr 15, 2008, at 7:19 PM, Randall Mackie wrote: >> I'm running my PETSc code on a cluster of quad core Xeon's connected >> by Infiniband. I hadn't much worried about the performance, because >> everything seemed to be working quite well, but today I was actually >> comparing performance (wall clock time) for the same problem, but on >> different combinations of CPUS. >> >> I find that my PETSc code is quite scalable until I start to use >> multiple cores/cpu. >> >> For example, the run time doesn't improve by going from 1 core/cpu >> to 4 cores/cpu, and I find this to be very strange, especially since >> looking at top or Ganglia, all 4 cpus on each node are running at 100% >> almost >> all of the time. I would have thought if the cpus were going all out, >> that I would still be getting much more scalable results. >> >> We are using mvapich-0.9.9 with infiniband. So, I don't know if >> this is a cluster/Xeon issue, or something else. >> >> Anybody with experience on this? >> >> Thanks, Randy M. >> > From bsmith at mcs.anl.gov Wed Apr 16 09:17:18 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 16 Apr 2008 09:17:18 -0500 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <480602AF.5060802@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> <480602AF.5060802@gmail.com> Message-ID: <2035258F-8269-4231-9110-DE9188AE5FC8@mcs.anl.gov> On Apr 16, 2008, at 8:44 AM, Ben Tay wrote: > Hi, > > Am I right to say that despite all the hype about multi-core > processors, they can't speed up solving of linear eqns? It's not > possible to get a 2x speedup when using 2 cores. And is this true > for all types of linear equation solver besides PETSc? It will basically be the same for any iterative solver package. > What about parallel direct solvers (e.g. MUMPS) direct solvers are a bit less memory bandwidth limited, so scaling will be a bit better. But the time spent for problems where iterative solvers work well will likely be much higher for direct solver. > or those which uses openmp instead of mpich? openmp will give no benefit, this is a hardware limitation, not software. > Well, I just can't help feeling disappointed if that's the case... If you are going to do parallel computing you need to get use to disappointment. At this point in time (especially first generation dual/quad core systems) memory bandwidth is the fundamental limitation (not number of flops your hardware can do) to speed. Barry > > > Also, with a smart enough LSF scheduler, I will be assured of > getting separate processors ie 1 core from each different processor > instead of 2-4 cores from just 1 processor. In that case, if I use 1 > core from processor A and 1 core from processor B, I should be able > to get a decent speedup of more than 1, is that so? So long as your iterative solver ALGORITHM scales well, then you should see very good speedup (and most people do). Algorithm scaling means if you increase the number of processes the number of iterations should not increase much. 
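(A simple way to check the algorithm-scaling side of this is to record the iteration count after each solve and compare it across process counts; if it stays roughly flat while the KSPSolve time drops, the remaining losses are the hardware issues discussed above. A minimal sketch, assuming a KSP ksp and Vecs b, x are already set up as in the ex2/ex2f runs in this thread:)

  PetscInt       its;
  PetscErrorCode ierr;

  ierr = KSPSolve(ksp, b, x);              CHKERRQ(ierr);
  ierr = KSPGetIterationNumber(ksp, &its); CHKERRQ(ierr);
  /* compare 'its' between the -np 1, 2, 4 runs; it should not grow much */
  ierr = PetscPrintf(PETSC_COMM_WORLD, "KSP iterations: %D\n", its); CHKERRQ(ierr);

(The same count can also be read off from the lines printed by -ksp_monitor.)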
> This option is also better than using 2 or even 4 cores from the > same processor. Two cores out of the four will likely not be so bad either; all four will be bad. Barry > > > Thank you very much. > > Satish Balay wrote: >> On Wed, 16 Apr 2008, Ben Tay wrote: >> >> >>> Hi Satish, thank you very much for helping me run the ex2f.F code. >>> >>> I think I've a clearer picture now. I believe I'm running on Dual- >>> Core Intel >>> Xeon 5160. The quad core is only on atlas3-01 to 04 and there's >>> only 4 of >>> them. I guess that the lower peak is because I'm using Xeon 5160, >>> while you >>> are using Xeon X5355. >>> >> >> I'm still a bit puzzled. I just ran the same binary on a 2 dualcore >> xeon 5130 machine [which should be similar to your 5160 machine] and >> get the following: >> >> [balay at n001 ~]$ grep MatMult log* >> log.1:MatMult 1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e >> +00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 364 >> log.2:MatMult 1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e >> +03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 615 >> log.4:MatMult 969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e >> +03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 656 >> [balay at n001 ~]$ >> >>> You mention about the speedups for MatMult and compare between >>> KSPSolve. Are >>> these the only things we have to look at? Because I see that some >>> other event >>> such as VecMAXPY also takes up a sizable % of the time. To get an >>> accurate >>> speedup, do I just compare the time taken by KSPSolve between >>> different no. of >>> processors or do I have to look at other events such as MatMult as >>> well? >>> >> >> Sometimes we look at individual components like MatMult() VecMAXPY() >> to understand whats hapenning in each stage - and at KSPSolve() to >> look at the agregate performance for the whole solve [which includes >> MatMult VecMAXPY etc..]. Perhaps I should have also looked at >> VecMDot() aswell - at 48% of runtime - its the biggest contributor to >> KSPSolve() for your run. >> >> Its easy to get lost in the details of log_summary. Looking for >> anamolies is one thing. Plotting scalability charts for the solver is >> something else.. >> >> >>> In summary, due to load imbalance, my speedup is quite bad. So >>> maybe I'll just >>> send your results to my school's engineer and see if they could do >>> anything. >>> For my part, I guess I'll just 've to wait? >>> >> >> Yes - load imbalance at MatMult level is bad. On 4 proc run you have >> ratio = 3.6 . This implies - there is one of the mpi-tasks is 3.6 >> times slower than the other task [so all speedup is lost here] >> >> You could try the latest mpich2 [1.0.7] - just for this SMP >> experiment, and see if it makes a difference. I've built mpich2 with >> [default gcc/gfortran and]: >> >> ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker >> >> There could be something else going on on this machine thats messing >> up load-balance for basic petsc example.. >> >> Satish >> >> >> > From dalcinl at gmail.com Wed Apr 16 09:24:03 2008 From: dalcinl at gmail.com (Lisandro Dalcin) Date: Wed, 16 Apr 2008 11:24:03 -0300 Subject: PETSc + HYPRE In-Reply-To: <7d6158b80804160451w2b964b32m6401e655e0ec6a4@mail.gmail.com> References: <7d6158b80804150722p49687acdpca3e571b05679026@mail.gmail.com> <7d6158b80804151146w67635185v75088d41724a6fb3@mail.gmail.com> <7d6158b80804160451w2b964b32m6401e655e0ec6a4@mail.gmail.com> Message-ID: OK, You said you are trying to solve NS eqs. Are you using a pressure projection-like method? 
In that case, is the matrix of your pressure problem much different than the Laplacian one? How do you handle the pressure 'rigid-body' mode? On 4/16/08, M?rcio Ricardo Pivello wrote: > Hi, Lisandro. I must apologize for not answering. I read your email and > changed my code, but then I went through a different path. I'm trying to > call the preconditioner from the command line, without any mention to it in > the source code. It should take a couple of hours to get some results and > then, if it doesn't work, I'll change the code. > > > Thank you very much. > > > M?rcio Ricardo > > > > > > > > On Tue, Apr 15, 2008 at 8:43 PM, Lisandro Dalcin wrote: > > > Sorry for my insistence, but... Did you see my previous mail? The code > > you wrote is not OK. You have to first create the KSP, next extract > > the PC with KSPGetPC, and then configure the PC to use HYPRE+BoomerAMG > > > > To be sure you are actually being using hypre, add -ksp_view to command > line. > > > > > > > > > > > > On 4/15/08, M?rcio Ricardo Pivello wrote: > > > Hy, Matthew, thanks for your help. > > > > > > Firstly, I'm solving a 3D Incompressible Navier Stokes solver based on > FEM, > > > with fluid-structure interaction. In this case, I'm simulating the blood > > > flow inside an aneurysm in an abdominal aorta artery. > > > By not working I mean the error does not decrease with time. Our team > is > > > just starting using HYPRE, in fact this is the very first case we run > with > > > it. > > > > > > > > > Again, thanks for your help. > > > > > > > > > M?rcio Ricardo. > > > > > > > > > > > > > > > -- > > > > > > > > Lisandro Dalc?n > > --------------- > > Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) > > Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) > > Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) > > PTLC - G?emes 3450, (3000) Santa Fe, Argentina > > Tel/Fax: +54-(0)342-451.1594 > > > > > > -- Lisandro Dalc?n --------------- Centro Internacional de M?todos Computacionales en Ingenier?a (CIMEC) Instituto de Desarrollo Tecnol?gico para la Industria Qu?mica (INTEC) Consejo Nacional de Investigaciones Cient?ficas y T?cnicas (CONICET) PTLC - G?emes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594 From balay at mcs.anl.gov Wed Apr 16 09:27:41 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 16 Apr 2008 09:27:41 -0500 (CDT) Subject: general question on speed using quad core Xeons In-Reply-To: <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> Message-ID: Just a note: Intel does publish benchmarks for their chips. 
http://www.intel.com/performance/server/xeon/hpcapp.htm Satish From gsanjay at ethz.ch Wed Apr 16 09:27:33 2008 From: gsanjay at ethz.ch (Sanjay Govindjee) Date: Wed, 16 Apr 2008 16:27:33 +0200 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <2035258F-8269-4231-9110-DE9188AE5FC8@mcs.anl.gov> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> <480602AF.5060802@gmail.com> <2035258F-8269-4231-9110-DE9188AE5FC8@mcs.anl.gov> Message-ID: <48060CD5.1010308@ethz.ch> >> >> Also, with a smart enough LSF scheduler, I will be assured of getting >> separate processors ie 1 core from each different processor instead >> of 2-4 cores from just 1 processor. In that case, if I use 1 core >> from processor A and 1 core from processor B, I should be able to get >> a decent speedup of more than 1, is that so? > > You still need to be careful with the hardware you choose. If the processor's live on the same motherboard then you still need to make sure that they each have their own memory bus. Otherwise you will still face memory bottlenecks as each single core, from the different processors, fights for bandwidth on the bus. It all depends on the memory bus architecture of your system. In this regard, I recommend staying away from Intel style systems. -sg From bsmith at mcs.anl.gov Wed Apr 16 09:59:31 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 16 Apr 2008 09:59:31 -0500 Subject: general question on speed using quad core Xeons In-Reply-To: References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> Message-ID: <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> Cool. The pages to look at are http://www.intel.com/performance/server/xeon/hpc_ansys.htm http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm these are the two benchmarks that reflect the bottlenecks of memory bandwidth. When going from dual to quad they get 1.2 times the performance, when one would like 2 times the performance. Barry On Apr 16, 2008, at 9:27 AM, Satish Balay wrote: > Just a note: > > Intel does publish benchmarks for their chips. > > http://www.intel.com/performance/server/xeon/hpcapp.htm > > Satish > From berend at chalmers.se Wed Apr 16 10:10:32 2008 From: berend at chalmers.se (Berend van Wachem) Date: Wed, 16 Apr 2008 17:10:32 +0200 Subject: general question on speed using quad core Xeons In-Reply-To: <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> Message-ID: <480616E8.9020205@chalmers.se> Hi Barry, > http://www.intel.com/performance/server/xeon/hpc_ansys.htm > http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm Aren't both benchmarks run on Quads? The difference just being the cache per processor? Or am I mistaken? Berend. > these are the two benchmarks that reflect the bottlenecks of memory > bandwidth. > When going from dual to quad they get 1.2 times the performance, when > one would > like 2 times the performance. > > Barry > > > On Apr 16, 2008, at 9:27 AM, Satish Balay wrote: >> Just a note: >> >> Intel does publish benchmarks for their chips. 
>> >> http://www.intel.com/performance/server/xeon/hpcapp.htm >> >> Satish >> > From balay at mcs.anl.gov Wed Apr 16 10:38:18 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 16 Apr 2008 10:38:18 -0500 (CDT) Subject: general question on speed using quad core Xeons In-Reply-To: <480616E8.9020205@chalmers.se> References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> <480616E8.9020205@chalmers.se> Message-ID: On Wed, 16 Apr 2008, Berend van Wachem wrote: > Hi Barry, > > > > http://www.intel.com/performance/server/xeon/hpc_ansys.htm > > http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm > > > Aren't both benchmarks run on Quads? The difference just being the cache per > processor? Or am I mistaken? Yes - both are quads, and yes - the cache sizes are different. But I think the primary feature that contributes to the performance difference is memory bandwidth. The first one is 1333 FSB, the second one is 1600FSB - i.e 20% improvement in memory bandwidth => 20% improvement in performance for the above benchmarks. Satish From rlmackie862 at gmail.com Wed Apr 16 10:42:15 2008 From: rlmackie862 at gmail.com (Randall Mackie) Date: Wed, 16 Apr 2008 08:42:15 -0700 Subject: general question on speed using quad core Xeons In-Reply-To: References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> <480616E8.9020205@chalmers.se> Message-ID: <48061E57.3090200@gmail.com> I just want to say that I really have appreciated this discussion - issues like this tend to get lost or not addressed when we're working on our codes, and it's been very enlightening for me. Randy Satish Balay wrote: > On Wed, 16 Apr 2008, Berend van Wachem wrote: > >> Hi Barry, >> >> >>> http://www.intel.com/performance/server/xeon/hpc_ansys.htm >>> http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm >> >> Aren't both benchmarks run on Quads? The difference just being the cache per >> processor? Or am I mistaken? > > Yes - both are quads, and yes - the cache sizes are different. > > But I think the primary feature that contributes to the performance > difference is memory bandwidth. The first one is 1333 FSB, the second > one is 1600FSB - i.e 20% improvement in memory bandwidth => 20% > improvement in performance for the above benchmarks. > > Satish > From tribur at vision.ee.ethz.ch Thu Apr 17 05:23:13 2008 From: tribur at vision.ee.ethz.ch (tribur at vision.ee.ethz.ch) Date: Thu, 17 Apr 2008 12:23:13 +0200 Subject: Hypre Message-ID: <20080417122313.ws2qnzcgg8co480w@email.ee.ethz.ch> Dear Petsc experts, Another, more basic problem when using Hypre: When I try ex2_hdf5 -f data/sphereInBlockFiner_a_cd3t.h5 -pc_type hypre -pc_hypre_type pilut -log_summary where sphereInBlock_a_cd3t.h5 contains a 102464 x 102464 matrix, the program seems to hang (it stops with the error "=>> PBS: job killed: walltime 384 exceeded limit 360"). Adding the option -ksp_max_it 1 to be sure that it is not iterating until 10000000000000 doesn't change anything. The same happens also if I use -pc_hypre_type boomeramg. It is neither the problem of my program nor of the matrix, because ex2_hdf5 -f data/sphereInBlockFiner_a_cd3t.h5 -pc_type jacobi -log_summary -ksp_rtol 0.0000000001 takes only 5s and gives me the correct solution. What do I do wrong? 
Looking forward to your answer, Kathrin From knepley at gmail.com Thu Apr 17 07:01:34 2008 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 17 Apr 2008 07:01:34 -0500 Subject: Hypre In-Reply-To: <20080417122313.ws2qnzcgg8co480w@email.ee.ethz.ch> References: <20080417122313.ws2qnzcgg8co480w@email.ee.ethz.ch> Message-ID: On Thu, Apr 17, 2008 at 5:23 AM, wrote: > Dear Petsc experts, > > Another, more basic problem when using Hypre: > > When I try > > ex2_hdf5 -f data/sphereInBlockFiner_a_cd3t.h5 -pc_type hypre -pc_hypre_type > pilut -log_summary > > where sphereInBlock_a_cd3t.h5 contains a 102464 x 102464 matrix, > the program seems to hang (it stops with the error "=>> PBS: job killed: > walltime 384 exceeded limit 360"). > Adding the option -ksp_max_it 1 to be sure that it is not iterating until > 10000000000000 doesn't change anything. The same happens also if I use > -pc_hypre_type boomeramg. If you really think it is hanging, I would attach gdb and get a stack trace. You can either run with -start_in_debugger, or attach gdb to the running process with gdb . It is conceivable to me that pilut just takes a really long time to factor the matrix. For boomeramg this is less likely, but still believeable. Matt > It is neither the problem of my program nor of the matrix, because > > ex2_hdf5 -f data/sphereInBlockFiner_a_cd3t.h5 -pc_type jacobi -log_summary > -ksp_rtol 0.0000000001 > > takes only 5s and gives me the correct solution. > > What do I do wrong? > > Looking forward to your answer, > Kathrin > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From zonexo at gmail.com Thu Apr 17 22:55:39 2008 From: zonexo at gmail.com (Ben Tay) Date: Fri, 18 Apr 2008 11:55:39 +0800 Subject: Slow speed after changing from serial to parallel In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> Message-ID: <48081BBB.5050004@gmail.com> An HTML attachment was scrubbed... URL: From balay at mcs.anl.gov Fri Apr 18 00:52:14 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Fri, 18 Apr 2008 00:52:14 -0500 (CDT) Subject: Slow speed after changing from serial to parallel In-Reply-To: <48081BBB.5050004@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48081BBB.5050004@gmail.com> Message-ID: On Fri, 18 Apr 2008, Ben Tay wrote: > Hi, > > I've email my school super computing staff and they told me that the queue which I'm using is one meant for testing, hence, it's > handling of work load is not good. I've sent my job to another queue and it's run on 4 processors. It's my own code because there seems > to be something wrong with the server displaying the summary when using -log_summary with ex2f.F. I'm trying it again. Thats wierd. We should first make sure ex2f [or ex2] are running properly before looking at your code. > > Anyway comparing just kspsolve between the two, the speedup is about 2.7. However, I noticed that for the 4 processors one, its > MatAssemblyBegin is? 1.5158e+02, which is more than KSPSolve's 4.7041e+00. So is MatAssemblyBegin's time included in KSPSolve? 
If not, > does it mean that there's something wrong about my MatAssemblyBegin? MatAssemblyBegin is not included in KSPSolve(). Something wierd is going here. There are 2 possibilities. - whatever code you have before matrix assembly is unbalanced, so MatAssemblyBegin() acts as a barrier . - MPI communication is not optimal within the node. Its best to first make sure ex2 or ex2f runs fine. As recommended earlier - you should try latest mpich2 with --with-device=ch3:nemesis:newtcp and compare ex2/ex2f performance with your current MPI. Satish From bsmith at mcs.anl.gov Fri Apr 18 07:08:46 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 18 Apr 2008 07:08:46 -0500 Subject: Slow speed after changing from serial to parallel In-Reply-To: <48081BBB.5050004@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <480336BE.3070507@gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48081BBB.5050004@gmail.com> Message-ID: On Apr 17, 2008, at 10:55 PM, Ben Tay wrote: > Hi, > > I've email my school super computing staff and they told me that the > queue which I'm using is one meant for testing, hence, it's handling > of work load is not good. I've sent my job to another queue and it's > run on 4 processors. It's my own code because there seems to be > something wrong with the server displaying the summary when using - > log_summary with ex2f.F. I'm trying it again. > > Anyway comparing just kspsolve between the two, the speedup is about > 2.7. However, I noticed that for the 4 processors one, its > MatAssemblyBegin is 1.5158e+02, which is more than KSPSolve's > 4.7041e+00. You have a huge load imbalance in setting the values in the matrix (the load imbalance is 2254.7). Are you sure each process is setting about the same amount of matrix entries? Also are you doing an accurate matrix preallocation (see the detailed manual pages for MatMPIAIJSetPreallocation() and MatCreateMPIAIJ()). You can run with - info and grep for malloc to see if the MatSetValues() is allocating additional memory. If you get the matrix preallocation correct you will see a HUGE speed improvement. Barry > So is MatAssemblyBegin's time included in KSPSolve? If not, does it > mean that there's something wrong about my MatAssemblyBegin? > > Thank you > > For 1 processor: > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. 
Use 'enscript - > r -fCourier9' to print this document *** > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance > Summary: ---------------------------------------------- > > ./a.out on a atlas3 named atlas3-c28 with 1 processor, by g0306332 > Fri Apr 18 08:46:11 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST > 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.322e+02 1.00000 1.322e+02 > Objects: 2.200e+01 1.00000 2.200e+01 > Flops: 2.242e+08 1.00000 2.242e+08 2.242e+08 > Flops/sec: 1.696e+06 1.00000 1.696e+06 1.696e+06 > MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Reductions: 2.100e+01 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of > length N --> 2N flops > and VecAXPY() for complex vectors of > length N --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- > Messages --- -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts > %Total Avg %Total counts %Total > 0: Main Stage: 1.3217e+02 100.0% 2.2415e+08 100.0% 0.000e > +00 0.0% 0.000e+00 0.0% 2.100e+01 100.0% > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with > PetscLogStagePush() and PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in > this phase > %M - percent messages in this phase %L - percent message > lengths in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max > time over all processors) > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. 
# > ########################################################## > > > > Event Count Time (sec) Flops/ > sec --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg > len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 6 1.0 1.8572e-01 1.0 3.77e+08 1.0 0.0e+00 0.0e > +00 0.0e+00 0 31 0 0 0 0 31 0 0 0 377 > MatConvert 1 1.0 1.1636e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > MatAssemblyBegin 1 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 8.8531e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRow 1296000 1.0 2.6576e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatZeroEntries 1 1.0 4.4700e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPGMRESOrthog 6 1.0 2.1104e-01 1.0 5.16e+08 1.0 0.0e+00 0.0e > +00 6.0e+00 0 49 0 0 29 0 49 0 0 29 516 > KSPSetup 1 1.0 6.5601e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 1.2883e+01 1.0 1.74e+07 1.0 0.0e+00 0.0e > +00 1.5e+01 10100 0 0 71 10100 0 0 71 17 > PCSetUp 1 1.0 4.4342e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 2.0e+00 3 0 0 0 10 3 0 0 0 10 0 > PCApply 7 1.0 7.7337e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 6 0 0 0 0 6 0 0 0 0 0 > VecMDot 6 1.0 9.8586e-02 1.0 5.52e+08 1.0 0.0e+00 0.0e > +00 6.0e+00 0 24 0 0 29 0 24 0 0 29 552 > VecNorm 7 1.0 6.9757e-02 1.0 2.60e+08 1.0 0.0e+00 0.0e > +00 7.0e+00 0 8 0 0 33 0 8 0 0 33 260 > VecScale 7 1.0 2.9803e-02 1.0 3.04e+08 1.0 0.0e+00 0.0e > +00 0.0e+00 0 4 0 0 0 0 4 0 0 0 304 > VecCopy 1 1.0 6.1009e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 9 1.0 3.1438e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 1 1.0 7.5161e-03 1.0 3.45e+08 1.0 0.0e+00 0.0e > +00 0.0e+00 0 1 0 0 0 0 1 0 0 0 345 > VecMAXPY 7 1.0 1.4444e-01 1.0 4.85e+08 1.0 0.0e+00 0.0e > +00 0.0e+00 0 31 0 0 0 0 31 0 0 0 485 > VecAssemblyBegin 2 1.0 4.2915e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 6.0e+00 0 0 0 0 29 0 0 0 0 29 0 > VecAssemblyEnd 2 1.0 6.9141e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecNormalize 7 1.0 9.9603e-02 1.0 2.73e+08 1.0 0.0e+00 0.0e > +00 7.0e+00 0 12 0 0 33 0 12 0 0 33 273 > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' > Mem. 
> > --- Event Stage 0: Main Stage > > Matrix 1 1 98496004 0 > Krylov Solver 1 1 17216 0 > Preconditioner 1 1 272 0 > Vec 19 19 186638392 0 > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > ====================================================================== > Average time to get PetscTime(): 9.53674e-08 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Wed Jan 9 14:33:02 2008 > Configure options: --with-cc=icc --with-fc=ifort --with-x=0 --with- > blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared --with- > mpi-dir=/lsftmp/g0306332/mpich2/ --with-debugging=0 --with-hypre- > dir=/home/enduser/g0306332/lib/hypre_shared > ----------------------------------------- > Libraries compiled on Wed Jan 9 14:33:36 SGT 2008 on atlas3-c01 > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed > Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > Using PETSc directory: /home/enduser/g0306332/petsc-2.3.3-p8 > Using PETSc arch: atlas3 > ----------------------------------------- > Using C compiler: icc -fPIC -O > > for 4 processors > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript - > r -fCourier9' to print this document *** > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance > Summary: ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c23 with 4 processors, by > g0306332 Fri Apr 18 08:22:11 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST > 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > 0.000000000000000E+000 58.1071298622710 > 0.000000000000000E+000 58.1071298622710 > 0.000000000000000E+000 58.1071298622710 > 0.000000000000000E+000 58.1071298622710 > Time (sec): 3.308e+02 1.00177 3.305e+02 > Objects: 2.900e+01 1.00000 2.900e+01 > Flops: 5.605e+07 1.00026 5.604e+07 2.242e+08 > Flops/sec: 1.697e+05 1.00201 1.695e+05 6.782e+05 > MPI Messages: 1.400e+01 2.00000 1.050e+01 4.200e+01 > MPI Message Lengths: 1.248e+05 2.00000 8.914e+03 3.744e+05 > MPI Reductions: 7.500e+00 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of > length N --> 2N flops > and VecAXPY() for complex vectors of > length N --> 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- > Messages --- -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts > %Total Avg %Total counts %Total > 0: Main Stage: 3.3051e+02 100.0% 2.2415e+08 100.0% 4.200e+01 > 100.0% 8.914e+03 100.0% 3.000e+01 100.0% > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. 
> Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with > PetscLogStagePush() and PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in > this phase > %M - percent messages in this phase %L - percent message > lengths in this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max > time over all processors) > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. # > # preloading. otherwise timing numbers may be # > # meaningless. # > ########################################################## > > > Event Count Time (sec) Flops/ > sec --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg > len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 6 1.0 8.2640e-02 1.6 3.37e+08 1.6 3.6e+01 9.6e > +03 0.0e+00 0 31 86 92 0 0 31 86 92 0 846 > MatConvert 1 1.0 2.1472e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 1.5158e+022254.7 0.00e+00 0.0 0.0e+00 > 0.0e+00 2.0e+00 22 0 0 0 7 22 0 0 0 7 0 > MatAssemblyEnd 1 1.0 1.5766e-01 1.1 0.00e+00 0.0 6.0e+00 4.8e > +03 7.0e+00 0 0 14 8 23 0 0 14 8 23 0 > MatGetRow 324000 1.0 8.9608e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 2 1.0 5.9605e-06 2.8 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatZeroEntries 1 1.0 5.8902e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPGMRESOrthog 6 1.0 1.1247e-01 1.7 4.11e+08 1.7 0.0e+00 0.0e > +00 6.0e+00 0 49 0 0 20 0 49 0 0 20 968 > KSPSetup 1 1.0 1.5483e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 4.7041e+00 1.0 1.19e+07 1.0 3.6e+01 9.6e > +03 1.5e+01 1100 86 92 50 1100 86 92 50 48 > PCSetUp 1 1.0 1.5953e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 2.0e+00 0 0 0 0 7 0 0 0 0 7 0 > PCApply 7 1.0 2.6580e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > VecMDot 6 1.0 7.3443e-02 2.2 4.13e+08 2.2 0.0e+00 0.0e > +00 6.0e+00 0 24 0 0 20 0 24 0 0 20 741 > VecNorm 7 1.0 2.5193e-01 1.1 1.94e+07 1.1 0.0e+00 0.0e > +00 7.0e+00 0 8 0 0 23 0 8 0 0 23 72 > VecScale 7 1.0 6.6319e-03 2.8 9.64e+08 2.8 0.0e+00 0.0e > +00 0.0e+00 0 4 0 0 0 0 4 0 0 0 1368 > VecCopy 1 1.0 2.3100e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 9 1.0 1.4173e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 1 1.0 2.9502e-03 1.7 3.72e+08 1.7 0.0e+00 0.0e > +00 0.0e+00 0 1 0 0 0 0 1 0 0 0 879 > VecMAXPY 7 1.0 4.9046e-02 1.4 5.09e+08 1.4 0.0e+00 0.0e > +00 0.0e+00 0 31 0 0 0 0 31 0 0 0 1427 > VecAssemblyBegin 2 1.0 4.3297e-04 3.1 0.00e+00 0.0 0.0e+00 0.0e > +00 6.0e+00 0 0 0 0 20 0 0 0 0 20 0 > VecAssemblyEnd 2 1.0 5.2452e-06 1.4 
0.00e+00 0.0 0.0e+00 0.0e > +00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecScatterBegin 6 1.0 6.9666e-04 6.3 0.00e+00 0.0 3.6e+01 9.6e > +03 0.0e+00 0 0 86 92 0 0 0 86 92 0 0 > VecScatterEnd 6 1.0 1.4806e-02102.6 0.00e+00 0.0 0.0e+00 > 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecNormalize 7 1.0 2.5431e-01 1.1 2.86e+07 1.1 0.0e+00 0.0e > +00 7.0e+00 0 12 0 0 23 0 12 0 0 23 107 > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' > Mem. > > --- Event Stage 0: Main Stage > > Matrix 3 3 49252812 0 > Krylov Solver 1 1 17216 0 > Preconditioner 1 1 272 0 > Index Set 2 2 5488 0 > Vec 21 21 49273624 0 > Vec Scatter 1 1 0 0 > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > = > ====================================================================== > Average time to get PetscTime(): 1.90735e-07 > Average time for MPI_Barrier(): 5.62668e-06 > Average time for zero size MPI_Send(): 6.73532e-06 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > > >>>> >>>> >>> >> >> >> From recrusader at gmail.com Fri Apr 18 20:40:04 2008 From: recrusader at gmail.com (Yujie) Date: Fri, 18 Apr 2008 18:40:04 -0700 Subject: how to combine several matrice into one matrix Message-ID: <7ff0ee010804181840i2195c9e1wf6757ce6faff5a72@mail.gmail.com> Hi, everyone Assuming there are A1(M*N) A2(M*N) A3(M*N), I want to get A1 A=A2 A3 My method is MatGetArray(A1,&a1); MatSetValues(A,a1); MatGetArray(A2,&a2); MatSetValues(A,a2); MatGetArray(A3,&a3); MatSetValues(A,a3); Is there any better methods for it? The above codes are slow. thanks a lot. Regards, Yujie -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at mcs.anl.gov Fri Apr 18 21:12:00 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Fri, 18 Apr 2008 21:12:00 -0500 Subject: how to combine several matrice into one matrix In-Reply-To: <7ff0ee010804181840i2195c9e1wf6757ce6faff5a72@mail.gmail.com> References: <7ff0ee010804181840i2195c9e1wf6757ce6faff5a72@mail.gmail.com> Message-ID: <23BC215B-E85A-401B-A1BE-96A4CACF183F@mcs.anl.gov> For dense matrices only. You can call MatGetArray() on A and then do direct copies of the arrays. Barry On Apr 18, 2008, at 8:40 PM, Yujie wrote: > Hi, everyone > > Assuming there are A1(M*N) A2(M*N) A3(M*N), I want to get > A1 > A=A2 > A3 > > My method is > > MatGetArray(A1,&a1); > MatSetValues(A,a1); > MatGetArray(A2,&a2); > MatSetValues(A,a2); > MatGetArray(A3,&a3); > MatSetValues(A,a3); > > Is there any better methods for it? The above codes are slow. thanks > a lot. 
> > Regards, > Yujie > > From zonexo at gmail.com Fri Apr 18 23:11:34 2008 From: zonexo at gmail.com (Ben Tay) Date: Sat, 19 Apr 2008 12:11:34 +0800 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <48060CD5.1010308@ethz.ch> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> <480602AF.5060802@gmail.com> <2035258F-8269-4231-9110-DE9188AE5FC8@mcs.anl.gov> <48060CD5.1010308@ethz.ch> Message-ID: <480970F6.5060007@gmail.com> An HTML attachment was scrubbed... URL: From balay at mcs.anl.gov Sat Apr 19 08:52:51 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Sat, 19 Apr 2008 08:52:51 -0500 (CDT) Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <480970F6.5060007@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> <480602AF.5060802@gmail.com> <2035258F-8269-4231-9110-DE9188AE5FC8@mcs.anl.gov> <48060CD5.1010308@ethz.ch> <480970F6.5060007@gmail.com> Message-ID: On Sat, 19 Apr 2008, Ben Tay wrote: > Btw, I'm not able to try the latest mpich2 because I do not have the > administrator rights. I was told that some special configuration is > required. You don't need admin rights to install/use MPICH with the options I mentioned. I was sugesting just running in SMP mode on a single machine [from 1-8 procs on Quad-Core Intel Xeon X5355, to compare with my SMP runs] with: ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker > Btw, should there be any different in speed whether I use mpiuni and > ifort or mpi and mpif90? I tried on ex2f (below) and there's only a > small difference. If there is a large difference (mpi being slower), > then it mean there's something wrong in the code? For one - you are not using MPIUNI. You are using --with-mpi-dir=/lsftmp/g0306332/mpich2. However - if compilers are the same & compiler options are the same, I would expect the same performance in both the cases. Do you get such different times for different runs of the same binary? MatMult 384 vs 423 What if you run both of the binaries on the same machine? [as a single job?]. If you are using pbs scheduler - sugest doing: - squb -I [to get interactive access to thenodes] - login to each node - to check no one else is using the scheduled nodes. - run multiple jobs during this single allocation for comparision. These are general tips to help you debug performance on your cluster. BTW: I get: ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 397 You get: log.1:MatMult???????????? 1879 1.0 2.8137e+01 1.0 3.84e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 11? 0? 0? 0? 12 11? 0? 0? 0?? 384 There is a difference in number of iterations. Are you sure you are using the same ex2f with -m 600 -n 600 options? 
Satish From zonexo at gmail.com Sat Apr 19 10:18:49 2008 From: zonexo at gmail.com (Ben Tay) Date: Sat, 19 Apr 2008 23:18:49 +0800 Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> <480602AF.5060802@gmail.com> <2035258F-8269-4231-9110-DE9188AE5FC8@mcs.anl.gov> <48060CD5.1010308@ethz.ch> <480970F6.5060007@gmail.com> Message-ID: <480A0D59.9050804@gmail.com> Hi Satish, 1st of all, I forgot to inform u that I've changed the m and n to 800. I would like to see if the larger value can make the scaling better. If req, I can redo the test with m,n=600. I can install MPICH but I don't think I can choose to run on a single machine using from 1 to 8 procs. In order to run the code, I usually have to use the command bsub -o log -q linux64 ./a.out for single procs bsub -o log -q mcore_parallel -n $ -a mvapich mpirun.lsf ./a.out where $=no. of procs. for multiple procs After that, when the job is running, I'll be given the server which my job runs on e.g. atlas3-c10 (1 procs) or 2*atlas3-c10 + 2*atlas3-c12 (4 procs) or 2*atlas3-c10 + 2*atlas3-c12 +2*atlas3-c11 + 2*atlas3-c13 (8 procs). I was told that 2*atlas3-c10 doesn't mean that it is running on a dual core single cpu. Btw, are you saying that I should 1st install the latest MPICH2 build with the option : ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker And then install PETSc with the MPICH2? So after that do you know how to do what you've suggest for my servers? I don't really understand what you mean. May I supposed to run 4 jobs on 1 quadcore? Or 1 job using 4 cores on 1 quadcore? Well, I do know that atlas3-c00 to c03 are the location of the quad cores. I can force to use them by bsub -o log -q mcore_parallel -n $ -m quadcore -a mvapich mpirun.lsf ./a.out Lastly, I make a mistake in the different times reported by the same compiler. Sorry abt that. Thank you very much. Satish Balay wrote: > On Sat, 19 Apr 2008, Ben Tay wrote: > > >> Btw, I'm not able to try the latest mpich2 because I do not have the >> administrator rights. I was told that some special configuration is >> required. >> > > You don't need admin rights to install/use MPICH with the options I > mentioned. I was sugesting just running in SMP mode on a single > machine [from 1-8 procs on Quad-Core Intel Xeon X5355, to compare with > my SMP runs] with: > > ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker > > >> Btw, should there be any different in speed whether I use mpiuni and >> ifort or mpi and mpif90? I tried on ex2f (below) and there's only a >> small difference. If there is a large difference (mpi being slower), >> then it mean there's something wrong in the code? >> > > For one - you are not using MPIUNI. You are using > --with-mpi-dir=/lsftmp/g0306332/mpich2. However - if compilers are the > same & compiler options are the same, I would expect the same > performance in both the cases. Do you get such different times for > different runs of the same binary? > > MatMult 384 vs 423 > > What if you run both of the binaries on the same machine? [as a single > job?]. > > If you are using pbs scheduler - sugest doing: > - squb -I [to get interactive access to thenodes] > - login to each node - to check no one else is using the scheduled nodes. 
> - run multiple jobs during this single allocation for comparision. > > These are general tips to help you debug performance on your cluster. > > BTW: I get: > ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 397 > > You get: > log.1:MatMult 1879 1.0 2.8137e+01 1.0 3.84e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 11 0 0 0 12 11 0 0 0 384 > > > There is a difference in number of iterations. Are you sure you are > using the same ex2f with -m 600 -n 600 options? > > Satish From balay at mcs.anl.gov Sat Apr 19 13:19:34 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Sat, 19 Apr 2008 13:19:34 -0500 (CDT) Subject: Slow speed after changing from serial to parallel (with ex2f.F) In-Reply-To: <480A0D59.9050804@gmail.com> References: <804ab5d40804130212y7199b8e7l805a1ef152f6c757@mail.gmail.com> <48035F88.2080003@gmail.com> <4804CAC0.6060201@gmail.com> <4804D044.2060502@gmail.com> <4804DB61.3080906@gmail.com> <48055FAD.3000105@gmail.com> <48056C08.6030903@gmail.com> <4805820C.8030803@gmail.com> <480602AF.5060802@gmail.com> <2035258F-8269-4231-9110-DE9188AE5FC8@mcs.anl.gov> <48060CD5.1010308@ethz.ch> <480970F6.5060007@gmail.com> <480A0D59.9050804@gmail.com> Message-ID: Ben, This conversation is getting long and winding. And we are are getting into your cluster adminstration - which is not PETSc related. I'll sugest you figureout about using the cluster from your system admin and how to use bsub. http://www.vub.ac.be/BFUCC/LSF/man/bsub.1.html However I'll point out the following things. - I'll sugest learning about scheduling an interactive job on your cluster. This will help you with running multiple jobs on the same machine. - When making comparisions, have minimum changes between thing you compare runs. * For eg: you are comparing runs between different queues '-q linux64' '-q mcore_parallel'. There might be differences here that can result in different performance. * If you are getting part of the machine [for -n 1 jobs] - verify if you are sharing the other part with some other job. Without this verification - your numbers are not meaningful. [depending upon how the queue is configured - it can either allocate part of the node or full node] * you should be able to request 4procs [i.e 1 complete machine] but be able to run either -np 1, 2 or 4 on the allocation. [This is easier to do in interactive mode]. This ensures nobody else is using the machine. And you can run your code multiple times - to see if you are getting consistant results. Regarding the primary issue you've had - with performance debugging your PETSc appliation in *SMP-mode*, we've observed performance anamolies in your log_summary for both your code, and ex2.f.F This could be due one or more of the following: - issues in your code - issues with MPI you are using - isues with the cluster you are using. To narrow down - the comparisions I sugest: - compare my ex2f.F with the *exact* same runs on your machine [You've claimed that you also hav access to a 2-quad-core Intel Xeon X5355 machine]. So you should be able to reproduce the exact same experiment as me - and compare the results. This should keep both software same - and show differences in system software etc.. >>>>> ? No of Nodes Processors Qty per node Total cores per node Memory per node ? ? 4 Quad-Core Intel Xeon X5355 2 8 16 GB ? ^^^ ? 
60 Dual-Core Intel Xeon 5160 2 4 8 GB <<<<< i.e configure latest mpich2 with [default compilers gcc/gfortran]: ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker Build PETSc with this MPI [and same compilers] ./config/configure.py --with-mpi-dir= --with-debugging=0 And run ex2f.F 600x600 on 1, 2, 4, 8 procs on a *single* X5355 machine. [it might have a different queue name] - Now compare ex2f.F performance wtih MPICH [as built above] and the current MPI you are using. This should identify the performance differences between MPI implemenations within the box [within the SMP box] - Now compare runs between ex2f.F and your application. At each of the above steps of comparision - we are hoping to identify the reason for differences and rectify. Perhaps this is not possible on your cluster and you can't improve on what you already have.. If you can't debug the SMP performance issues, you can avoid SMP completely, and use 1 MPI task per machine [or 1 MPI task per memory bank => 2 per machine]. But you'll still have to do similar analysis to make sure there are no performance anamolies in the tool chain. [i.e hardware, system software, MPI, application] If you are willing to do the above steps, we can help with the comparisions. As mentioned - this is getting long and windy. If you have futher questions in this regard - we should contiune it at petsc-maint at mcs.anl.gov Satish On Sat, 19 Apr 2008, Ben Tay wrote: > Hi Satish, > > 1st of all, I forgot to inform u that I've changed the m and n to 800. I would > like to see if the larger value can make the scaling better. If req, I can > redo the test with m,n=600. > > I can install MPICH but I don't think I can choose to run on a single machine > using from 1 to 8 procs. In order to run the code, I usually have to use the > command > > bsub -o log -q linux64 ./a.out for single procs > > bsub -o log -q mcore_parallel -n $ -a mvapich mpirun.lsf ./a.out where $=no. > of procs. for multiple procs > > After that, when the job is running, I'll be given the server which my job > runs on e.g. atlas3-c10 (1 procs) or 2*atlas3-c10 + 2*atlas3-c12 (4 procs) or > 2*atlas3-c10 + 2*atlas3-c12 +2*atlas3-c11 + 2*atlas3-c13 (8 procs). I was told > that 2*atlas3-c10 doesn't mean that it is running on a dual core single cpu. > > Btw, are you saying that I should 1st install the latest MPICH2 build with the > option : > > ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker And then install > PETSc with the MPICH2? > > So after that do you know how to do what you've suggest for my servers? I > don't really understand what you mean. May I supposed to run 4 jobs on 1 > quadcore? Or 1 job using 4 cores on 1 quadcore? Well, I do know that > atlas3-c00 to c03 are the location of the quad cores. I can force to use them > by > > bsub -o log -q mcore_parallel -n $ -m quadcore -a mvapich mpirun.lsf ./a.out > > Lastly, I make a mistake in the different times reported by the same compiler. > Sorry abt that. > > Thank you very much. 
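The next message returns to the earlier question of stacking three dense matrices A1, A2, A3 into one matrix A. As context for Barry's suggestion there to copy the raw arrays directly, here is a minimal sketch of what such a copy could look like. It is not part of the original exchange: sequential (SeqDense) matrices are assumed, the function name and the M/N arguments are made up, and the calls are the PETSc 2.3.x-era ones used elsewhere in these threads. PETSc stores dense matrices column by column, which is what the index arithmetic below relies on.

#include "petscmat.h"

/* Sketch only: stack three M x N SeqDense matrices into one (3M) x N SeqDense
   matrix by copying the raw arrays obtained with MatGetArray(). */
PetscErrorCode StackThreeDense(Mat A1, Mat A2, Mat A3, PetscInt M, PetscInt N, Mat *A)
{
  PetscErrorCode ierr;
  PetscScalar    *a, *a1, *a2, *a3;
  PetscInt       i, j;

  PetscFunctionBegin;
  ierr = MatCreateSeqDense(PETSC_COMM_SELF, 3*M, N, PETSC_NULL, A);CHKERRQ(ierr);
  ierr = MatGetArray(*A, &a);CHKERRQ(ierr);
  ierr = MatGetArray(A1, &a1);CHKERRQ(ierr);
  ierr = MatGetArray(A2, &a2);CHKERRQ(ierr);
  ierr = MatGetArray(A3, &a3);CHKERRQ(ierr);
  for (j = 0; j < N; j++) {               /* column j of the stacked matrix */
    for (i = 0; i < M; i++) {
      a[j*3*M + i]       = a1[j*M + i];   /* rows 0  .. M-1  come from A1 */
      a[j*3*M + M + i]   = a2[j*M + i];   /* rows M  .. 2M-1 come from A2 */
      a[j*3*M + 2*M + i] = a3[j*M + i];   /* rows 2M .. 3M-1 come from A3 */
    }
  }
  ierr = MatRestoreArray(A1, &a1);CHKERRQ(ierr);
  ierr = MatRestoreArray(A2, &a2);CHKERRQ(ierr);
  ierr = MatRestoreArray(A3, &a3);CHKERRQ(ierr);
  ierr = MatRestoreArray(*A, &a);CHKERRQ(ierr);
  ierr = MatAssemblyBegin(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

A copy of this kind replaces the per-entry MatSetValues() calls of the original code, which is presumably where the time was going.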
From recrusader at gmail.com Sat Apr 19 18:08:50 2008 From: recrusader at gmail.com (Yujie) Date: Sat, 19 Apr 2008 16:08:50 -0700 Subject: how to combine several matrice into one matrix In-Reply-To: <23BC215B-E85A-401B-A1BE-96A4CACF183F@mcs.anl.gov> References: <7ff0ee010804181840i2195c9e1wf6757ce6faff5a72@mail.gmail.com> <23BC215B-E85A-401B-A1BE-96A4CACF183F@mcs.anl.gov> Message-ID: <7ff0ee010804191608t120e5fa2hbafbaf243b22440b@mail.gmail.com> Dear Barry: Regarding my method, On 4/18/08, Barry Smith wrote: > > > For dense matrices only. > > You can call MatGetArray() on A and then do direct copies of the arrays. > > Barry > > On Apr 18, 2008, at 8:40 PM, Yujie wrote: > > Hi, everyone > > > > Assuming there are A1(M*N) A2(M*N) A3(M*N), I want to get > > A1 > > A=A2 > > A3 > > > > My method is > > > > MatGetArray(A1,&a1); > > MatSetValues(A,a1); > > MatGetArray(A2,&a2); > > MatSetValues(A,a2); > > MatGetArray(A3,&a3); > > MatSetValues(A,a3); > > > > Is there any better methods for it? The above codes are slow. thanks a > > lot. > > > > Regards, > > Yujie > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From w_subber at yahoo.com Sun Apr 20 11:43:25 2008 From: w_subber at yahoo.com (Waad Subber) Date: Sun, 20 Apr 2008 09:43:25 -0700 (PDT) Subject: MatMatMult Message-ID: <3922.29302.qm@web38202.mail.mud.yahoo.com> Hi I want to multiply two sparse sequential matrices. In order to do that I think I should use MatMatMult ; however, I need the expected fill ratio which I don't know in advance. For matrix A and matrix B I might get the nnz(A) and nnz(B) from MatGetInfo. What about nnz(C) ? Thanks Waad --------------------------------- Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. -------------- next part -------------- An HTML attachment was scrubbed... URL: From bsmith at mcs.anl.gov Sun Apr 20 12:59:29 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Sun, 20 Apr 2008 12:59:29 -0500 Subject: MatMatMult In-Reply-To: <3922.29302.qm@web38202.mail.mud.yahoo.com> References: <3922.29302.qm@web38202.mail.mud.yahoo.com> Message-ID: <4838A11C-49C9-4B40-9A3C-3A96C06696C0@mcs.anl.gov> Waad, There is no way to compute this in advance. Use PETSC_DEFAULT to use the default estimate. You can run the program with -info and search for "Fill ratio" this will give you needed value which will give you an idea of what to use in the future. I have added this info to the manual page. Barry Note that the needed ratio does depend on the matrix size so you may need to adjust the value for larger matrices. On Apr 20, 2008, at 11:43 AM, Waad Subber wrote: > Hi > > I want to multiply two sparse sequential matrices. In order to do > that I think I should use MatMatMult ; however, I need the expected > fill ratio which I don't know in advance. > > For matrix A and matrix B I might get the nnz(A) and nnz(B) from > MatGetInfo. What about nnz(C) ? > > Thanks > > Waad > > > > Be a better friend, newshound, and know-it-all with Yahoo! Mobile. > Try it now. 
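As a quick illustration of the answer above, a call with the default fill estimate could look like the following sketch; it is not from the original thread, the function name is made up, and A and B are assumed to be already assembled sequential sparse matrices.

#include "petscmat.h"

/* Sketch only: form C = A*B letting PETSc estimate nnz(C). */
PetscErrorCode MultiplySketch(Mat A, Mat B, Mat *C)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  /* PETSC_DEFAULT uses the default fill estimate; running with -info and
     searching for "Fill ratio" shows the value that was actually needed,
     which can then be passed here instead of PETSC_DEFAULT. */
  ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, C);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}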
From amjad11 at gmail.com Mon Apr 21 00:34:07 2008 From: amjad11 at gmail.com (amjad ali) Date: Mon, 21 Apr 2008 10:34:07 +0500 Subject: general question on speed using quad core Xeons In-Reply-To: <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> Message-ID: <428810f20804202234x3dc79246j88e3afccff7bec5b@mail.gmail.com> Hello Petsc team (especially Satish and Barry). YOU SAID: FOR Better performance (1) high per-CPU memory performance. Each CPU (core in dual core systems) needs to have its own memory bandwith of roughly 2 or more gigabytes. (2) MEMORY BANDWDITH PER CORE, the higher that is the better performance you get. >From these points I started to look for RAM Sticks with higher MHz rates (and obviously CPUs and motherboards supporting this speed). But you also reflected to: http://www.intel.com/performance/server/xeon/hpc_ansys.htm http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm On these pages you pointed out that: systems with CPUs of 20% higher FSB speed are performing 20% better. But you see also RAM speed is 20% higher for the better performing system (i.e 800MHz vs 667 MHz). So my question is that which is the actual indicator of "memory bandwidth"per core? Whether it is (1) CPU's FSB speed (2) RAM speed (3) Motherboard's System Bus Speed. How we could ensure "memory bandwith of roughly 2 or more gigabytes" per CPU core? (Higher CPU's FSB speed, or RAM speed or Motherboard's System Bus Speed). With best regards, Amjad Ali. On 4/16/08, Barry Smith wrote: > > > Cool. The pages to look at are > > http://www.intel.com/performance/server/xeon/hpc_ansys.htm > http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm > > these are the two benchmarks that reflect the bottlenecks of memory > bandwidth. > When going from dual to quad they get 1.2 times the performance, when one > would > like 2 times the performance. > > Barry > > > On Apr 16, 2008, at 9:27 AM, Satish Balay wrote: > > > Just a note: > > > > Intel does publish benchmarks for their chips. > > > > http://www.intel.com/performance/server/xeon/hpcapp.htm > > > > Satish > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tribur at vision.ee.ethz.ch Mon Apr 21 05:54:55 2008 From: tribur at vision.ee.ethz.ch (tribur at vision.ee.ethz.ch) Date: Mon, 21 Apr 2008 12:54:55 +0200 Subject: Schur system + MatShell Message-ID: <20080421125455.sieixiv1j40w4kg8@email.ee.ethz.ch> Dear all, Sorry for switching from Schur to Hypre and back, but I'm trying two approaches at the same time to find the optimal solution for our convection-diffusion/Stokes problems: a) solving the global stiffness matrix directly and in parallel using Petsc and a suitable preconditioner (???) and b) applying first non-overlapping domain decomposition and than solving the Schur complement system. Being concerned with b in the moment, I managed to set up and solve the global Schur system using MATDENSE. The solving works well with, e.g., gmres+jacobi, but the assembling of the global Schur matrix takes too long. Therefore, I'm trying to use the matrix in unassembled form using MatShell. 
Not very successfully, however: 1) When I use KSPGMRES, I got the error [1]PETSC ERROR: MatMult() line 1632 in src/mat/interface/matrix.c [1]PETSC ERROR: PCApplyBAorAB() line 584 in src/ksp/pc/interface/precon.c [1]PETSC ERROR: GMREScycle() line 159 in src/ksp/ksp/impls/gmres/gmres.c [1]PETSC ERROR: KSPSolve_GMRES() line 241 in src/ksp/ksp/impls/gmres/gmres.c [1]PETSC ERROR: KSPSolve() line 379 in src/ksp/ksp/interface/itfunc.c 2) Using KSPBICG, it iterates without error message, but the result is wrong (norm of residual 1.42768 instead of something like 1.0e-10), although my Mat-functions PETSC_SchurMatMult and PETSC_SchurMatMultTranspose seem to be correct. I tested the latter comparing the vectors y1 and y2 computed by, e.g., MatMult(S,x,y1) and PETS_SchurMatMult(S,x,y2). Norm(y1-y2) was < e-15 for both functions. Could you please have a look at my code snippet below? Thank you very much! Kathrin PS: My Code: Vec gtot, x; ... Mat Stot; IS is; ISCreateGeneral(PETSC_COMM_SELF, NPb, &uBId_global[0], &is); localData ctx; ctx.NPb = NPb; //size of local Schur system S ctx.Sloc = &S[0]; ctx.is = is; MatCreateShell(PETSC_COMM_WORLD,m,n,NPb_tot,NPb_tot,&ctx,&Stot); MatShellSetOperation(Stot,MATOP_MULT,(void(*)(void)) PETSC_SchurMatMult); MatShellSetOperation(Stot,MATOP_MULT_TRANSPOSE,(void(*)(void))PETSC_SchurMatMultTranspose); KSP ksp; KSPCreate(PETSC_COMM_WORLD,&ksp); PC prec; KSPSetOperators(ksp,Stot,Stot,DIFFERENT_NONZERO_PATTERN); KSPGetPC(ksp,&prec); PCSetType(prec, PCNONE); KSPSetType(ksp, KSPBICG); KSPSetTolerances(ksp, 1.e-10, 1.e-50,PETSC_DEFAULT,PETSC_DEFAULT); KSPSolve(ksp,gtot,x); ... From petsc-maint at mcs.anl.gov Mon Apr 21 09:18:47 2008 From: petsc-maint at mcs.anl.gov (Satish Balay) Date: Mon, 21 Apr 2008 09:18:47 -0500 (CDT) Subject: general question on speed using quad core Xeons In-Reply-To: <428810f20804202234x3dc79246j88e3afccff7bec5b@mail.gmail.com> References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> <428810f20804202234x3dc79246j88e3afccff7bec5b@mail.gmail.com> Message-ID: On Mon, 21 Apr 2008, amjad ali wrote: > Hello Petsc team (especially Satish and Barry). > > YOU SAID: FOR Better performance > > (1) high per-CPU memory performance. Each CPU (core in dual core systems) > needs to have its own memory bandwith of roughly 2 or more gigabytes. This 2GB/core number is a rabbit out of the hat. We just put some reference point out - a few years back for SMP machines [when the age of multi-core chips hasn't yet begun]. Now Intel has chipsets that can give 25GB/s. They now put 4 cores or 8 cores on this machine. [i.e 6Gb/s for 4core and 3Gb/s for the 8core machine] But the trend now is to cram more and more cores - so expect the number of cores to increase faster than the chipset memory-bandwidth. [i.e badwidth per core is likely to get smaller and smaller] > > (2) MEMORY BANDWDITH PER CORE, the higher that is the better performance you > get. > > From these points I started to look for RAM Sticks with higher MHz rates > (and obviously CPUs and motherboards supporting this speed). > > But you also reflected to: > > http://www.intel.com/performance/server/xeon/hpc_ansys.htm > http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm > > On these pages you pointed out that: systems with CPUs of 20% higher FSB > speed are performing 20% better. But you see also RAM speed is 20% higher > for the better performing system (i.e 800MHz vs 667 MHz). 
> > So my question is that which is the actual indicator of "memory > bandwidth"per core? > Whether it is > (1) CPU's FSB speed > (2) RAM speed > (3) Motherboard's System Bus Speed. The answer is a bit complicated here. It depends upon the system architure. CPU Chip[s] <-----> chipset <-----> memory [banks] - Is the bandwidth on the CPU-Chip side is same as on the memory side? [there are machines where this is different, but most macines use *synchronous* buses - so that the 'memory chipset' does not have to do translation/buffering] For eg - On intel Xeon machine with DDR2-800 - you have [othe memory bus side]: bandwidth = 2(banks)* 2(ddr)* 8(bytes bus) * 800 MHz/sec * = 25.6GByte/sec The othe CPU side - its balanced by FSB1600 => Bandwidth = 1600MHz * 8(bytes bus)* 2(CPU-chips) = 25.6GByte/se So generally all the 3 things you've listed has to *match* correctly. [Some CPUs and chipsets support multiple FSB frequencies - so have to check what freq is set for the machine you are buying.] This choice can have *cost* implications.. Is it worth it to spend 20% more to get 20%more bandwidth? Perhaps yes for sparse-matrix appliations - but not for others.. > How we could ensure "memory bandwith of roughly 2 or more gigabytes" per CPU > core? (Higher CPU's FSB speed, or RAM speed or Motherboard's System Bus > Speed). As mentioned 2GB/core is a approximate nubmer we thought off a few years back - when there were no multi-core machine [just SMP chipsets]. All we can do is eavalue the memorybandwidth number for a given machine. We can't *ensure* it - as this is a choice made by and other chip designers.[intel, amd, ibm etc..] The choice for the currently available products was probably made a few years back. There is another component to this memory bandwidth debate. Which of the following do we want? 1. best scalability chip? [when comparing the performance from 1-N cores] 2. overall best performance on 1-core. or N cores [i.e node]. And from the system architecture issues - mentioned above - there are a couple of other issues that influcene this. - are the CPU-Chips sharing bandwidth or spliting bandwidth? - within the CPU-Chip [multi-core] is the memory bus shared or split? The first one can achieved by the hardware spliting up 1/Nth total available bandwidth per core. So it shows scalable results. But the 1-core performance can be low. The second choice could happen by not spliting - but sharing at the core level. For eg: Intel machines - memory bandwidth is divided at the CPU-chip level. For the example case MatMult from ex2 on 8-core intel machine had the following performance on 1,2,4,8 cores: 397, 632, 724, 749 [MFlop/s] To me - its not clear which architecture is better. For publishing scalability results - the above numbers don't look good. [but it could be the best performance you can squeze out any sequential job - or out of any 8-core architecture] Satish From jed at 59A2.org Mon Apr 21 09:53:30 2008 From: jed at 59A2.org (Jed Brown) Date: Mon, 21 Apr 2008 16:53:30 +0200 Subject: flexible block matrix Message-ID: <20080421145330.GA1994@brakk.ethz.ch> I am solving a Stokes problem with nonlinear slip boundary conditions. I don't think I can take advantage of block structure since the normal component of velocity has a Dirichlet constraint and this must be built into the velocity space in order to preserve conditioning. 
An alternative formulation involves a Lagrange multiplier for the constraint, but even with clever preconditioning, this system is still more expensive to solve according to [1]. In solving the (velocity-pressure) saddle point problem, many approximate solves with the velocity system is needed in the preconditioner, hence I need a strong preconditioner for the velocity system. Currently, I am using algebraic multigrid on a low-order discretization which works fairly well. Since Hypre and ML only take AIJ matrices, perhaps I shouldn't worry about blocking after all. Is there a way to use MATBAIJ when some nodes have fewer degrees of freedom? Should I bother? Note that my method (currently just a single element) uses a high order discretization on some elements and low order on others. The global matrix for the low order elements is assembled, but it is applied locally for the high order elements taking advantage of the tensor product basis. For the preconditioner, a low order discretization on the nodes of the high order elements is globally assembled and added to the global matrix from the low-order elements. Experiments with a single element (spectral rather than spectral/hp element) show this to be effective, converging in a constant number of iterations independent of polynomial order when using a V-cycle of AMG as a preconditioner. Thanks. Jed [1] B?nsch, H?hn 2000, `Numerical treatment of the Navier-Stokes equations with slip boundary conditions', SIAM J. Sci. Comput. From balay at mcs.anl.gov Mon Apr 21 10:33:39 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 21 Apr 2008 10:33:39 -0500 (CDT) Subject: flexible block matrix In-Reply-To: <20080421145330.GA1994@brakk.ethz.ch> References: <20080421145330.GA1994@brakk.ethz.ch> Message-ID: On Mon, 21 Apr 2008, Jed Brown wrote: > I am solving a Stokes problem with nonlinear slip boundary conditions. I don't > think I can take advantage of block structure since the normal component of > velocity has a Dirichlet constraint and this must be built into the velocity > space in order to preserve conditioning. An alternative formulation involves a > Lagrange multiplier for the constraint, but even with clever preconditioning, > this system is still more expensive to solve according to [1]. > > In solving the (velocity-pressure) saddle point problem, many approximate solves > with the velocity system is needed in the preconditioner, hence I need a strong > preconditioner for the velocity system. Currently, I am using algebraic > multigrid on a low-order discretization which works fairly well. Since Hypre > and ML only take AIJ matrices, perhaps I shouldn't worry about blocking after > all. Is there a way to use MATBAIJ when some nodes have fewer degrees of > freedom? Should I bother? I'll say - don't bother. BAIJ can't support varing block size. The code that supports it is INODE code - which is already part of AIJ type - and is the default for AIJ. You can run your code with -mat_no_inode to see the performance difference between basic AIJ and INODE-AIJ. [The primary thing to look for in -log_summary is MatMult()] Inode code looks for consequitive *rows* with same column indices, and marks them as a single inode. For each inode - [i.e say 5 rows] the column indices are loaded only once, and used for all 5 rows - thus improving the performance. A matrix can have an inode structure of [2,2,3,3,1,3] etc.. i.e 14x14 matrix. 
Satish

> Note that my method (currently just a single element) uses a high order
> discretization on some elements and low order on others. The global matrix for
> the low order elements is assembled, but it is applied locally for the high order
> elements taking advantage of the tensor product basis. For the preconditioner,
> a low order discretization on the nodes of the high order elements is globally
> assembled and added to the global matrix from the low-order elements.
> Experiments with a single element (spectral rather than spectral/hp element)
> show this to be effective, converging in a constant number of iterations
> independent of polynomial order when using a V-cycle of AMG as a preconditioner.
>
> Thanks.
>
> Jed
>
>
> [1] B?nsch, H?hn 2000, `Numerical treatment of the Navier-Stokes equations with
> slip boundary conditions', SIAM J. Sci. Comput.
>
>
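To make the inode explanation above concrete, here is a small sketch that is not part of the original thread: the function name, sizes and values are invented, but it builds the kind of AIJ matrix the inode code benefits from, namely one where the two rows belonging to each point carry identical column indices. Running such a code with and without -mat_no_inode and comparing the MatMult line of -log_summary shows the difference being described.

#include "petscmat.h"

/* Sketch only: 2 interleaved unknowns per point, so rows 2p and 2p+1 have
   the same column pattern and are grouped into a single inode of size 2. */
PetscErrorCode BuildInterleavedAIJ(PetscInt npoints, Mat *A)
{
  PetscErrorCode ierr;
  PetscInt       p, rows[2], cols[2];
  PetscScalar    v[4] = {4.0, -1.0, -1.0, 4.0};   /* illustrative 2x2 block */

  PetscFunctionBegin;
  ierr = MatCreateSeqAIJ(PETSC_COMM_SELF, 2*npoints, 2*npoints, 2, PETSC_NULL, A);CHKERRQ(ierr);
  for (p = 0; p < npoints; p++) {
    rows[0] = cols[0] = 2*p;
    rows[1] = cols[1] = 2*p + 1;
    /* both rows of this point get the same column indices */
    ierr = MatSetValues(*A, 2, rows, 2, cols, v, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}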
From balay at mcs.anl.gov Mon Apr 21 10:55:21 2008
From: balay at mcs.anl.gov (Satish Balay)
Date: Mon, 21 Apr 2008 10:55:21 -0500 (CDT)
Subject: Schur system + MatShell
In-Reply-To: <20080421125455.sieixiv1j40w4kg8@email.ee.ethz.ch>
References: <20080421125455.sieixiv1j40w4kg8@email.ee.ethz.ch>
Message-ID: 

On Mon, 21 Apr 2008, tribur at vision.ee.ethz.ch wrote:

> Dear all,
>
> Sorry for switching from Schur to Hypre and back, but I'm trying two
> approaches at the same time to find the optimal solution for our
> convection-diffusion/Stokes problems: a) solving the global stiffness matrix
> directly and in parallel using Petsc and a suitable preconditioner (???) and
> b) applying first non-overlapping domain decomposition and than solving the
> Schur complement system.
>
> Being concerned with b in the moment, I managed to set up and solve the global
> Schur system using MATDENSE. The solving works well with, e.g., gmres+jacobi,
> but the assembling of the global Schur matrix takes too long.
Hmm - with dense - if you have some other efficient way of assembling the matrix - you can specify this directly to MatCreateMPIDense() - [or use MatGetArray() - and set the values directly into this array] > Therefore, I'm > trying to use the matrix in unassembled form using MatShell. Not very > successfully, however: > > 1) When I use KSPGMRES, I got the error > [1]PETSC ERROR: MatMult() line 1632 in src/mat/interface/matrix.c > [1]PETSC ERROR: PCApplyBAorAB() line 584 in src/ksp/pc/interface/precon.c > [1]PETSC ERROR: GMREScycle() line 159 in src/ksp/ksp/impls/gmres/gmres.c > [1]PETSC ERROR: KSPSolve_GMRES() line 241 in src/ksp/ksp/impls/gmres/gmres.c > [1]PETSC ERROR: KSPSolve() line 379 in src/ksp/ksp/interface/itfunc.c Which version of PETSc is this? I can't place the line numbers correctly with latest petsc-2.3.3. [Can you send the complete error trace?] > 2) Using KSPBICG, it iterates without error message, but the result is wrong > (norm of residual 1.42768 instead of something like 1.0e-10), although my > Mat-functions PETSC_SchurMatMult and PETSC_SchurMatMultTranspose seem to be > correct. I tested the latter comparing the vectors y1 and y2 computed by, > e.g., MatMult(S,x,y1) and PETS_SchurMatMult(S,x,y2). Norm(y1-y2) was < e-15 > for both functions. Not sure what the problem could be. Can you confirm that the code is valgrind clean? It could explain the issue 1 aswell. With mpich2 you can do the following on linux: mpiexec -np 2 valgrind --tool=memcheck ./executable Satish > > > Could you please have a look at my code snippet below? > > Thank you very much! > Kathrin > > > > PS: My Code: > > Vec gtot, x; > ... > Mat Stot; IS is; > ISCreateGeneral(PETSC_COMM_SELF, NPb, &uBId_global[0], &is); > localData ctx; > ctx.NPb = NPb; //size of local Schur system S > ctx.Sloc = &S[0]; > ctx.is = is; > MatCreateShell(PETSC_COMM_WORLD,m,n,NPb_tot,NPb_tot,&ctx,&Stot); > MatShellSetOperation(Stot,MATOP_MULT,(void(*)(void)) PETSC_SchurMatMult); > MatShellSetOperation(Stot,MATOP_MULT_TRANSPOSE,(void(*)(void))PETSC_SchurMatMultTranspose); > KSP ksp; > KSPCreate(PETSC_COMM_WORLD,&ksp); > PC prec; > KSPSetOperators(ksp,Stot,Stot,DIFFERENT_NONZERO_PATTERN); > KSPGetPC(ksp,&prec); > PCSetType(prec, PCNONE); > KSPSetType(ksp, KSPBICG); > KSPSetTolerances(ksp, 1.e-10, 1.e-50,PETSC_DEFAULT,PETSC_DEFAULT); > KSPSolve(ksp,gtot,x); > ... > > > From bsmith at mcs.anl.gov Mon Apr 21 11:43:10 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Mon, 21 Apr 2008 11:43:10 -0500 Subject: Schur system + MatShell In-Reply-To: <20080421125455.sieixiv1j40w4kg8@email.ee.ethz.ch> References: <20080421125455.sieixiv1j40w4kg8@email.ee.ethz.ch> Message-ID: <92B048D2-BCC4-4463-9EE0-9353959C6B11@mcs.anl.gov> On Apr 21, 2008, at 5:54 AM, tribur at vision.ee.ethz.ch wrote: > Dear all, > > Sorry for switching from Schur to Hypre and back, but I'm trying two > approaches at the same time to find the optimal solution for our > convection-diffusion/Stokes problems: a) solving the global > stiffness matrix directly and in parallel using Petsc and a suitable > preconditioner (???) and b) applying first non-overlapping domain > decomposition and than solving the Schur complement system. > > Being concerned with b in the moment, I managed to set up and solve > the global Schur system using MATDENSE. The solving works well with, > e.g., gmres+jacobi, but the assembling of the global Schur matrix > takes too long. 
Even if GMRES+Jacobi works reasonably well, GMRES without Jacobi can be much worse, this is a danger of matrix free without some kind of preconditioner. > Therefore, I'm trying to use the matrix in unassembled form using > MatShell. Not very successfully, however: > > 1) When I use KSPGMRES, I got the error > [1]PETSC ERROR: MatMult() line 1632 in src/mat/interface/matrix.c > [1]PETSC ERROR: PCApplyBAorAB() line 584 in src/ksp/pc/interface/ > precon.c > [1]PETSC ERROR: GMREScycle() line 159 in src/ksp/ksp/impls/gmres/ > gmres.c > [1]PETSC ERROR: KSPSolve_GMRES() line 241 in src/ksp/ksp/impls/gmres/ > gmres.c > [1]PETSC ERROR: KSPSolve() line 379 in src/ksp/ksp/interface/itfunc.c > > 2) Using KSPBICG, it iterates without error message, but the result > is wrong (norm of residual 1.42768 instead of something like > 1.0e-10), although my Mat-functions PETSC_SchurMatMult and > PETSC_SchurMatMultTranspose seem to be correct. I tested the latter > comparing the vectors y1 and y2 computed by, e.g., MatMult(S,x,y1) > and PETS_SchurMatMult(S,x,y2). Norm(y1-y2) was < e-15 for both > functions. Run it with -ksp_monitor_true_residual (-ksp_truemonitor) for PETSc pre 2.3.3) and -ksp_converged_reason to see what is happening. Note that KSPSolve() does NOT generate an error if it fails to converge, you need to check with KSPGetConvergedReason() or -ksp_converged_reason after the solve to see if KSP thinks it has converged or why it did not converge. Barry > > > > Could you please have a look at my code snippet below? > > Thank you very much! > Kathrin > > > > PS: My Code: > > Vec gtot, x; > ... > Mat Stot; IS is; > ISCreateGeneral(PETSC_COMM_SELF, NPb, &uBId_global[0], &is); > localData ctx; > ctx.NPb = NPb; //size of local Schur system S > ctx.Sloc = &S[0]; > ctx.is = is; > MatCreateShell(PETSC_COMM_WORLD,m,n,NPb_tot,NPb_tot,&ctx,&Stot); > MatShellSetOperation(Stot,MATOP_MULT,(void(*)(void)) > PETSC_SchurMatMult); MatShellSetOperation(Stot,MATOP_MULT_TRANSPOSE, > (void(*)(void))PETSC_SchurMatMultTranspose); > KSP ksp; > KSPCreate(PETSC_COMM_WORLD,&ksp); > PC prec; > KSPSetOperators(ksp,Stot,Stot,DIFFERENT_NONZERO_PATTERN); > KSPGetPC(ksp,&prec); > PCSetType(prec, PCNONE); > KSPSetType(ksp, KSPBICG); > KSPSetTolerances(ksp, 1.e-10, 1.e-50,PETSC_DEFAULT,PETSC_DEFAULT); > KSPSolve(ksp,gtot,x); > ... > > From bsmith at mcs.anl.gov Mon Apr 21 11:47:26 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Mon, 21 Apr 2008 11:47:26 -0500 Subject: flexible block matrix In-Reply-To: <20080421145330.GA1994@brakk.ethz.ch> References: <20080421145330.GA1994@brakk.ethz.ch> Message-ID: <553D810D-E87C-4084-A9C4-F4F82B690D6B@mcs.anl.gov> I concur with Satish, AIJ with inodes is essentially variable block size so trying to force BAIJ when it is not appropriate is unnecessary. Barry On Apr 21, 2008, at 9:53 AM, Jed Brown wrote: > I am solving a Stokes problem with nonlinear slip boundary > conditions. I don't > think I can take advantage of block structure since the normal > component of > velocity has a Dirichlet constraint and this must be built into the > velocity > space in order to preserve conditioning. An alternative formulation > involves a > Lagrange multiplier for the constraint, but even with clever > preconditioning, > this system is still more expensive to solve according to [1]. > > In solving the (velocity-pressure) saddle point problem, many > approximate solves > with the velocity system is needed in the preconditioner, hence I > need a strong > preconditioner for the velocity system. 
Currently, I am using > algebraic > multigrid on a low-order discretization which works fairly well. > Since Hypre > and ML only take AIJ matrices, perhaps I shouldn't worry about > blocking after > all. Is there a way to use MATBAIJ when some nodes have fewer > degrees of > freedom? Should I bother? > > Note that my method (currently just a single element) uses a high > order > discretization on some elements and low order on others. The global > matrix for > the low order elements is assembled, but it is applied locally for > the high order > elements taking advantage of the tensor product basis. For the > preconditioner, > a low order discretization on the nodes of the high order elements > is globally > assembled and added to the global matrix from the low-order elements. > Experiments with a single element (spectral rather than spectral/hp > element) > show this to be effective, converging in a constant number of > iterations > independent of polynomial order when using a V-cycle of AMG as a > preconditioner. > > Thanks. > > Jed > > > [1] B?nsch, H?hn 2000, `Numerical treatment of the Navier-Stokes > equations with > slip boundary conditions', SIAM J. Sci. Comput. > From amjad11 at gmail.com Tue Apr 22 00:45:42 2008 From: amjad11 at gmail.com (amjad ali) Date: Tue, 22 Apr 2008 10:45:42 +0500 Subject: general question on speed using quad core Xeons In-Reply-To: References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> <428810f20804202234x3dc79246j88e3afccff7bec5b@mail.gmail.com> Message-ID: <428810f20804212245y27fab8bfh336aa5a26ff98528@mail.gmail.com> Hello Dr. Satish, Thanks for your intellectual reply. > > The othe CPU side - its balanced by FSB1600 => > Bandwidth = 1600MHz * 8(bytes bus)* 2(CPU-chips) = 25.6GByte/se > > So generally all the 3 things you've listed has to *match* correctly. > [Some CPUs and chipsets support multiple FSB frequencies - so have to > check what freq is set for the machine you are buying.] Currently I am making a gigabit ethernet cluster of 4 compute nodes (totaling 8 cores), with each node having One Processor: Intel Core2Duo E6750 2.66 GHz Processor, FSB 1333MHz, 4MB L2. Motherboard: Intel Entry Server Board Intel S3200SHV with intel 3200 Chipset supporting 1333/1066/800 MHz FSB . RAM: 2GB DDR2 800MHz ECC System Memory. What memory-bandwidth/CPU-core will be there for this system? Any other comment/remark? My area work deals in sparse matrices. I near future I would like to add 12 similar compute nodes in the cluster. On such a cluster what if I relapce "C2D 2.66 GHz FSB1333 processor" with "Intel Xeon 3070/3075 2.66 GHz FSB1066/1333 processor"? Would there be any significant improvement in performance? with best regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tribur at vision.ee.ethz.ch Tue Apr 22 07:06:12 2008 From: tribur at vision.ee.ethz.ch (tribur at vision.ee.ethz.ch) Date: Tue, 22 Apr 2008 14:06:12 +0200 Subject: Schur system + MatShell Message-ID: <20080422140612.f924l740asc0g8sk@email.ee.ethz.ch> Dear Satish, dear Barry, dear rest, Thank you for your response. >> Being concerned with b in the moment, I managed to set up and solve >> the global >> Schur system using MATDENSE. The solving works well with, e.g., >> gmres+jacobi, >> but the assembling of the global Schur matrix takes too long. 
> > Hmm - with dense - if you have some other efficient way of assembling > the matrix - you can specify this directly to MatCreateMPIDense() - [or > use MatGetArray() - and set the values directly into this array] I don't see an alternative, as the partitioning of PETSc has nothing to do with my partitioning (unstructured mesh, partitioned with Metis). Moreover, in case of 2 Subdomains, e.g., the local Schur complements S1 and S2 have the same size as the global one, S=S1+S2, and there is no matrix format in PETSc supporting this, isn't it? >> 2) Using KSPBICG, it iterates without error message, but the result is wrong >> (norm of residual 1.42768 instead of something like 1.0e-10), although my >> Mat-functions PETSC_SchurMatMult and PETSC_SchurMatMultTranspose seem to be >> correct. I tested the latter comparing the vectors y1 and y2 computed by, >> e.g., MatMult(S,x,y1) and PETS_SchurMatMult(S,x,y2). Norm(y1-y2) was < e-15 >> for both functions. > > Not sure what the problem could be. Can you confirm that the code is > valgrind clean? It could explain the issue 1 aswell. Valgrind didn't find an error in my PETSC_SchurMatMult, but PETSc gave me now also an error message when running with KSPBICG (same MatShell-code as in my previous e-mail): [1]PETSC ERROR: MatMult() line 1632 in src/mat/interface/matrix.c [1]PETSC ERROR: KSPSolve_BiCG() line 95 in src/ksp/ksp/impls/bicg/bicg.c [1]PETSC ERROR: KSPSolve() line 379 in src/ksp/ksp/interface/itfunc.c The error seems to occurr at the second call of PETSC_SchurMatMult. I attached the related source files (petsc version petsc-2.3.3-p8, downloaded about 4 months ago), and below you find additionally my MatMult-function PETSC_SchurMatMult (maybe there is problem with ctx?). I'm stuck and I'll be very grateful for any help, Kathrin PS My user defined MatMult-function: typedef struct { int NPb; IS is; //int *uBId_global; double *Sloc; } localData; void PETSC_SchurMatMult(Mat Stot, Vec xtot, Vec ytot){ localData * ctx; MatShellGetContext(Stot, (void**) &ctx); int NPb = ctx->NPb; IS is = ctx->is; double *Sloc = ctx->Sloc; //extracting local vector xloc Vec xloc; VecCreateSeq(PETSC_COMM_SELF, NPb, &xloc); VecScatter ctx2; VecScatterCreate(xtot,is,xloc,PETSC_NULL, &ctx2); VecScatterBegin(ctx2,xtot,xloc,INSERT_VALUES,SCATTER_FORWARD); VecScatterEnd(ctx2,xtot,xloc,INSERT_VALUES,SCATTER_FORWARD); VecScatterDestroy(ctx2); //local matrix multiplication vector yloc_array(NPb,0); PetscScalar *xloc_array; VecGetArray(xloc, &xloc_array); for(int k=0; k References: <20080422140612.f924l740asc0g8sk@email.ee.ethz.ch> Message-ID: On Apr 22, 2008, at 7:06 AM, tribur at vision.ee.ethz.ch wrote: > Dear Satish, dear Barry, dear rest, > > Thank you for your response. > >>> Being concerned with b in the moment, I managed to set up and >>> solve the global >>> Schur system using MATDENSE. The solving works well with, e.g., >>> gmres+jacobi, >>> but the assembling of the global Schur matrix takes too long. >> >> Hmm - with dense - if you have some other efficient way of assembling >> the matrix - you can specify this directly to MatCreateMPIDense() - >> [or >> use MatGetArray() - and set the values directly into this array] > > I don't see an alternative, as the partitioning of PETSc has nothing > to do with my partitioning (unstructured mesh, partitioned with > Metis). Moreover, in case of 2 Subdomains, e.g., the local Schur > complements S1 and S2 have the same size as the global one, S=S1+S2, > and there is no matrix format in PETSc supporting this, isn't it? 
> > >>> 2) Using KSPBICG, it iterates without error message, but the >>> result is wrong >>> (norm of residual 1.42768 instead of something like 1.0e-10), >>> although my >>> Mat-functions PETSC_SchurMatMult and PETSC_SchurMatMultTranspose >>> seem to be >>> correct. I tested the latter comparing the vectors y1 and y2 >>> computed by, >>> e.g., MatMult(S,x,y1) and PETS_SchurMatMult(S,x,y2). Norm(y1-y2) >>> was < e-15 >>> for both functions. >> >> Not sure what the problem could be. Can you confirm that the code is >> valgrind clean? It could explain the issue 1 aswell. > > Valgrind didn't find an error in my PETSC_SchurMatMult, but PETSc > gave me now also an error message when running with KSPBICG (same > MatShell-code as in my previous e-mail): > > [1]PETSC ERROR: MatMult() line 1632 in src/mat/interface/matrix.c > [1]PETSC ERROR: KSPSolve_BiCG() line 95 in src/ksp/ksp/impls/bicg/ > bicg.c > [1]PETSC ERROR: KSPSolve() line 379 in src/ksp/ksp/interface/itfunc.c > What is the error message? This just tells you a problem in MatMult(). You need to send EVERYTHING that was printed when the program stopped with an error. Send it to petsc-maint at mcs.anl.gov Barry > The error seems to occurr at the second call of PETSC_SchurMatMult. > I attached the related source files (petsc version petsc-2.3.3-p8, > downloaded about 4 months ago), and below you find additionally my > MatMult-function PETSC_SchurMatMult (maybe there is problem with > ctx?). > > I'm stuck and I'll be very grateful for any help, > Kathrin > > > PS My user defined MatMult-function: > > typedef struct { > int NPb; > IS is; //int *uBId_global; > double *Sloc; > } localData; > > > void PETSC_SchurMatMult(Mat Stot, Vec xtot, Vec ytot){ > localData * ctx; > MatShellGetContext(Stot, (void**) &ctx); > int NPb = ctx->NPb; IS is = ctx->is; > double *Sloc = ctx->Sloc; > > //extracting local vector xloc > Vec xloc; > VecCreateSeq(PETSC_COMM_SELF, NPb, &xloc); > VecScatter ctx2; > VecScatterCreate(xtot,is,xloc,PETSC_NULL, &ctx2); > VecScatterBegin(ctx2,xtot,xloc,INSERT_VALUES,SCATTER_FORWARD); > VecScatterEnd(ctx2,xtot,xloc,INSERT_VALUES,SCATTER_FORWARD); > VecScatterDestroy(ctx2); > > //local matrix multiplication > vector yloc_array(NPb,0); > PetscScalar *xloc_array; > VecGetArray(xloc, &xloc_array); > for(int k=0; k for(int l=0; l yloc_array[k] += Sloc[k*NPb+l] * xloc_array[l]; > VecRestoreArray(xloc, &xloc_array); > VecDestroy(xloc); > > //scatter yloc to ytot > Vec yloc; > VecCreateSeqWithArray(PETSC_COMM_SELF, NPb, PETSC_NULL, &yloc); > VecPlaceArray(yloc,&yloc_array[0]); > VecScatter ctx3; > VecScatterCreate(yloc, PETSC_NULL, ytot, is, &ctx3); > VecScatterBegin(ctx3, yloc, ytot, ADD_VALUES, SCATTER_FORWARD); > VecScatterEnd(ctx3, yloc, ytot, ADD_VALUES, SCATTER_FORWARD); > VecScatterDestroy(ctx3); > VecDestroy(yloc); > > } > > > > > > > > From knepley at gmail.com Tue Apr 22 07:16:23 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 22 Apr 2008 07:16:23 -0500 Subject: Schur system + MatShell In-Reply-To: References: <20080422140612.f924l740asc0g8sk@email.ee.ethz.ch> Message-ID: On Tue, Apr 22, 2008 at 7:11 AM, Barry Smith wrote: > > On Apr 22, 2008, at 7:06 AM, tribur at vision.ee.ethz.ch wrote: > > > > Dear Satish, dear Barry, dear rest, > > > > Thank you for your response. > > > > > > > > > > > Being concerned with b in the moment, I managed to set up and solve > the global > > > > Schur system using MATDENSE. 
The solving works well with, e.g., > gmres+jacobi, > > > > but the assembling of the global Schur matrix takes too long. > > > > > > > > > > Hmm - with dense - if you have some other efficient way of assembling > > > the matrix - you can specify this directly to MatCreateMPIDense() - [or > > > use MatGetArray() - and set the values directly into this array] > > > > > > > I don't see an alternative, as the partitioning of PETSc has nothing to do > with my partitioning (unstructured mesh, partitioned with Metis). Moreover, > in case of 2 Subdomains, e.g., the local Schur complements S1 and S2 have > the same size as the global one, S=S1+S2, and there is no matrix format in > PETSc supporting this, isn't it? This does not make sense to me. You decide how PETSc partitions things (if you want), And, I really do not understand what you want in parallel. If you mean that you solve the local Schur complements independently, then use a local matrix for each one. The important thing is to work out the linear algebra prior to coding. Then wrapping it with PETSc Mat/Vec is easy. Matt > > > > > > > > > > > > > 2) Using KSPBICG, it iterates without error message, but the result is > wrong > > > > (norm of residual 1.42768 instead of something like 1.0e-10), although > my > > > > Mat-functions PETSC_SchurMatMult and PETSC_SchurMatMultTranspose seem > to be > > > > correct. I tested the latter comparing the vectors y1 and y2 computed > by, > > > > e.g., MatMult(S,x,y1) and PETS_SchurMatMult(S,x,y2). Norm(y1-y2) was < > e-15 > > > > for both functions. > > > > > > > > > > Not sure what the problem could be. Can you confirm that the code is > > > valgrind clean? It could explain the issue 1 aswell. > > > > > > > Valgrind didn't find an error in my PETSC_SchurMatMult, but PETSc gave me > now also an error message when running with KSPBICG (same MatShell-code as > in my previous e-mail): > > > > [1]PETSC ERROR: MatMult() line 1632 in src/mat/interface/matrix.c > > [1]PETSC ERROR: KSPSolve_BiCG() line 95 in src/ksp/ksp/impls/bicg/bicg.c > > [1]PETSC ERROR: KSPSolve() line 379 in src/ksp/ksp/interface/itfunc.c > > > > > > What is the error message? This just tells you a problem in MatMult(). You > need to send EVERYTHING that was > printed when the program stopped with an error. Send it to > petsc-maint at mcs.anl.gov > > > Barry > > > > > > The error seems to occurr at the second call of PETSC_SchurMatMult. > > I attached the related source files (petsc version petsc-2.3.3-p8, > downloaded about 4 months ago), and below you find additionally my > MatMult-function PETSC_SchurMatMult (maybe there is problem with ctx?). 
> > > > I'm stuck and I'll be very grateful for any help, > > Kathrin > > > > > > PS My user defined MatMult-function: > > > > typedef struct { > > int NPb; > > IS is; //int *uBId_global; > > double *Sloc; > > } localData; > > > > > > void PETSC_SchurMatMult(Mat Stot, Vec xtot, Vec ytot){ > > localData * ctx; > > MatShellGetContext(Stot, (void**) &ctx); > > int NPb = ctx->NPb; IS is = ctx->is; > > double *Sloc = ctx->Sloc; > > > > //extracting local vector xloc > > Vec xloc; > > VecCreateSeq(PETSC_COMM_SELF, NPb, &xloc); > > VecScatter ctx2; > > VecScatterCreate(xtot,is,xloc,PETSC_NULL, &ctx2); > > VecScatterBegin(ctx2,xtot,xloc,INSERT_VALUES,SCATTER_FORWARD); > > VecScatterEnd(ctx2,xtot,xloc,INSERT_VALUES,SCATTER_FORWARD); > > VecScatterDestroy(ctx2); > > > > //local matrix multiplication > > vector yloc_array(NPb,0); > > PetscScalar *xloc_array; > > VecGetArray(xloc, &xloc_array); > > for(int k=0; k > for(int l=0; l > yloc_array[k] += Sloc[k*NPb+l] * xloc_array[l]; > > VecRestoreArray(xloc, &xloc_array); > > VecDestroy(xloc); > > > > //scatter yloc to ytot > > Vec yloc; > > VecCreateSeqWithArray(PETSC_COMM_SELF, NPb, PETSC_NULL, &yloc); > > VecPlaceArray(yloc,&yloc_array[0]); > > VecScatter ctx3; > > VecScatterCreate(yloc, PETSC_NULL, ytot, is, &ctx3); > > VecScatterBegin(ctx3, yloc, ytot, ADD_VALUES, SCATTER_FORWARD); > > VecScatterEnd(ctx3, yloc, ytot, ADD_VALUES, SCATTER_FORWARD); > > VecScatterDestroy(ctx3); > > VecDestroy(yloc); > > > > } > > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From amjad11 at gmail.com Tue Apr 22 08:43:29 2008 From: amjad11 at gmail.com (amjad ali) Date: Tue, 22 Apr 2008 14:43:29 +0100 Subject: Selection between C2D and Xeon 3000 for PETSc Sparse solvers Message-ID: <428810f20804220643r618753dayb3cae42b9f92b7e7@mail.gmail.com> Hello, Please help me out in selecting any one choice of the following: (Currently I am making a gigabit ethernet cluster of 4 compute nodes (totaling 8 cores), with each node having) (Choice 1) One Processor: Intel Core2Duo E6750 2.66 GHz Processor, FSB 1333MHz, 4MB L2. Motherboard: Intel Entry Server Board Intel S3200SHV with intel 3200 Chipset supporting 1333/1066/800 MHz FSB . RAM: 2GB DDR2 800MHz ECC System Memory. (Choice 2) One Processor: Intel Xeon 3075 2.66 GHz FSB1333 4MBL2. Motherboard: Intel Entry Server Board Intel S3200SHV with intel 3200 Chipset supporting 1333/1066/800 MHz FSB . RAM: 2GB DDR2 800MHz ECC System Memory. Which one system has larger memory-bandwidth/CPU-core? Any other comment/remark? My area work deals in sparse matrices. I near future I would like to add 12 similar compute nodes in the cluster. with best regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From petsc-maint at mcs.anl.gov Tue Apr 22 09:08:22 2008 From: petsc-maint at mcs.anl.gov (Satish Balay) Date: Tue, 22 Apr 2008 09:08:22 -0500 (CDT) Subject: general question on speed using quad core Xeons In-Reply-To: <428810f20804212245y27fab8bfh336aa5a26ff98528@mail.gmail.com> References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> <428810f20804202234x3dc79246j88e3afccff7bec5b@mail.gmail.com> <428810f20804212245y27fab8bfh336aa5a26ff98528@mail.gmail.com> Message-ID: On Tue, 22 Apr 2008, amjad ali wrote: > > The othe CPU side - its balanced by FSB1600 => > > Bandwidth = 1600MHz * 8(bytes bus)* 2(CPU-chips) = 25.6GByte/se > > > > So generally all the 3 things you've listed has to *match* correctly. > > [Some CPUs and chipsets support multiple FSB frequencies - so have to > > check what freq is set for the machine you are buying.] > > Currently I am making a gigabit ethernet cluster of 4 compute nodes > (totaling 8 cores), with each node having > One Processor: Intel Core2Duo E6750 2.66 GHz Processor, FSB 1333MHz, 4MB L2. > Motherboard: Intel Entry Server Board Intel S3200SHV with intel 3200 Chipset > supporting 1333/1066/800 MHz FSB . > RAM: 2GB DDR2 800MHz ECC System Memory. > > What memory-bandwidth/CPU-core will be there for this system? > Any other comment/remark? > My area work deals in sparse matrices. > I near future I would like to add 12 similar compute nodes in the cluster. http://www.intel.com/cd/products/services/emea/eng/chipsets/374398.htm It says 12.8 GB/s for DDR2-800. I think the CPU with 1333 => 10.7GB/s It would be unbalanced - and I don't know how this will affect things.. [Perhaps it will perform better than DDR2-677 RAM] > On such a cluster what if I relapce "C2D 2.66 GHz FSB1333 processor" with > "Intel Xeon 3070/3075 2.66 GHz FSB1066/1333 processor"? Would there be any > significant improvement in performance? I doubt it will make a difference. But this is unproven speculation. Satish From Amit.Itagi at seagate.com Tue Apr 22 09:14:51 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Tue, 22 Apr 2008 10:14:51 -0400 Subject: Multiple versions of PetSc Message-ID: Hi, I have a naive question. I have a program that uses a C++, complex version of PetSc. I need to run a second program that uses a C, real version of PetSc. For the first program, I have defined the PETSC_ARCH and PETSC_DIR variables in my .tcshrc . In order to get the second program working, do I need to install a second version of PetSc ? How do I separate the environment variables ? Thanks Rgds, Amit From tribur at vision.ee.ethz.ch Tue Apr 22 09:25:59 2008 From: tribur at vision.ee.ethz.ch (tribur at vision.ee.ethz.ch) Date: Tue, 22 Apr 2008 16:25:59 +0200 Subject: Schur system + MatShell In-Reply-To: References: <20080422140612.f924l740asc0g8sk@email.ee.ethz.ch> Message-ID: <20080422162559.w8uw27kw2044koo8@email.ee.ethz.ch> Dear Matt, > This does not make sense to me. You decide how PETSc partitions things (if > you want), And, I really do not understand what you want in parallel. > If you mean > that you solve the local Schur complements independently, then use a local > matrix for each one. The important thing is to work out the linear > algebra prior > to coding. Then wrapping it with PETSc Mat/Vec is easy. The linear algebra is completely clear. 
Again: I have the local Schur systems given (and NOT the solution of the local Schur systems), and I would like to solve the global Schur complement system in parallel. The global Schur complement system is theoretically constructed by putting and adding elements of the local systems in certain locations of a global matrix. Wrapping this with PETSc Mat/Vec, without the time-intensive assembling, is not easy for me as a PETSc-beginner. But I'm curious of the solution you propose... From knepley at gmail.com Tue Apr 22 09:37:07 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 22 Apr 2008 10:37:07 -0400 Subject: Schur system + MatShell In-Reply-To: <20080422162559.w8uw27kw2044koo8@email.ee.ethz.ch> References: <20080422140612.f924l740asc0g8sk@email.ee.ethz.ch> <20080422162559.w8uw27kw2044koo8@email.ee.ethz.ch> Message-ID: On 4/22/08, tribur at vision.ee.ethz.ch wrote: > Dear Matt, > > > This does not make sense to me. You decide how PETSc partitions things (if > > you want), And, I really do not understand what you want in parallel. > > If you mean > > that you solve the local Schur complements independently, then use a local > > matrix for each one. The important thing is to work out the linear algebra > prior > > to coding. Then wrapping it with PETSc Mat/Vec is easy. > > > > The linear algebra is completely clear. Again: I have the local Schur > systems given (and NOT the solution of the local Schur systems), and I would > like to solve the global Schur complement system in parallel. The global > Schur complement system is theoretically constructed by putting and adding > elements of the local systems in certain locations of a global matrix. > Wrapping this with PETSc Mat/Vec, without the time-intensive assembling, is > not easy for me as a PETSc-beginner. But I'm curious of the solution you > propose... Did you verify that the Schur complement matrix was properly preallocated before assembly? This is the likely source of time. You can run with -info and search for "malloc" in the output. Matt -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From balay at mcs.anl.gov Tue Apr 22 09:41:38 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Tue, 22 Apr 2008 09:41:38 -0500 (CDT) Subject: Multiple versions of PetSc In-Reply-To: References: Message-ID: On Tue, 22 Apr 2008, Amit.Itagi at seagate.com wrote: > > Hi, > > I have a naive question. I have a program that uses a C++, complex version > of PetSc. I need to run a second program that uses a C, real version of > PetSc. For the first program, I have defined the PETSC_ARCH and PETSC_DIR > variables in my .tcshrc . In order to get the second program working, do I > need to install a second version of PetSc ? How do I separate the > environment variables ? You would just install with a different PETSC_ARCH value. Now at compile time - you can use the correct PETSC_ARCH value with make. for eg: ./config/configure.py PETSC_ARCH=linux-complex --with-clanguage=cxx --with-scalar-type=complex make PETSC_ARCH=linux-complex all test make PETSC_ARCH=linux-complex mycode ./config/configure.py PETSC_ARCH=linux-real make PETSC_ARCH=linux-real all test make PETSC_ARCH=linux-real mycode You can set a default PETSC_ARCH in your .cshrc - but to use the other build - you change it at command-line to make [as indicated above] Note: both version can coexist in the same PETSC_DIR. 
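For reference, a user makefile that cooperates with this scheme normally just
pulls in PETSc's common make rules and leaves PETSC_ARCH unset, so whatever
value is given on the make command line (or in the environment) selects the
build. A minimal sketch, assuming the petsc-2.3.3-style layout; the include
path and the example target name "mycode" are illustrative only and may differ
(newer trees have moved these files, e.g. under conf/):

# PETSC_DIR and PETSC_ARCH are taken from the environment or the make command line
include ${PETSC_DIR}/bmake/common/base

mycode: mycode.o chkopts
	-${CLINKER} -o mycode mycode.o ${PETSC_LIB}
	${RM} mycode.o

With something like this in place, "make PETSC_ARCH=linux-complex mycode" and
"make PETSC_ARCH=linux-real mycode" link against the corresponding build
without editing the makefile.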
Satish From knepley at gmail.com Tue Apr 22 09:42:51 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 22 Apr 2008 10:42:51 -0400 Subject: Multiple versions of PetSc In-Reply-To: References: Message-ID: To build a different configuration of PETSc: 1) cd $PETSC_DIR 2) configure with new options, including --PETSC_ARCH= 3) make PETS_ARCH= 4) Build your code with PETSC_ARCH= Matt On 4/22/08, Amit.Itagi at seagate.com wrote: > > Hi, > > I have a naive question. I have a program that uses a C++, complex version > of PetSc. I need to run a second program that uses a C, real version of > PetSc. For the first program, I have defined the PETSC_ARCH and PETSC_DIR > variables in my .tcshrc . In order to get the second program working, do I > need to install a second version of PetSc ? How do I separate the > environment variables ? > > Thanks > > Rgds, > Amit > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From balay at mcs.anl.gov Tue Apr 22 09:52:07 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Tue, 22 Apr 2008 09:52:07 -0500 (CDT) Subject: Schur system + MatShell In-Reply-To: References: <20080422140612.f924l740asc0g8sk@email.ee.ethz.ch> <20080422162559.w8uw27kw2044koo8@email.ee.ethz.ch> Message-ID: On Tue, 22 Apr 2008, Matthew Knepley wrote: > On 4/22/08, tribur at vision.ee.ethz.ch wrote: > > Dear Matt, > > > > > This does not make sense to me. You decide how PETSc partitions things (if > > > you want), And, I really do not understand what you want in parallel. > > > If you mean > > > that you solve the local Schur complements independently, then use a local > > > matrix for each one. The important thing is to work out the linear algebra > > prior > > > to coding. Then wrapping it with PETSc Mat/Vec is easy. > > > > > > > The linear algebra is completely clear. Again: I have the local Schur > > systems given (and NOT the solution of the local Schur systems), and I would > > like to solve the global Schur complement system in parallel. The global > > Schur complement system is theoretically constructed by putting and adding > > elements of the local systems in certain locations of a global matrix. > > Wrapping this with PETSc Mat/Vec, without the time-intensive assembling, is > > not easy for me as a PETSc-beginner. But I'm curious of the solution you > > propose... > > Did you verify that the Schur complement matrix was properly preallocated before > assembly? This is the likely source of time. You can run with -info and search > for "malloc" in the output. Isn't this using MATDENSE? If that the case - then I think the problem is due to wrong partitioning - causing communiation during MatAssembly(). -info should clearly show the communication part aswell. The fix would be to specify the local partition sizes for this matrix - and not use PETSC_DECIDE. Satish From Amit.Itagi at seagate.com Tue Apr 22 10:06:34 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Tue, 22 Apr 2008 11:06:34 -0400 Subject: Multiple versions of PetSc In-Reply-To: Message-ID: Thanks, Satish and Matt. 
Rgds, Amit "Matthew Knepley" To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: Multiple versions of PetSc 04/22/2008 10:42 AM Please respond to petsc-users at mcs.a nl.gov To build a different configuration of PETSc: 1) cd $PETSC_DIR 2) configure with new options, including --PETSC_ARCH= 3) make PETS_ARCH= 4) Build your code with PETSC_ARCH= Matt On 4/22/08, Amit.Itagi at seagate.com wrote: > > Hi, > > I have a naive question. I have a program that uses a C++, complex version > of PetSc. I need to run a second program that uses a C, real version of > PetSc. For the first program, I have defined the PETSC_ARCH and PETSC_DIR > variables in my .tcshrc . In order to get the second program working, do I > need to install a second version of PetSc ? How do I separate the > environment variables ? > > Thanks > > Rgds, > Amit > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From petsc-maint at mcs.anl.gov Tue Apr 22 10:25:06 2008 From: petsc-maint at mcs.anl.gov (Satish Balay) Date: Tue, 22 Apr 2008 10:25:06 -0500 (CDT) Subject: Selection between C2D and Xeon 3000 for PETSc Sparse solvers In-Reply-To: <428810f20804220643r618753dayb3cae42b9f92b7e7@mail.gmail.com> References: <428810f20804220643r618753dayb3cae42b9f92b7e7@mail.gmail.com> Message-ID: On Tue, 22 Apr 2008, amjad ali wrote: > Hello, > > Please help me out in selecting any one choice of the following: > (Currently I am making a gigabit ethernet cluster of 4 compute nodes > (totaling 8 cores), with each node having) > > (Choice 1) > One Processor: Intel Core2Duo E6750 2.66 GHz Processor, FSB 1333MHz, 4MB L2. > Motherboard: Intel Entry Server Board Intel S3200SHV with intel 3200 Chipset > supporting 1333/1066/800 MHz FSB . > RAM: 2GB DDR2 800MHz ECC System Memory. > > (Choice 2) > One Processor: Intel Xeon 3075 2.66 GHz FSB1333 4MBL2. > Motherboard: Intel Entry Server Board Intel S3200SHV with intel 3200 > Chipset supporting 1333/1066/800 MHz FSB . > RAM: 2GB DDR2 800MHz ECC System Memory. > > Which one system has larger memory-bandwidth/CPU-core? > Any other comment/remark? > My area work deals in sparse matrices. > I near future I would like to add 12 similar compute nodes in the cluster. Based on the above numbers - the memory bandwidth numbers should be the same. And I expect the performance to be the same in both cases. Ideally you would have access to both machines [perhaps from the vendor] - and run streams benchmark on each - to see if there is any difference. Satish From recrusader at gmail.com Tue Apr 22 15:16:32 2008 From: recrusader at gmail.com (Yujie) Date: Tue, 22 Apr 2008 13:16:32 -0700 Subject: about MatMult() Message-ID: <7ff0ee010804221316oa73a9c2s101b225fc3b760bf@mail.gmail.com> the following is about MatMult() in manual. " The parallel matrix can multiply a vector with n local entries, returning a vector with m local entries. That is, to form the product MatMult(Mat A,Vec x,Vec y); the vectors x and y should be generated with VecCreateMPI(MPI Comm comm,n,N,&x); VecCreateMPI(MPI Comm comm,m,M,&y); " I am wondering whether I must create Vector "y" before I call MatMult() regardless of parrellel and sequentail modes? thanks a lot. Regards, Yujie -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bsmith at mcs.anl.gov Tue Apr 22 15:48:47 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 22 Apr 2008 15:48:47 -0500 Subject: about MatMult() In-Reply-To: <7ff0ee010804221316oa73a9c2s101b225fc3b760bf@mail.gmail.com> References: <7ff0ee010804221316oa73a9c2s101b225fc3b760bf@mail.gmail.com> Message-ID: <9940D583-C343-4879-AD0A-7DFD1C73C489@mcs.anl.gov> On Apr 22, 2008, at 3:16 PM, Yujie wrote: > the following is about MatMult() in manual. > " > The parallel matrix can multiply a vector with n local entries, > returning a vector with m local entries. > That is, to form the product > MatMult(Mat A,Vec x,Vec y); > the vectors x and y should be generated with > VecCreateMPI(MPI Comm comm,n,N,&x); > VecCreateMPI(MPI Comm comm,m,M,&y); > " > I am wondering whether I must create Vector "y" before I call > MatMult() regardless of parrellel and sequentail modes? y is the location where the product of A*x is stored; if it is not created before the call to MatMult() the program will crash (or more likely generate an error message). Barry > > thanks a lot. > > Regards, > Yujie From Amit.Itagi at seagate.com Tue Apr 22 20:45:03 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Tue, 22 Apr 2008 21:45:03 -0400 Subject: Multilevel solver Message-ID: Hi, I am trying to implement a multilevel method for an EM problem. The reference is : "Comparison of hierarchical basis functions for efficient multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, IET Sci. Meas. Technol. 2007, 1(1), pp 48-52. Here is the summary: The matrix equation Ax=b is solved using GMRES with a multilevel pre-conditioner. A has a block structure. A11 A12 * x1 = b1 A21 A22 x2 b2 A11 is mxm and A33 is nxn, where m is not equal to n. Step 1 : Solve A11 * e1 = b1 (parallel LU using superLU or MUMPS) Step 2: Solve A22 * e2 =b2-A21*e1 (might either user a SOR solver or a parallel LU) Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) This gives the approximate solution to A11 A12 * e1 = b1 A21 A22 e2 b2 and is used as the pre-conditioner for the GMRES. Which PetSc method can implement this pre-conditioner ? I tried a PCSHELL type PC. With Hong's help, I also got the parallel LU to work withSuperLU/MUMPS. My program runs successfully on multiple processes on a single machine. But when I submit the program over multiple machines, I get a crash in the PCApply routine after several GMRES iterations. I think this has to do with using PCSHELL with GMRES (which is not a good idea). Is there a different way to implement this ? Does this resemble the usage pattern of one of the AMG preconditioners ? Thanks Rgds, Amit From bsmith at mcs.anl.gov Tue Apr 22 21:08:04 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 22 Apr 2008 21:08:04 -0500 Subject: Multilevel solver In-Reply-To: References: Message-ID: Amit, Using a a PCSHELL should be fine (it can be used with GMRES), my guess is there is a memory corruption error somewhere that is causing the crash. 
This could be tracked down with www.valgrind.com Another way to you could implement this is with some very recent additions I made to PCFIELDSPLIT that are in petsc-dev (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) With this you would chose PCSetType(pc,PCFIELDSPLIT PCFieldSplitSetIS(pc,is1 PCFieldSplitSetIS(pc,is2 PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE to use LU on A11 use the command line options -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly and SOR on A22 -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - fieldsplit_1_pc_sor_lits where is the number of iterations you want to use block A22 is1 is the IS that contains the indices for all the vector entries in the 1 block while is2 is all indices in the vector for the 2 block. You can use ISCreateGeneral() to create these. Probably it is easiest just to try this out. Barry On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > Hi, > > I am trying to implement a multilevel method for an EM problem. The > reference is : "Comparison of hierarchical basis functions for > efficient > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > IET > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > Here is the summary: > > The matrix equation Ax=b is solved using GMRES with a multilevel > pre-conditioner. A has a block structure. > > A11 A12 * x1 = b1 > A21 A22 x2 b2 > > A11 is mxm and A33 is nxn, where m is not equal to n. > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > superLU or > MUMPS) > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > a SOR > solver or a parallel LU) > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > This gives the approximate solution to > > A11 A12 * e1 = b1 > A21 A22 e2 b2 > > and is used as the pre-conditioner for the GMRES. > > > Which PetSc method can implement this pre-conditioner ? I tried a > PCSHELL > type PC. With Hong's help, I also got the parallel LU to work > withSuperLU/MUMPS. My program runs successfully on multiple > processes on a > single machine. But when I submit the program over multiple > machines, I get > a crash in the PCApply routine after several GMRES iterations. I > think this > has to do with using PCSHELL with GMRES (which is not a good idea). Is > there a different way to implement this ? Does this resemble the usage > pattern of one of the AMG preconditioners ? > > > Thanks > > Rgds, > Amit > From berend at chalmers.se Wed Apr 23 06:30:36 2008 From: berend at chalmers.se (Berend van Wachem) Date: Wed, 23 Apr 2008 13:30:36 +0200 Subject: valgrind error Message-ID: <480F1DDC.2040000@chalmers.se> Dear Petsc-Team, My program based upon PETSc seems to work fine, but I get a long list of errors with valgrind, see below. Does anyone have an idea what is going wrong? 
==19756== Conditional jump or move depends on uninitialised value(s) ==19756== at 0x83EDC11: MatLUFactorNumeric_SeqAIJ (aijfact.c:529) ==19756== by 0x8376CEB: MatLUFactorNumeric (matrix.c:2227) ==19756== by 0x826EFFE: PCSetUp_ILU (ilu.c:564) ==19756== by 0x82EAAF2: PCSetUp (precon.c:787) ==19756== by 0x8283262: KSPSetUp (itfunc.c:234) ==19756== by 0x8283F63: KSPSolve (itfunc.c:347) ==19756== by 0x80A7235: SolveMatrix (solvematrix.c:61) ==19756== by 0x814F120: main (main.c:280) ==19756== ==19756== Conditional jump or move depends on uninitialised value(s) ==19756== at 0x83EDC5F: MatLUFactorNumeric_SeqAIJ (aijfact.c:529) ==19756== by 0x8376CEB: MatLUFactorNumeric (matrix.c:2227) ==19756== by 0x826EFFE: PCSetUp_ILU (ilu.c:564) ==19756== by 0x82EAAF2: PCSetUp (precon.c:787) ==19756== by 0x8283262: KSPSetUp (itfunc.c:234) ==19756== by 0x8283F63: KSPSolve (itfunc.c:347) ==19756== by 0x80A7235: SolveMatrix (solvematrix.c:61) ==19756== by 0x814F120: main (main.c:280) ==19756== ==19756== Conditional jump or move depends on uninitialised value(s) ==19756== at 0x83ED8E0: MatLUFactorNumeric_SeqAIJ (aijfact.c:523) ==19756== by 0x8376CEB: MatLUFactorNumeric (matrix.c:2227) ==19756== by 0x826EFFE: PCSetUp_ILU (ilu.c:564) ==19756== by 0x82EAAF2: PCSetUp (precon.c:787) ==19756== by 0x8283262: KSPSetUp (itfunc.c:234) ==19756== by 0x8283F63: KSPSolve (itfunc.c:347) ==19756== by 0x80A7235: SolveMatrix (solvematrix.c:61) ==19756== by 0x814F120: main (main.c:280) ==19756== ==19756== Conditional jump or move depends on uninitialised value(s) ==19756== at 0x83ED674: MatLUFactorNumeric_SeqAIJ (aijfact.c:504) ==19756== by 0x8376CEB: MatLUFactorNumeric (matrix.c:2227) ==19756== by 0x826EFFE: PCSetUp_ILU (ilu.c:564) ==19756== by 0x82EAAF2: PCSetUp (precon.c:787) ==19756== by 0x8283262: KSPSetUp (itfunc.c:234) ==19756== by 0x8283F63: KSPSolve (itfunc.c:347) ==19756== by 0x80A7235: SolveMatrix (solvematrix.c:61) ==19756== by 0x814F120: main (main.c:280) ==19756== ==19756== Conditional jump or move depends on uninitialised value(s) ==19756== at 0x88D8214: dnrm2_ (dnrm2.f:58) ==19756== by 0x8695C91: VecNorm_MPI (pvec2.c:79) ==19756== by 0x866B95F: VecNorm (rvector.c:162) ==19756== by 0x829C1B6: KSPSolve_BCGS (bcgs.c:45) ==19756== by 0x8284523: KSPSolve (itfunc.c:379) ==19756== by 0x80A7235: SolveMatrix (solvematrix.c:61) ==19756== by 0x814F120: main (main.c:280) From Amit.Itagi at seagate.com Wed Apr 23 08:11:21 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 23 Apr 2008 09:11:21 -0400 Subject: Multilevel solver In-Reply-To: Message-ID: Barry, This looks interesting. I will give it a shot. Thanks Rgds, Amit Barry Smith To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: Multilevel solver 04/22/2008 10:08 PM Please respond to petsc-users at mcs.a nl.gov Amit, Using a a PCSHELL should be fine (it can be used with GMRES), my guess is there is a memory corruption error somewhere that is causing the crash. 
This could be tracked down with www.valgrind.com Another way to you could implement this is with some very recent additions I made to PCFIELDSPLIT that are in petsc-dev (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) With this you would chose PCSetType(pc,PCFIELDSPLIT PCFieldSplitSetIS(pc,is1 PCFieldSplitSetIS(pc,is2 PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE to use LU on A11 use the command line options -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly and SOR on A22 -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - fieldsplit_1_pc_sor_lits where is the number of iterations you want to use block A22 is1 is the IS that contains the indices for all the vector entries in the 1 block while is2 is all indices in the vector for the 2 block. You can use ISCreateGeneral() to create these. Probably it is easiest just to try this out. Barry On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > Hi, > > I am trying to implement a multilevel method for an EM problem. The > reference is : "Comparison of hierarchical basis functions for > efficient > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > IET > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > Here is the summary: > > The matrix equation Ax=b is solved using GMRES with a multilevel > pre-conditioner. A has a block structure. > > A11 A12 * x1 = b1 > A21 A22 x2 b2 > > A11 is mxm and A33 is nxn, where m is not equal to n. > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > superLU or > MUMPS) > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > a SOR > solver or a parallel LU) > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > This gives the approximate solution to > > A11 A12 * e1 = b1 > A21 A22 e2 b2 > > and is used as the pre-conditioner for the GMRES. > > > Which PetSc method can implement this pre-conditioner ? I tried a > PCSHELL > type PC. With Hong's help, I also got the parallel LU to work > withSuperLU/MUMPS. My program runs successfully on multiple > processes on a > single machine. But when I submit the program over multiple > machines, I get > a crash in the PCApply routine after several GMRES iterations. I > think this > has to do with using PCSHELL with GMRES (which is not a good idea). Is > there a different way to implement this ? Does this resemble the usage > pattern of one of the AMG preconditioners ? > > > Thanks > > Rgds, > Amit > From petsc-maint at mcs.anl.gov Wed Apr 23 08:54:32 2008 From: petsc-maint at mcs.anl.gov (Satish Balay) Date: Wed, 23 Apr 2008 08:54:32 -0500 (CDT) Subject: [PETSC #17608] Re: general question on speed using quad core Xeons In-Reply-To: References: <48054602.9040200@gmail.com> <8A55CDB3-B1D9-421A-9CD4-713C6625D2B5@mcs.anl.gov> <97750812-F5EC-4E26-A221-F0A14C0CD9A7@mcs.anl.gov> <428810f20804202234x3dc79246j88e3afccff7bec5b@mail.gmail.com> Message-ID: On Mon, 21 Apr 2008, Satish Balay wrote: > For eg - On intel Xeon machine with DDR2-800 - you have [othe memory bus side]: > bandwidth = 2(banks)* 2(ddr)* 8(bytes bus) * 800 MHz/sec * = 25.6GByte/sec My math was incorrect here.. DDR2-800 = 6.4Gb/s [its 2(ddr)* 400MHz/sec * 8bytes ] So this machine has 4 memory banks. i.e the above is: > bandwidth = 4(banks)* 2(ddr)* 400 MHz/sec* 8(bytes bus) * = 25.6GByte/sec Satish From Amit.Itagi at seagate.com Wed Apr 23 09:07:23 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 23 Apr 2008 10:07:23 -0400 Subject: Multilevel solver Message-ID: An HTML attachment was scrubbed... 
URL: From knepley at gmail.com Wed Apr 23 09:23:09 2008 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 23 Apr 2008 09:23:09 -0500 Subject: Multilevel solver In-Reply-To: References: Message-ID: On Wed, Apr 23, 2008 at 9:07 AM, wrote: > Barry, > > This is what valgrind gives me. Any idea ? What is confusing me is that I > get the crash after several GMRES iterations. 1) Always start with the simplest case, meaning serial 2) When you run valgrind in parallel, you need --trace-children=yes, since MPI usually spawns other processes 3) It is possible to corrupt memory so badly that valgrind crashes like this, but it is hard. Matt > [2]PETSC ERROR: > ------------------------------------------------------------------------ > [3]PETSC ERROR: Caught signal number 15 Terminate: Somet process (or the > batch system) has told this process to end > [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger > ------------------------------------------------------------------------ > [3]PETSC ERROR: or see > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[3]PETSC > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > find memory corruption errors > [2]PETSC ERROR: Caught signal number 1 Hang up: Some other process (or the > batch system) has told this process to end > [2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger > [2]PETSC ERROR: or see > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[2]PETSC > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > find memory corruption errors > [0]PETSC ERROR: > ------------------------------------------------------------------------ > [0]PETSC ERROR: Caught signal number 15 Terminate: Somet process (or the > batch system) has told this process to end > [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger > [0]PETSC ERROR: or see > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[0]PETSC > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > find memory corruption errors > [3]PETSC ERROR: likely location of problem given in stack below > [3]PETSC ERROR: --------------------- Stack Frames > ------------------------------------ > ------------------------------------------------------------------------ > [1]PETSC ERROR: [2]PETSC ERROR: Caught signal number 15 Terminate: Somet > process (or the batch system) has told this process to end > likely location of problem given in stack below > [1]PETSC ERROR: [2]PETSC ERROR: Try option -start_in_debugger or > -on_error_attach_debugger > --------------------- Stack Frames ------------------------------------ > [1]PETSC ERROR: or see > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[1]PETSC > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > find memory corruption errors > [0]PETSC ERROR: likely location of problem given in stack below > [0]PETSC ERROR: --------------------- Stack Frames > ------------------------------------ > [1]PETSC ERROR: likely location of problem given in stack below > [1]PETSC ERROR: --------------------- Stack Frames > ------------------------------------ > [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available, > [3]PETSC ERROR: INSTEAD the line number of the start of the function > [3]PETSC ERROR: is given. 
> [2]PETSC ERROR: Note: The EXACT line numbers in the stack are not available, > [3]PETSC ERROR: [2]PETSC ERROR: INSTEAD the line number of the start > of the function > [2]PETSC ERROR: is given. > [3] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > [3]PETSC ERROR: [3] PCApply line 346 src/ksp/pc/interface/precon.c > [3]PETSC ERROR: [3] PCApplyBAorAB line 539 src/ksp/pc/interface/precon.c > [3]PETSC ERROR: [3] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > [2]PETSC ERROR: [0]PETSC ERROR: Note: The EXACT line numbers in the stack > are not available, > [2] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > [0]PETSC ERROR: INSTEAD the line number of the start of the function > [2]PETSC ERROR: [2] PCApply line 346 src/ksp/pc/interface/precon.c > [0]PETSC ERROR: is given. > [2]PETSC ERROR: [2] PCApplyBAorAB line 539 src/ksp/pc/interface/precon.c > [2]PETSC ERROR: [2] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > [1]PETSC ERROR: [0]PETSC ERROR: Note: The EXACT line numbers in the stack > are not available, > [0] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > [1]PETSC ERROR: INSTEAD the line number of the start of the function > [0]PETSC ERROR: [0] PCApply line 346 src/ksp/pc/interface/precon.c > [0]PETSC ERROR: [1]PETSC ERROR: is given. > [0] PCApplyBAorAB line 539 src/ksp/pc/interface/precon.c > [0]PETSC ERROR: [0] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > [1]PETSC ERROR: [1] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > [1]PETSC ERROR: [1] PCApply line 346 src/ksp/pc/interface/precon.c > [1]PETSC ERROR: [3]PETSC ERROR: [1] PCApplyBAorAB line 539 > src/ksp/pc/interface/precon.c > --------------------- Error Message ------------------------------------ > [1]PETSC ERROR: [1] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > [2]PETSC ERROR: --------------------- Error Message > ------------------------------------ > [0]PETSC ERROR: --------------------- Error Message > ------------------------------------ > [3]PETSC ERROR: Signal received! > [3]PETSC ERROR: > ------------------------------------------------------------------------ > [3]PETSC ERROR: Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 > CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b > [3]PETSC ERROR: See docs/changes/index.html for recent updates. > [3]PETSC ERROR: See docs/faq.html for hints about trouble shooting. > > Thanks > > Rgds, > Amit > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From balay at mcs.anl.gov Wed Apr 23 09:40:39 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 23 Apr 2008 09:40:39 -0500 (CDT) Subject: Multilevel solver In-Reply-To: References: Message-ID: If using valgrind - I sugest using MPICH2 [installed with options --enable-g=meminit --enable-fast] And valgrind can be invoked with: mpiexec -np 2 valgrind --tool=memcheck -q ./executable -exectuable-options Satish On Wed, 23 Apr 2008, Matthew Knepley wrote: > On Wed, Apr 23, 2008 at 9:07 AM, wrote: > > Barry, > > > > This is what valgrind gives me. Any idea ? What is confusing me is that I > > get the crash after several GMRES iterations. > > 1) Always start with the simplest case, meaning serial > > 2) When you run valgrind in parallel, you need --trace-children=yes, since > MPI usually spawns other processes > > 3) It is possible to corrupt memory so badly that valgrind crashes > like this, but it is hard. 
> > Matt > > > [2]PETSC ERROR: > > ------------------------------------------------------------------------ > > [3]PETSC ERROR: Caught signal number 15 Terminate: Somet process (or the > > batch system) has told this process to end > > [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger > > ------------------------------------------------------------------------ > > [3]PETSC ERROR: or see > > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[3]PETSC > > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > > find memory corruption errors > > [2]PETSC ERROR: Caught signal number 1 Hang up: Some other process (or the > > batch system) has told this process to end > > [2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger > > [2]PETSC ERROR: or see > > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[2]PETSC > > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > > find memory corruption errors > > [0]PETSC ERROR: > > ------------------------------------------------------------------------ > > [0]PETSC ERROR: Caught signal number 15 Terminate: Somet process (or the > > batch system) has told this process to end > > [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger > > [0]PETSC ERROR: or see > > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[0]PETSC > > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > > find memory corruption errors > > [3]PETSC ERROR: likely location of problem given in stack below > > [3]PETSC ERROR: --------------------- Stack Frames > > ------------------------------------ > > ------------------------------------------------------------------------ > > [1]PETSC ERROR: [2]PETSC ERROR: Caught signal number 15 Terminate: Somet > > process (or the batch system) has told this process to end > > likely location of problem given in stack below > > [1]PETSC ERROR: [2]PETSC ERROR: Try option -start_in_debugger or > > -on_error_attach_debugger > > --------------------- Stack Frames ------------------------------------ > > [1]PETSC ERROR: or see > > http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[1]PETSC > > ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to > > find memory corruption errors > > [0]PETSC ERROR: likely location of problem given in stack below > > [0]PETSC ERROR: --------------------- Stack Frames > > ------------------------------------ > > [1]PETSC ERROR: likely location of problem given in stack below > > [1]PETSC ERROR: --------------------- Stack Frames > > ------------------------------------ > > [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available, > > [3]PETSC ERROR: INSTEAD the line number of the start of the function > > [3]PETSC ERROR: is given. > > [2]PETSC ERROR: Note: The EXACT line numbers in the stack are not available, > > [3]PETSC ERROR: [2]PETSC ERROR: INSTEAD the line number of the start > > of the function > > [2]PETSC ERROR: is given. 
> > [3] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > > [3]PETSC ERROR: [3] PCApply line 346 src/ksp/pc/interface/precon.c > > [3]PETSC ERROR: [3] PCApplyBAorAB line 539 src/ksp/pc/interface/precon.c > > [3]PETSC ERROR: [3] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > > [2]PETSC ERROR: [0]PETSC ERROR: Note: The EXACT line numbers in the stack > > are not available, > > [2] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > > [0]PETSC ERROR: INSTEAD the line number of the start of the function > > [2]PETSC ERROR: [2] PCApply line 346 src/ksp/pc/interface/precon.c > > [0]PETSC ERROR: is given. > > [2]PETSC ERROR: [2] PCApplyBAorAB line 539 src/ksp/pc/interface/precon.c > > [2]PETSC ERROR: [2] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > > [1]PETSC ERROR: [0]PETSC ERROR: Note: The EXACT line numbers in the stack > > are not available, > > [0] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > > [1]PETSC ERROR: INSTEAD the line number of the start of the function > > [0]PETSC ERROR: [0] PCApply line 346 src/ksp/pc/interface/precon.c > > [0]PETSC ERROR: [1]PETSC ERROR: is given. > > [0] PCApplyBAorAB line 539 src/ksp/pc/interface/precon.c > > [0]PETSC ERROR: [0] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > > [1]PETSC ERROR: [1] PCApply_Shell line 122 src/ksp/pc/impls/shell/shellpc.c > > [1]PETSC ERROR: [1] PCApply line 346 src/ksp/pc/interface/precon.c > > [1]PETSC ERROR: [3]PETSC ERROR: [1] PCApplyBAorAB line 539 > > src/ksp/pc/interface/precon.c > > --------------------- Error Message ------------------------------------ > > [1]PETSC ERROR: [1] GMREScycle line 133 src/ksp/ksp/impls/gmres/gmres.c > > [2]PETSC ERROR: --------------------- Error Message > > ------------------------------------ > > [0]PETSC ERROR: --------------------- Error Message > > ------------------------------------ > > [3]PETSC ERROR: Signal received! > > [3]PETSC ERROR: > > ------------------------------------------------------------------------ > > [3]PETSC ERROR: Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 > > CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b > > [3]PETSC ERROR: See docs/changes/index.html for recent updates. > > [3]PETSC ERROR: See docs/faq.html for hints about trouble shooting. > > > > Thanks > > > > Rgds, > > Amit > > > > > > From Amit.Itagi at seagate.com Wed Apr 23 13:32:11 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 23 Apr 2008 14:32:11 -0400 Subject: Multilevel solver In-Reply-To: Message-ID: Barry, Is the installation of petsc-dev different from the installation of the 2.3.3 release ? I ran the config. But the folder tree seems to be different. Hence, make is giving problems. Amit Barry Smith To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: Multilevel solver 04/22/2008 10:08 PM Please respond to petsc-users at mcs.a nl.gov Amit, Using a a PCSHELL should be fine (it can be used with GMRES), my guess is there is a memory corruption error somewhere that is causing the crash. 
This could be tracked down with www.valgrind.com Another way to you could implement this is with some very recent additions I made to PCFIELDSPLIT that are in petsc-dev (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) With this you would chose PCSetType(pc,PCFIELDSPLIT PCFieldSplitSetIS(pc,is1 PCFieldSplitSetIS(pc,is2 PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE to use LU on A11 use the command line options -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly and SOR on A22 -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - fieldsplit_1_pc_sor_lits where is the number of iterations you want to use block A22 is1 is the IS that contains the indices for all the vector entries in the 1 block while is2 is all indices in the vector for the 2 block. You can use ISCreateGeneral() to create these. Probably it is easiest just to try this out. Barry On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > Hi, > > I am trying to implement a multilevel method for an EM problem. The > reference is : "Comparison of hierarchical basis functions for > efficient > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > IET > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > Here is the summary: > > The matrix equation Ax=b is solved using GMRES with a multilevel > pre-conditioner. A has a block structure. > > A11 A12 * x1 = b1 > A21 A22 x2 b2 > > A11 is mxm and A33 is nxn, where m is not equal to n. > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > superLU or > MUMPS) > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > a SOR > solver or a parallel LU) > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > This gives the approximate solution to > > A11 A12 * e1 = b1 > A21 A22 e2 b2 > > and is used as the pre-conditioner for the GMRES. > > > Which PetSc method can implement this pre-conditioner ? I tried a > PCSHELL > type PC. With Hong's help, I also got the parallel LU to work > withSuperLU/MUMPS. My program runs successfully on multiple > processes on a > single machine. But when I submit the program over multiple > machines, I get > a crash in the PCApply routine after several GMRES iterations. I > think this > has to do with using PCSHELL with GMRES (which is not a good idea). Is > there a different way to implement this ? Does this resemble the usage > pattern of one of the AMG preconditioners ? > > > Thanks > > Rgds, > Amit > From knepley at gmail.com Wed Apr 23 13:43:11 2008 From: knepley at gmail.com (Matthew Knepley) Date: Wed, 23 Apr 2008 13:43:11 -0500 Subject: Multilevel solver In-Reply-To: References: Message-ID: On Wed, Apr 23, 2008 at 1:32 PM, wrote: > Barry, > > Is the installation of petsc-dev different from the installation of the > 2.3.3 release ? I ran the config. But the folder tree seems to be > different. Hence, make is giving problems. 1) Always always send the error log. I cannot tell anything from the description "problems". 2) Some things have moved, but of course, make will work with the new organization. Matt > Amit > > Barry Smith > ov> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: Multilevel solver > > > 04/22/2008 10:08 > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > Amit, > > Using a a PCSHELL should be fine (it can be used with GMRES), > my guess is there is a memory corruption error somewhere that is > causing the crash. 
This could be tracked down with www.valgrind.com > > Another way to you could implement this is with some very recent > additions I made to PCFIELDSPLIT that are in petsc-dev > (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) > With this you would chose > PCSetType(pc,PCFIELDSPLIT > PCFieldSplitSetIS(pc,is1 > PCFieldSplitSetIS(pc,is2 > PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > to use LU on A11 use the command line options > -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly > and SOR on A22 > -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - > fieldsplit_1_pc_sor_lits where > is the number of iterations you want to use block A22 > > is1 is the IS that contains the indices for all the vector entries in > the 1 block while is2 is all indices in the > vector for the 2 block. You can use ISCreateGeneral() to create these. > > Probably it is easiest just to try this out. > > Barry > > > On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > > > > Hi, > > > > I am trying to implement a multilevel method for an EM problem. The > > reference is : "Comparison of hierarchical basis functions for > > efficient > > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > > IET > > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > > > Here is the summary: > > > > The matrix equation Ax=b is solved using GMRES with a multilevel > > pre-conditioner. A has a block structure. > > > > A11 A12 * x1 = b1 > > A21 A22 x2 b2 > > > > A11 is mxm and A33 is nxn, where m is not equal to n. > > > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > > superLU or > > MUMPS) > > > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > > a SOR > > solver or a parallel LU) > > > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > > > This gives the approximate solution to > > > > A11 A12 * e1 = b1 > > A21 A22 e2 b2 > > > > and is used as the pre-conditioner for the GMRES. > > > > > > Which PetSc method can implement this pre-conditioner ? I tried a > > PCSHELL > > type PC. With Hong's help, I also got the parallel LU to work > > withSuperLU/MUMPS. My program runs successfully on multiple > > processes on a > > single machine. But when I submit the program over multiple > > machines, I get > > a crash in the PCApply routine after several GMRES iterations. I > > think this > > has to do with using PCSHELL with GMRES (which is not a good idea). Is > > there a different way to implement this ? Does this resemble the usage > > pattern of one of the AMG preconditioners ? > > > > > > Thanks > > > > Rgds, > > Amit > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From Amit.Itagi at seagate.com Wed Apr 23 15:05:04 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 23 Apr 2008 16:05:04 -0400 Subject: Multilevel solver In-Reply-To: Message-ID: Here is my make log. ========================================== See documentation/faq.html and documentation/bugreporting.html for help with installation problems. 
Please send EVERYTHING printed out below when reporting problems To subscribe to the PETSc announcement list, send mail to majordomo at mcs.anl.gov with the message: subscribe petsc-announce To subscribe to the PETSc users mailing list, send mail to majordomo at mcs.anl.gov with the message: subscribe petsc-users ========================================== On Wed Apr 23 15:37:17 EDT 2008 on tabla Machine characteristics: Linux tabla 2.6.22-14-generic #1 SMP Tue Feb 12 07:42:25 UTC 2008 i686 GNU/Linux ----------------------------------------- Using PETSc directory: /home/amit/programs/ParEM/petsc-dev Using PETSc arch: linux-gnu-c-debug ----------------------------------------- PETSC_VERSION_RELEASE 0 PETSC_VERSION_MAJOR 2 PETSC_VERSION_MINOR 3 PETSC_VERSION_SUBMINOR 3 PETSC_VERSION_PATCH 12 PETSC_VERSION_DATE "May, 23, 2007" PETSC_VERSION_PATCH_DATE "unknown" PETSC_VERSION_HG "unknown" ----------------------------------------- Using configure Options: --PETSC_ARCH=linux-gnu-c-debug --with-scalar-type=complex --with-debugging=yes --with-clanguage=cxx --with-mpi=1 --download-mpich=1 --with-metis=1 --download-metis=1 --with-parmetis=1 --download-parmetis=1 --with-superlu_dist=1 --download-superlu_dist=1 --with-mumps=1 --download-blacs=1 --download-scalapack=1 --download-mumps=1 COPTFLAGS="-O3 -march=p4 -mtune=p4 -ffast-math -malign-double -funroll-loops -pipe -fomit-frame-pointer -finline-functions -msse2" CXXOPTFLAGS="-O3 -march=p4 -mtune=p4 -ffast-math -malign-double -funroll-loops -pipe -fomit-frame-pointer -finline-functions -msse2" --with-shared=0 Using configuration flags: #define INCLUDED_PETSCCONF_H #define IS_COLORING_MAX 65535 #define STDC_HEADERS 1 #define MPIU_COLORING_VALUE MPI_UNSIGNED_SHORT #define PETSC_HAVE_SUPERLU_DIST 1 #define PETSC_STATIC_INLINE static inline #define PETSC_HAVE_BLACS 1 #define PETSC_HAVE_MUMPS 1 #define PETSC_DIR_SEPARATOR '/' #define PETSC_HAVE_BLASLAPACK 1 #define PETSC_PATH_SEPARATOR ':' #define PETSC_REPLACE_DIR_SEPARATOR '\\' #define PETSC_HAVE_DOUBLE_ALIGN_MALLOC 1 #define PETSC_RESTRICT __restrict__ #define PETSC_HAVE_X11 1 #define PETSC_HAVE_SOWING 1 #define PETSC_HAVE_SCALAPACK 1 #define PETSC_HAVE_MPI 1 #define PETSC_USE_SOCKET_VIEWER 1 #define PETSC_HAVE_PARMETIS 1 #define PETSC_HAVE_C2HTML 1 #define PETSC_HAVE_FORTRAN 1 #define PETSC_HAVE_STRING_H 1 #define PETSC_HAVE_SYS_TYPES_H 1 #define PETSC_HAVE_ENDIAN_H 1 #define PETSC_HAVE_SYS_PROCFS_H 1 #define PETSC_HAVE_LINUX_KERNEL_H 1 #define PETSC_HAVE_TIME_H 1 #define PETSC_HAVE_MATH_H 1 #define PETSC_HAVE_STDLIB_H 1 #define PETSC_HAVE_SYS_PARAM_H 1 #define PETSC_HAVE_SYS_SOCKET_H 1 #define PETSC_HAVE_UNISTD_H 1 #define PETSC_HAVE_SYS_WAIT_H 1 #define PETSC_HAVE_LIMITS_H 1 #define PETSC_HAVE_SEARCH_H 1 #define PETSC_HAVE_NETINET_IN_H 1 #define PETSC_HAVE_FLOAT_H 1 #define PETSC_HAVE_SYS_SYSINFO_H 1 #define PETSC_HAVE_SYS_RESOURCE_H 1 #define PETSC_HAVE_SYS_TIMES_H 1 #define PETSC_HAVE_NETDB_H 1 #define PETSC_HAVE_MALLOC_H 1 #define PETSC_HAVE_PWD_H 1 #define PETSC_HAVE_FCNTL_H 1 #define PETSC_HAVE_STRINGS_H 1 #define PETSC_HAVE_MEMORY_H 1 #define PETSC_TIME_WITH_SYS_TIME 1 #define PETSC_HAVE_SYS_TIME_H 1 #define PETSC_HAVE_SYS_UTSNAME_H 1 #define PETSC_USING_F90 1 #define PETSC_PRINTF_FORMAT_CHECK(A,B) __attribute__((format (printf, A, B))) #define PETSC_C_STATIC_INLINE static inline #define PETSC_HAVE_FORTRAN_UNDERSCORE 1 #define PETSC_HAVE_CXX_NAMESPACE 1 #define PETSC_C_RESTRICT __restrict__ #define PETSC_USE_F90_SRC_IMPL 1 #define PETSC_CXX_RESTRICT __restrict__ #define PETSC_CXX_STATIC_INLINE 
static inline #define PETSC_HAVE_LIBBLAS 1 #define PETSC_HAVE_LIBDMUMPS 1 #define PETSC_HAVE_LIBZMUMPS 1 #define PETSC_HAVE_LIBSCALAPACK 1 #define PETSC_HAVE_LIBM 1 #define PETSC_HAVE_LIBMETIS 1 #define PETSC_HAVE_LIBLAPACK 1 #define PETSC_HAVE_LIBCMUMPS 1 #define PETSC_HAVE_LIBSMUMPS 1 #define PETSC_HAVE_LIBGCC_S 1 #define PETSC_HAVE_LIBPORD 1 #define PETSC_HAVE_LIBGFORTRANBEGIN 1 #define PETSC_HAVE_ERF 1 #define PETSC_HAVE_LIBSUPERLU_DIST_2 1 #define PETSC_HAVE_LIBBLACS 1 #define PETSC_HAVE_LIBPARMETIS 1 #define PETSC_HAVE_LIBGFORTRAN 1 #define PETSC_ARCH_NAME "linux-gnu-c-debug" #define PETSC_ARCH linux #define PETSC_DIR /home/amit/programs/ParEM/petsc-dev #define PETSC_CLANGUAGE_CXX 1 #define PETSC_USE_ERRORCHECKING 1 #define PETSC_MISSING_DREAL 1 #define PETSC_SIZEOF_MPI_COMM 4 #define PETSC_BITS_PER_BYTE 8 #define PETSC_SIZEOF_MPI_FINT 4 #define PETSC_SIZEOF_VOID_P 4 #define PETSC_RETSIGTYPE void #define PETSC_HAVE_CXX_COMPLEX 1 #define PETSC_SIZEOF_LONG 4 #define PETSC_USE_FORTRANKIND 1 #define PETSC_SIZEOF_SIZE_T 4 #define PETSC_SIZEOF_CHAR 1 #define PETSC_SIZEOF_DOUBLE 8 #define PETSC_SIZEOF_FLOAT 4 #define PETSC_HAVE_C99_COMPLEX 1 #define PETSC_SIZEOF_INT 4 #define PETSC_SIZEOF_LONG_LONG 8 #define PETSC_SIZEOF_SHORT 2 #define PETSC_HAVE_STRCASECMP 1 #define PETSC_HAVE_ISNAN 1 #define PETSC_HAVE_POPEN 1 #define PETSC_HAVE_SIGSET 1 #define PETSC_HAVE_GETWD 1 #define PETSC_HAVE_TIMES 1 #define PETSC_HAVE_SNPRINTF 1 #define PETSC_HAVE_GETPWUID 1 #define PETSC_HAVE_ISINF 1 #define PETSC_HAVE_GETHOSTBYNAME 1 #define PETSC_HAVE_SLEEP 1 #define PETSC_HAVE_FORK 1 #define PETSC_HAVE_RAND 1 #define PETSC_HAVE_GETTIMEOFDAY 1 #define PETSC_HAVE_UNAME 1 #define PETSC_HAVE_GETHOSTNAME 1 #define PETSC_HAVE_MKSTEMP 1 #define PETSC_HAVE_SIGACTION 1 #define PETSC_HAVE_DRAND48 1 #define PETSC_HAVE_VA_COPY 1 #define PETSC_HAVE_CLOCK 1 #define PETSC_HAVE_ACCESS 1 #define PETSC_HAVE_SIGNAL 1 #define PETSC_HAVE_GETRUSAGE 1 #define PETSC_HAVE_MEMALIGN 1 #define PETSC_HAVE_GETDOMAINNAME 1 #define PETSC_HAVE_TIME 1 #define PETSC_HAVE_LSEEK 1 #define PETSC_HAVE_SOCKET 1 #define PETSC_HAVE_SYSINFO 1 #define PETSC_HAVE_READLINK 1 #define PETSC_HAVE_REALPATH 1 #define PETSC_HAVE_MEMMOVE 1 #define PETSC_HAVE__GFORTRAN_IARGC 1 #define PETSC_SIGNAL_CAST #define PETSC_HAVE_GETCWD 1 #define PETSC_HAVE_VPRINTF 1 #define PETSC_HAVE_BZERO 1 #define PETSC_HAVE_GETPAGESIZE 1 #define PETSC_USE_COMPLEX 1 #define PETSC_USE_GDB_DEBUGGER 1 #define PETSC_HAVE_GFORTRAN_IARGC 1 #define PETSC_USE_DEBUG 1 #define PETSC_USE_INFO 1 #define PETSC_USE_LOG 1 #define PETSC_IS_COLOR_VALUE_TYPE short #define PETSC_USE_CTABLE 1 #define PETSC_USE_PROC_FOR_SIZE 1 #define PETSC_HAVE_MPI_COMM_C2F 1 #define PETSC_HAVE_MPI_COMM_F2C 1 #define PETSC_HAVE_MPI_FINT 1 #define PETSC_HAVE_MPI_F90MODULE 1 #define PETSC_HAVE_MPI_ALLTOALLW 1 #define PETSC_HAVE_MPI_COMM_SPAWN 1 #define PETSC_HAVE_MPI_WIN_CREATE 1 #define PETSC_HAVE_MPI_FINALIZED 1 #define HAVE_GZIP 1 #define PETSC_BLASLAPACK_UNDERSCORE 1 ----------------------------------------- Using include paths: -I/home/amit/programs/ParEM/petsc-dev -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include -I/home/amit/programs/ParEM/petsc-dev/include -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include ------------------------------------------ Using C/C++ compiler: 
/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/bin/mpicxx C/C++ Compiler version: Using Fortran compiler: gfortran -g Fortran Compiler version: ----------------------------------------- Using C/C++ linker: Using Fortran linker: /home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/bin/mpif90 ----------------------------------------- Using libraries: -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -lX11 -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lsuperlu_dist_2.2 -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lparmetis -lmetis -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lcmumps -ldmumps -lsmumps -lzmumps -lpord -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lscalapack -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lblacs -llapack -lblas -L/usr/lib/gcc/i486-linux-gnu/4.1.3 -L/lib -lgcc_s -lgfortranbegin -lgfortran -lm -L/usr/lib/gcc/i486-linux-gnu/4.2.1 -lm -lstdc++ -lstdc++ -lgcc_s ------------------------------------------ Using mpiexec: /home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/bin/mpiexec ========================================== /bin/rm -f -f /home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib/libpetsc*.* BEGINNING TO COMPILE LIBRARIES IN ALL DIRECTORIES ========================================= libfast in: /home/amit/programs/ParEM/petsc-dev/src libfast in: /home/amit/programs/ParEM/petsc-dev/src/inline libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/socket libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/socket/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/ascii libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/ascii/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/ascii/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/string libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/string/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/draw libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/draw/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/draw/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/vu libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/examples/tutorials libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/sys/viewer/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls/x libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls/x/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls/ps libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/utils/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/error libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/error/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/error/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/dll libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/dll/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/fileio libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/fileio/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/fileio/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/memory libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/memory/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/memory/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/objects libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/objects/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/objects/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/time libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/time/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/utils/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/impls/rand libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/impls/rand48 libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/bag 
libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/bag/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/bag/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/verbose libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/verbose/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/verbose/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-src libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-src/fsrc make[8]: *** No rule to make target `libf'. Stop. libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-mod make[7]: *** No rule to make target `petscmod.o'. Stop. make[6]: *** [buildmod] Error 2 make[5]: [libfast] Error 2 (ignored) libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/utils/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/seq libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/seq/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/seq/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/mpi libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/mpi/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/mpi/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/shared libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/shared/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/general libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/general/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/stride libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/stride/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/block libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/block/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/block/ftn-custom libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/impls/constant libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/impls/string libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/seq libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/seq/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/mpi libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/mpi/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/mpi/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/csrperm libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/crl libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/superlu_dist libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/mumps libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/csrperm libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/crl libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/aij libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/shell libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/shell/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/shell/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/seq libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/seq/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/seq/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/mpi libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/mpi/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/mpi/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/adj libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/adj/mpi libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/adj/mpi/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/maij libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/is libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/is/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/seq libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/seq/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/seq/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/mpi libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/mpi/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/normal libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/normal/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/lrc libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/lrc/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/scatter libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/scatter/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/blockmat libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/blockmat/seq libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/composite libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/composite/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/mffd libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/mffd/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/mffd/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/utils/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/matfd libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/matfd/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/matfd/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/impls/pmetis libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/impls/pmetis/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/order libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/order/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/order/ftn-custom libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/mat/color libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/color/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/jacobi libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/jacobi/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/none libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/sor libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/sor/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/shell libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/shell/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/bjacobi libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/bjacobi/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/bjacobi/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/dmmg libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/dmmg/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/eisens libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/eisens/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asm libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asm/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asm/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/ksp libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/ksp/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/composite libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/composite/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/composite/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/redundant libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/redundant/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/is libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/is/nn libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/pbjacobi libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mat libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/fieldsplit libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/fieldsplit/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/lu libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/lu/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/lu/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/ilu libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/ilu/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/icc libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/cholesky libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/cholesky/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/galerkin libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/galerkin/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/openmp libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asa libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/cp libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface iguess.c: In function ???PetscErrorCode KSPFischerGuessFormGuess_Method1(KSPFischerGuess_Method1*, _p_Vec*, _p_Vec*)???: iguess.c:79: warning: cannot pass objects of non-POD type ???struct std::complex??? through ???...???; call will abort at runtime iguess.c: In function ???PetscErrorCode KSPFischerGuessFormGuess_Method2(KSPFischerGuess_Method2*, _p_Vec*, _p_Vec*)???: iguess.c:198: warning: cannot pass objects of non-POD type ???struct std::complex??? through ???...???; call will abort at runtime libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cr libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bcgs libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bcgsl libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bcgsl/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/cgne libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/gltr libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/gltr/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/nash libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/nash/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/stcg libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/stcg/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cgs libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/lgmres libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/fgmres libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/fgmres/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/fgmres/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cheby libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cheby/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/rich libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/rich/ftn-autolibfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/lsqr libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/preonly libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/tcqmr libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/tfqmr libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/qcg libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/qcg/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bicg libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/minres libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/symmlq libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/lcd libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/interface/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/mf libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/mf/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/ls libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/ls/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/ls/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/tr libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/test libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/picard libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples/tutorials/ex10d libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/utils/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit/euler libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit/rk libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit/rk/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/implicit libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/implicit/beuler libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/ts/impls/implicit/cn libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/pseudo libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/pseudo/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/interface libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/interface/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/interface/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/basic libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/basic/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/mapping libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/mapping/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/mapping/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/examples/tutorials libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils/f90-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda/examples libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda/examples/tests libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda/ftn-auto libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ftn-custom libfast in: /home/amit/programs/ParEM/petsc-dev/src/contrib libfast in: /home/amit/programs/ParEM/petsc-dev/src/contrib/fun3d libfast in: /home/amit/programs/ParEM/petsc-dev/src/benchmarks libfast in: /home/amit/programs/ParEM/petsc-dev/src/fortran libfast in: /home/amit/programs/ParEM/petsc-dev/src/fortran/fsrc make[7]: *** No rule to make target `libf'. Stop. 
libfast in: /home/amit/programs/ParEM/petsc-dev/src/docs
libfast in: /home/amit/programs/ParEM/petsc-dev/include
libfast in: /home/amit/programs/ParEM/petsc-dev/include/finclude
libfast in: /home/amit/programs/ParEM/petsc-dev/tutorials
libfast in: /home/amit/programs/ParEM/petsc-dev/tutorials/multiphysics
Completed building libraries
=========================================
Shared libraries disabled
********************************************************************
Error during compile, check linux-gnu-c-debug/conf/make.log
Send it and linux-gnu-c-debug/conf/configure.log to petsc-maint at mcs.anl.gov
********************************************************************
make: [all] Error 1 (ignored)
Running test examples to verify correct installation
make[2]: [ex19.PETSc] Error 2 (ignored)
make[2]: [ex5f.PETSc] Error 2 (ignored)
--------------Error detected during compile or link!-----------------------
See http://www.mcs.anl.gov/petsc/petsc-2/documentation/troubleshooting.html
gfortran -I/home/amit/programs/ParEM/petsc-dev/include/finclude -c -o ex5f.o ex5f.F
In file included from ex5f.F:43:
ex5f.h:32: error: include/finclude/petsc.h: No such file or directory
ex5f.h:33: error: include/finclude/petscvec.h: No such file or directory
ex5f.h:34: error: include/finclude/petscda.h: No such file or directory
ex5f.h:35: error: include/finclude/petscis.h: No such file or directory
ex5f.h:36: error: include/finclude/petscmat.h: No such file or directory
ex5f.h:37: error: include/finclude/petscksp.h: No such file or directory
ex5f.h:38: error: include/finclude/petscpc.h: No such file or directory
ex5f.h:39: error: include/finclude/petscsnes.h: No such file or directory
make[3]: *** [ex5f.o] Error 1
Completed test examples

I guess the tests fail because the program looks for
include/finclude/petsc.h in /include/finclude. What about libf ?

"Matthew Knepley"
Sent by: owner-petsc-users at mcs.anl.gov
To: petsc-users at mcs.anl.gov
Subject: Re: Multilevel solver
04/23/2008 02:43 PM
Please respond to petsc-users at mcs.anl.gov

On Wed, Apr 23, 2008 at 1:32 PM,  wrote:
> Barry,
>
> Is the installation of petsc-dev different from the installation of the
> 2.3.3 release ? I ran the config. But the folder tree seems to be
> different. Hence, make is giving problems.

1) Always always send the error log. I cannot tell anything from the
description "problems".

2) Some things have moved, but of course, make will work with the new
organization.

   Matt

> Amit
>
> Barry Smith
> Sent by: owner-petsc-users at mcs.anl.gov
> To: petsc-users at mcs.anl.gov
> Subject: Re: Multilevel solver
> 04/22/2008 10:08 PM
> Please respond to petsc-users at mcs.anl.gov
>
> Amit,
>
> Using a PCSHELL should be fine (it can be used with GMRES),
> my guess is there is a memory corruption error somewhere that is
> causing the crash.
> This could be tracked down with www.valgrind.com
>
> Another way you could implement this is with some very recent
> additions I made to PCFIELDSPLIT that are in petsc-dev
> (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html)
> With this you would choose
> PCSetType(pc,PCFIELDSPLIT);
> PCFieldSplitSetIS(pc,is1);
> PCFieldSplitSetIS(pc,is2);
> PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE);
> To use LU on A11 use the command line options
> -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly
> and SOR on A22
> -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly
> -fieldsplit_1_pc_sor_lits <its>, where <its> is the number of iterations
> you want to use on block A22.
>
> is1 is the IS that contains the indices for all the vector entries in
> the 1 block, while is2 is all indices in the vector for the 2 block.
> You can use ISCreateGeneral() to create these.
>
> Probably it is easiest just to try this out.
>
> Barry
>
> On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote:
>
> > Hi,
> >
> > I am trying to implement a multilevel method for an EM problem. The
> > reference is: "Comparison of hierarchical basis functions for efficient
> > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger,
> > IET Sci. Meas. Technol. 2007, 1(1), pp 48-52.
> >
> > Here is the summary:
> >
> > The matrix equation Ax=b is solved using GMRES with a multilevel
> > pre-conditioner. A has a block structure:
> >
> > [ A11  A12 ] [ x1 ]   [ b1 ]
> > [ A21  A22 ] [ x2 ] = [ b2 ]
> >
> > A11 is mxm and A22 is nxn, where m is not equal to n.
> >
> > Step 1: Solve A11 * e1 = b1           (parallel LU using SuperLU or MUMPS)
> >
> > Step 2: Solve A22 * e2 = b2 - A21*e1  (might either use a SOR solver or
> >                                        a parallel LU)
> >
> > Step 3: Solve A11 * e1 = b1 - A12*e2  (parallel LU)
> >
> > This gives the approximate solution to
> >
> > [ A11  A12 ] [ e1 ]   [ b1 ]
> > [ A21  A22 ] [ e2 ] = [ b2 ]
> >
> > and is used as the pre-conditioner for the GMRES.
> >
> > Which PETSc method can implement this pre-conditioner ? I tried a
> > PCSHELL type PC. With Hong's help, I also got the parallel LU to work
> > with SuperLU/MUMPS. My program runs successfully on multiple processes
> > on a single machine. But when I submit the program over multiple
> > machines, I get a crash in the PCApply routine after several GMRES
> > iterations. I think this has to do with using PCSHELL with GMRES (which
> > is not a good idea). Is there a different way to implement this ? Does
> > this resemble the usage pattern of one of the AMG preconditioners ?
> >
> > Thanks
> >
> > Rgds,
> > Amit

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener
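The following C sketch is added for illustration and is not part of the original
thread. It wires up the PCFIELDSPLIT setup Barry describes above for the 2x2
block system in Amit's message. Assumptions: a KSP named ksp already holds the
assembled operator; the 1 block occupies the first m rows of the vector and the
2 block the next n rows, so ISCreateStride() stands in for the more general
ISCreateGeneral() Barry mentions; the index sets are built as if running on a
single process (in parallel each rank would pass only its locally owned
indices); and the hypothetical helper SetupBlockPC uses the petsc-dev interface
of that period, in which PCFieldSplitSetIS() took (PC,IS) -- later releases also
take a split-name argument. Error cleanup is omitted.

#include "petscksp.h"

/* Hypothetical helper (editor's sketch): set up a two-way fieldsplit
   preconditioner on an existing KSP.  m and n are the sizes of the
   A11 and A22 blocks under the assumed contiguous row layout. */
PetscErrorCode SetupBlockPC(KSP ksp, PetscInt m, PetscInt n)
{
  PC             pc;
  IS             is1, is2;
  PetscErrorCode ierr;

  ierr = KSPGetPC(ksp, &pc); CHKERRQ(ierr);
  ierr = PCSetType(pc, PCFIELDSPLIT); CHKERRQ(ierr);

  /* 1 block = indices 0..m-1, 2 block = indices m..m+n-1 (assumed layout) */
  ierr = ISCreateStride(PETSC_COMM_WORLD, m, 0, 1, &is1); CHKERRQ(ierr);
  ierr = ISCreateStride(PETSC_COMM_WORLD, n, m, 1, &is2); CHKERRQ(ierr);
  ierr = PCFieldSplitSetIS(pc, is1); CHKERRQ(ierr);
  ierr = PCFieldSplitSetIS(pc, is2); CHKERRQ(ierr);

  /* symmetric multiplicative split: A11 solve, A22 solve, A11 solve,
     mirroring Steps 1-3 of the multilevel preconditioner quoted above */
  ierr = PCFieldSplitSetType(pc, PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE); CHKERRQ(ierr);

  /* per-block solvers are then chosen at run time with the options
     Barry lists, e.g.
       -fieldsplit_0_ksp_type preonly -fieldsplit_0_pc_type lu
       -fieldsplit_1_ksp_type preonly -fieldsplit_1_pc_type sor */
  ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);
  return 0;
}

Since the fieldsplit interface was still changing in petsc-dev at the time, the
exact call signatures and option names are worth checking against the manual
pages of whichever release is actually in use.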
From knepley at gmail.com  Wed Apr 23 15:20:06 2008
From: knepley at gmail.com (Matthew Knepley)
Date: Wed, 23 Apr 2008 15:20:06 -0500
Subject: Multilevel solver
In-Reply-To:
References:
Message-ID:

On Wed, Apr 23, 2008 at 3:05 PM,  wrote:
> Here is my make log.

When you clone petsc-dev, you need to run make allfortranstubs before
'make'. The dev docs will be fixed,

   Matt

> ==========================================
>
> See documentation/faq.html and documentation/bugreporting.html
> for help with installation problems.
Please send EVERYTHING > printed out below when reporting problems > > To subscribe to the PETSc announcement list, send mail to > majordomo at mcs.anl.gov with the message: > subscribe petsc-announce > > To subscribe to the PETSc users mailing list, send mail to > majordomo at mcs.anl.gov with the message: > subscribe petsc-users > > ========================================== > On Wed Apr 23 15:37:17 EDT 2008 on tabla > Machine characteristics: Linux tabla 2.6.22-14-generic #1 SMP Tue Feb 12 07:42:25 UTC 2008 i686 GNU/Linux > ----------------------------------------- > Using PETSc directory: /home/amit/programs/ParEM/petsc-dev > Using PETSc arch: linux-gnu-c-debug > ----------------------------------------- > PETSC_VERSION_RELEASE 0 > PETSC_VERSION_MAJOR 2 > PETSC_VERSION_MINOR 3 > PETSC_VERSION_SUBMINOR 3 > PETSC_VERSION_PATCH 12 > PETSC_VERSION_DATE "May, 23, 2007" > PETSC_VERSION_PATCH_DATE "unknown" > PETSC_VERSION_HG "unknown" > ----------------------------------------- > Using configure Options: --PETSC_ARCH=linux-gnu-c-debug --with-scalar-type=complex --with-debugging=yes --with-clanguage=cxx --with-mpi=1 > --download-mpich=1 --with-metis=1 --download-metis=1 --with-parmetis=1 --download-parmetis=1 --with-superlu_dist=1 --download-superlu_dist=1 > --with-mumps=1 --download-blacs=1 --download-scalapack=1 --download-mumps=1 COPTFLAGS="-O3 -march=p4 -mtune=p4 -ffast-math -malign-double > -funroll-loops -pipe -fomit-frame-pointer -finline-functions -msse2" CXXOPTFLAGS="-O3 -march=p4 -mtune=p4 -ffast-math -malign-double -funroll-loops > -pipe -fomit-frame-pointer -finline-functions -msse2" --with-shared=0 > Using configuration flags: > #define INCLUDED_PETSCCONF_H > #define IS_COLORING_MAX 65535 > #define STDC_HEADERS 1 > #define MPIU_COLORING_VALUE MPI_UNSIGNED_SHORT > #define PETSC_HAVE_SUPERLU_DIST 1 > #define PETSC_STATIC_INLINE static inline > #define PETSC_HAVE_BLACS 1 > #define PETSC_HAVE_MUMPS 1 > #define PETSC_DIR_SEPARATOR '/' > #define PETSC_HAVE_BLASLAPACK 1 > #define PETSC_PATH_SEPARATOR ':' > #define PETSC_REPLACE_DIR_SEPARATOR '\\' > #define PETSC_HAVE_DOUBLE_ALIGN_MALLOC 1 > #define PETSC_RESTRICT __restrict__ > #define PETSC_HAVE_X11 1 > #define PETSC_HAVE_SOWING 1 > #define PETSC_HAVE_SCALAPACK 1 > #define PETSC_HAVE_MPI 1 > #define PETSC_USE_SOCKET_VIEWER 1 > #define PETSC_HAVE_PARMETIS 1 > #define PETSC_HAVE_C2HTML 1 > #define PETSC_HAVE_FORTRAN 1 > #define PETSC_HAVE_STRING_H 1 > #define PETSC_HAVE_SYS_TYPES_H 1 > #define PETSC_HAVE_ENDIAN_H 1 > #define PETSC_HAVE_SYS_PROCFS_H 1 > #define PETSC_HAVE_LINUX_KERNEL_H 1 > #define PETSC_HAVE_TIME_H 1 > #define PETSC_HAVE_MATH_H 1 > #define PETSC_HAVE_STDLIB_H 1 > #define PETSC_HAVE_SYS_PARAM_H 1 > #define PETSC_HAVE_SYS_SOCKET_H 1 > #define PETSC_HAVE_UNISTD_H 1 > #define PETSC_HAVE_SYS_WAIT_H 1 > #define PETSC_HAVE_LIMITS_H 1 > #define PETSC_HAVE_SEARCH_H 1 > #define PETSC_HAVE_NETINET_IN_H 1 > #define PETSC_HAVE_FLOAT_H 1 > #define PETSC_HAVE_SYS_SYSINFO_H 1 > #define PETSC_HAVE_SYS_RESOURCE_H 1 > #define PETSC_HAVE_SYS_TIMES_H 1 > #define PETSC_HAVE_NETDB_H 1 > #define PETSC_HAVE_MALLOC_H 1 > #define PETSC_HAVE_PWD_H 1 > #define PETSC_HAVE_FCNTL_H 1 > #define PETSC_HAVE_STRINGS_H 1 > #define PETSC_HAVE_MEMORY_H 1 > #define PETSC_TIME_WITH_SYS_TIME 1 > #define PETSC_HAVE_SYS_TIME_H 1 > #define PETSC_HAVE_SYS_UTSNAME_H 1 > #define PETSC_USING_F90 1 > #define PETSC_PRINTF_FORMAT_CHECK(A,B) __attribute__((format (printf, A, B))) > #define PETSC_C_STATIC_INLINE static inline > #define PETSC_HAVE_FORTRAN_UNDERSCORE 1 > 
#define PETSC_HAVE_CXX_NAMESPACE 1 > #define PETSC_C_RESTRICT __restrict__ > #define PETSC_USE_F90_SRC_IMPL 1 > #define PETSC_CXX_RESTRICT __restrict__ > #define PETSC_CXX_STATIC_INLINE static inline > #define PETSC_HAVE_LIBBLAS 1 > #define PETSC_HAVE_LIBDMUMPS 1 > #define PETSC_HAVE_LIBZMUMPS 1 > #define PETSC_HAVE_LIBSCALAPACK 1 > #define PETSC_HAVE_LIBM 1 > #define PETSC_HAVE_LIBMETIS 1 > #define PETSC_HAVE_LIBLAPACK 1 > #define PETSC_HAVE_LIBCMUMPS 1 > #define PETSC_HAVE_LIBSMUMPS 1 > #define PETSC_HAVE_LIBGCC_S 1 > #define PETSC_HAVE_LIBPORD 1 > #define PETSC_HAVE_LIBGFORTRANBEGIN 1 > #define PETSC_HAVE_ERF 1 > #define PETSC_HAVE_LIBSUPERLU_DIST_2 1 > #define PETSC_HAVE_LIBBLACS 1 > #define PETSC_HAVE_LIBPARMETIS 1 > #define PETSC_HAVE_LIBGFORTRAN 1 > #define PETSC_ARCH_NAME "linux-gnu-c-debug" > #define PETSC_ARCH linux > #define PETSC_DIR /home/amit/programs/ParEM/petsc-dev > #define PETSC_CLANGUAGE_CXX 1 > #define PETSC_USE_ERRORCHECKING 1 > #define PETSC_MISSING_DREAL 1 > #define PETSC_SIZEOF_MPI_COMM 4 > #define PETSC_BITS_PER_BYTE 8 > #define PETSC_SIZEOF_MPI_FINT 4 > #define PETSC_SIZEOF_VOID_P 4 > #define PETSC_RETSIGTYPE void > #define PETSC_HAVE_CXX_COMPLEX 1 > #define PETSC_SIZEOF_LONG 4 > #define PETSC_USE_FORTRANKIND 1 > #define PETSC_SIZEOF_SIZE_T 4 > #define PETSC_SIZEOF_CHAR 1 > #define PETSC_SIZEOF_DOUBLE 8 > #define PETSC_SIZEOF_FLOAT 4 > #define PETSC_HAVE_C99_COMPLEX 1 > #define PETSC_SIZEOF_INT 4 > #define PETSC_SIZEOF_LONG_LONG 8 > #define PETSC_SIZEOF_SHORT 2 > #define PETSC_HAVE_STRCASECMP 1 > #define PETSC_HAVE_ISNAN 1 > #define PETSC_HAVE_POPEN 1 > #define PETSC_HAVE_SIGSET 1 > #define PETSC_HAVE_GETWD 1 > #define PETSC_HAVE_TIMES 1 > #define PETSC_HAVE_SNPRINTF 1 > #define PETSC_HAVE_GETPWUID 1 > #define PETSC_HAVE_ISINF 1 > #define PETSC_HAVE_GETHOSTBYNAME 1 > #define PETSC_HAVE_SLEEP 1 > #define PETSC_HAVE_FORK 1 > #define PETSC_HAVE_RAND 1 > #define PETSC_HAVE_GETTIMEOFDAY 1 > #define PETSC_HAVE_UNAME 1 > #define PETSC_HAVE_GETHOSTNAME 1 > #define PETSC_HAVE_MKSTEMP 1 > #define PETSC_HAVE_SIGACTION 1 > #define PETSC_HAVE_DRAND48 1 > #define PETSC_HAVE_VA_COPY 1 > #define PETSC_HAVE_CLOCK 1 > #define PETSC_HAVE_ACCESS 1 > #define PETSC_HAVE_SIGNAL 1 > #define PETSC_HAVE_GETRUSAGE 1 > #define PETSC_HAVE_MEMALIGN 1 > #define PETSC_HAVE_GETDOMAINNAME 1 > #define PETSC_HAVE_TIME 1 > #define PETSC_HAVE_LSEEK 1 > #define PETSC_HAVE_SOCKET 1 > #define PETSC_HAVE_SYSINFO 1 > #define PETSC_HAVE_READLINK 1 > #define PETSC_HAVE_REALPATH 1 > #define PETSC_HAVE_MEMMOVE 1 > #define PETSC_HAVE__GFORTRAN_IARGC 1 > #define PETSC_SIGNAL_CAST > #define PETSC_HAVE_GETCWD 1 > #define PETSC_HAVE_VPRINTF 1 > #define PETSC_HAVE_BZERO 1 > #define PETSC_HAVE_GETPAGESIZE 1 > #define PETSC_USE_COMPLEX 1 > #define PETSC_USE_GDB_DEBUGGER 1 > #define PETSC_HAVE_GFORTRAN_IARGC 1 > #define PETSC_USE_DEBUG 1 > #define PETSC_USE_INFO 1 > #define PETSC_USE_LOG 1 > #define PETSC_IS_COLOR_VALUE_TYPE short > #define PETSC_USE_CTABLE 1 > #define PETSC_USE_PROC_FOR_SIZE 1 > #define PETSC_HAVE_MPI_COMM_C2F 1 > #define PETSC_HAVE_MPI_COMM_F2C 1 > #define PETSC_HAVE_MPI_FINT 1 > #define PETSC_HAVE_MPI_F90MODULE 1 > #define PETSC_HAVE_MPI_ALLTOALLW 1 > #define PETSC_HAVE_MPI_COMM_SPAWN 1 > #define PETSC_HAVE_MPI_WIN_CREATE 1 > #define PETSC_HAVE_MPI_FINALIZED 1 > #define HAVE_GZIP 1 > #define PETSC_BLASLAPACK_UNDERSCORE 1 > ----------------------------------------- > Using include paths: -I/home/amit/programs/ParEM/petsc-dev -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include > 
-I/home/amit/programs/ParEM/petsc-dev/include -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include > -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include > -I/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/include > ------------------------------------------ > Using C/C++ compiler: /home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/bin/mpicxx > C/C++ Compiler version: > Using Fortran compiler: gfortran -g > Fortran Compiler version: > ----------------------------------------- > Using C/C++ linker: > Using Fortran linker: /home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/bin/mpif90 > ----------------------------------------- > Using libraries: -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib > -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -lX11 > -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lsuperlu_dist_2.2 > -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lparmetis -lmetis > -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lcmumps -ldmumps > -lsmumps -lzmumps -lpord -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib > -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lscalapack -Wl,-rpath,/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib > -L/home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib -lblacs -llapack -lblas -L/usr/lib/gcc/i486-linux-gnu/4.1.3 -L/lib -lgcc_s > -lgfortranbegin -lgfortran -lm -L/usr/lib/gcc/i486-linux-gnu/4.2.1 -lm -lstdc++ -lstdc++ -lgcc_s > ------------------------------------------ > Using mpiexec: /home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/bin/mpiexec > ========================================== > /bin/rm -f -f /home/amit/programs/ParEM/petsc-dev/linux-gnu-c-debug/lib/libpetsc*.* > BEGINNING TO COMPILE LIBRARIES IN ALL DIRECTORIES > ========================================= > libfast in: /home/amit/programs/ParEM/petsc-dev/src > libfast in: /home/amit/programs/ParEM/petsc-dev/src/inline > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/socket > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/socket/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/ascii > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/ascii/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/ascii/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/binary/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/string > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/string/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/draw > 
libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/draw/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/draw/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/impls/vu > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/viewer/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls/x > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls/x/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/impls/ps > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/draw/utils/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/error > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/error/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/error/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/dll > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/dll/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/fileio > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/fileio/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/fileio/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/memory > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/memory/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/memory/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/objects > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/objects/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/objects/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/time > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/time/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/utils/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/plog/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random > libfast 
in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/impls/rand > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/random/impls/rand48 > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/bag > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/bag/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/bag/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/verbose > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/verbose/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/verbose/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-src > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-src/fsrc > make[8]: *** No rule to make target `libf'. Stop. > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-mod > make[7]: *** No rule to make target `petscmod.o'. Stop. > make[6]: *** [buildmod] Error 2 > make[5]: [libfast] Error 2 (ignored) > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/utils/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/interface/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/seq > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/seq/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/seq/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/mpi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/mpi/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/mpi/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/shared > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/impls/shared/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/vec/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/interface/f90-custom > libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/vec/is/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/general > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/general/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/stride > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/stride/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/block > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/block/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/block/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/impls/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/is/utils/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/impls/constant > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/impls/string > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/pf/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/vec/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/interface/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/seq > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/seq/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/mpi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/mpi/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/dense/mpi/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/csrperm > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/crl > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/seq/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/superlu_dist > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/mumps > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/csrperm > libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/crl > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/mpi/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/aij/aij > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/shell > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/shell/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/shell/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/seq > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/seq/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/seq/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/mpi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/mpi/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/baij/mpi/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/adj > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/adj/mpi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/adj/mpi/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/maij > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/is > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/is/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/seq > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/seq/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/seq/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/mpi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/sbaij/mpi/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/normal > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/normal/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/lrc > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/lrc/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/scatter > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/scatter/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/blockmat > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/blockmat/seq > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/composite > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/composite/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/mffd > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/mffd/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/impls/mffd/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/utils/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/matfd > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/matfd/ftn-auto > libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/mat/matfd/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/impls/pmetis > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/impls/pmetis/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/partition/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/order > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/order/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/order/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/color > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/color/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/mat/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/jacobi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/jacobi/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/none > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/sor > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/sor/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/shell > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/shell/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/bjacobi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/bjacobi/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/bjacobi/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/dmmg > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/dmmg/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mg/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/eisens > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/eisens/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asm > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asm/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asm/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/ksp > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/ksp/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/composite > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/composite/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/composite/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/redundant > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/redundant/ftn-auto > libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/is > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/is/nn > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/pbjacobi > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/mat > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/fieldsplit > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/fieldsplit/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/lu > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/lu/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/lu/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/ilu > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/ilu/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/icc > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/cholesky > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/cholesky/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/factor/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/galerkin > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/galerkin/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/openmp > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/asa > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/impls/cp > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/pc/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface > iguess.c: In function ???PetscErrorCode KSPFischerGuessFormGuess_Method1(KSPFischerGuess_Method1*, _p_Vec*, _p_Vec*)???: > iguess.c:79: warning: cannot pass objects of non-POD type ???struct std::complex??? through ???...???; call will abort at runtime > iguess.c: In function ???PetscErrorCode KSPFischerGuessFormGuess_Method2(KSPFischerGuess_Method2*, _p_Vec*, _p_Vec*)???: > iguess.c:198: warning: cannot pass objects of non-POD type ???struct std::complex??? 
through ???...???; call will abort at runtime > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/interface/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cr > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bcgs > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bcgsl > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bcgsl/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/cgne > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/gltr > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/gltr/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/nash > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/nash/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/stcg > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/stcg/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cg/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cgs > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/lgmres > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/fgmres > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/fgmres/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/fgmres/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/gmres/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cheby > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/cheby/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/rich > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/rich/ftn-autolibfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/lsqr > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/preonly > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/tcqmr > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/tfqmr > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/qcg > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/qcg/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/bicg > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/minres > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/symmlq > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/impls/lcd > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ksp/ksp/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/interface/ftn-custom > libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/snes/interface/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/mf > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/mf/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/ls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/ls/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/ls/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/tr > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/test > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/impls/picard > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/examples/tutorials/ex10d > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/utils/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/snes/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit/euler > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit/rk > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/explicit/rk/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/implicit > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/implicit/beuler > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/implicit/cn > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/pseudo > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/impls/pseudo/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/ts/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/interface > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/interface/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/interface/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/basic > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/basic/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/mapping > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/mapping/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/impls/mapping/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ao/examples/tests > libfast in: 
/home/amit/programs/ParEM/petsc-dev/src/dm/ao/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/src/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/examples/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/da/utils/f90-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda/examples > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda/examples/tests > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/adda/ftn-auto > libfast in: /home/amit/programs/ParEM/petsc-dev/src/dm/ftn-custom > libfast in: /home/amit/programs/ParEM/petsc-dev/src/contrib > libfast in: /home/amit/programs/ParEM/petsc-dev/src/contrib/fun3d > libfast in: /home/amit/programs/ParEM/petsc-dev/src/benchmarks > libfast in: /home/amit/programs/ParEM/petsc-dev/src/fortran > libfast in: /home/amit/programs/ParEM/petsc-dev/src/fortran/fsrc > make[7]: *** No rule to make target `libf'. Stop. > libfast in: /home/amit/programs/ParEM/petsc-dev/src/docs > libfast in: /home/amit/programs/ParEM/petsc-dev/include > libfast in: /home/amit/programs/ParEM/petsc-dev/include/finclude > libfast in: /home/amit/programs/ParEM/petsc-dev/tutorials > libfast in: /home/amit/programs/ParEM/petsc-dev/tutorials/multiphysics > Completed building libraries > ========================================= > Shared libraries disabled > ******************************************************************** > Error during compile, check linux-gnu-c-debug/conf/make.log > Send it and linux-gnu-c-debug/conf/configure.log to petsc-maint at mcs.anl.gov > ******************************************************************** > make: [all] Error 1 (ignored) > Running test examples to verify correct installation > make[2]: [ex19.PETSc] Error 2 (ignored) > make[2]: [ex5f.PETSc] Error 2 (ignored) > --------------Error detected during compile or link!----------------------- > See http://www.mcs.anl.gov/petsc/petsc-2/documentation/troubleshooting.html > gfortran -I/home/amit/programs/ParEM/petsc-dev/include/finclude -c -o ex5f.o ex5f.F > In file included from ex5f.F:43: > ex5f.h:32: error: include/finclude/petsc.h: No such file or directory > ex5f.h:33: error: include/finclude/petscvec.h: No such file or directory > ex5f.h:34: error: include/finclude/petscda.h: No such file or directory > ex5f.h:35: error: include/finclude/petscis.h: No such file or directory > ex5f.h:36: error: include/finclude/petscmat.h: No such file or directory > ex5f.h:37: error: include/finclude/petscksp.h: No such file or directory > ex5f.h:38: error: include/finclude/petscpc.h: No such file or directory > ex5f.h:39: error: include/finclude/petscsnes.h: No such file or directory > make[3]: *** [ex5f.o] Error 1 > Completed test examples > > > I guess the tests fail because the program looks for > include/finclude/petsc.h in 
/include/finclude. What about libf ? > > > > > > > "Matthew Knepley" > > m> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: Multilevel solver > > > 04/23/2008 02:43 > > > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > On Wed, Apr 23, 2008 at 1:32 PM, wrote: > > Barry, > > > > Is the installation of petsc-dev different from the installation of the > > 2.3.3 release ? I ran the config. But the folder tree seems to be > > different. Hence, make is giving problems. > > 1) Always always send the error log. I cannot tell anything from the > description "problems". > > 2) Some things have moved, but of course, make will work with the new > organization. > > Matt > > > Amit > > > > Barry Smith > > > ov> > To > > Sent by: petsc-users at mcs.anl.gov > > owner-petsc-users > cc > > @mcs.anl.gov > > No Phone Info > Subject > > Available Re: Multilevel solver > > > > > > 04/22/2008 10:08 > > PM > > > > > > Please respond to > > petsc-users at mcs.a > > nl.gov > > > > > > > > > > > > > > Amit, > > > > Using a a PCSHELL should be fine (it can be used with GMRES), > > my guess is there is a memory corruption error somewhere that is > > causing the crash. This could be tracked down with www.valgrind.com > > > > Another way to you could implement this is with some very recent > > additions I made to PCFIELDSPLIT that are in petsc-dev > > (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) > > With this you would chose > > PCSetType(pc,PCFIELDSPLIT > > PCFieldSplitSetIS(pc,is1 > > PCFieldSplitSetIS(pc,is2 > > PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > > to use LU on A11 use the command line options > > -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly > > and SOR on A22 > > -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - > > fieldsplit_1_pc_sor_lits where > > is the number of iterations you want to use block A22 > > > > is1 is the IS that contains the indices for all the vector entries in > > the 1 block while is2 is all indices in the > > vector for the 2 block. You can use ISCreateGeneral() to create these. > > > > Probably it is easiest just to try this out. > > > > Barry > > > > > > On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > > > > > > > Hi, > > > > > > I am trying to implement a multilevel method for an EM problem. The > > > reference is : "Comparison of hierarchical basis functions for > > > efficient > > > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > > > IET > > > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > > > > > Here is the summary: > > > > > > The matrix equation Ax=b is solved using GMRES with a multilevel > > > pre-conditioner. A has a block structure. > > > > > > A11 A12 * x1 = b1 > > > A21 A22 x2 b2 > > > > > > A11 is mxm and A33 is nxn, where m is not equal to n. > > > > > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > > > superLU or > > > MUMPS) > > > > > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > > > a SOR > > > solver or a parallel LU) > > > > > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > > > > > This gives the approximate solution to > > > > > > A11 A12 * e1 = b1 > > > A21 A22 e2 b2 > > > > > > and is used as the pre-conditioner for the GMRES. > > > > > > > > > Which PetSc method can implement this pre-conditioner ? I tried a > > > PCSHELL > > > type PC. With Hong's help, I also got the parallel LU to work > > > withSuperLU/MUMPS. 
My program runs successfully on multiple > > > processes on a > > > single machine. But when I submit the program over multiple > > > machines, I get > > > a crash in the PCApply routine after several GMRES iterations. I > > > think this > > > has to do with using PCSHELL with GMRES (which is not a good idea). Is > > > there a different way to implement this ? Does this resemble the usage > > > pattern of one of the AMG preconditioners ? > > > > > > > > > Thanks > > > > > > Rgds, > > > Amit > > > > > > > > > > > > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From balay at mcs.anl.gov Wed Apr 23 15:28:46 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 23 Apr 2008 15:28:46 -0500 (CDT) Subject: Multilevel solver In-Reply-To: References: Message-ID: On Wed, 23 Apr 2008, Amit.Itagi at seagate.com wrote: > Using configure Options: --PETSC_ARCH=linux-gnu-c-debug > --with-scalar-type=complex --with-debugging=yes --with-clanguage=cxx > --with-mpi=1 --download-mpich=1 --with-metis=1 --download-metis=1 > --with-parmetis=1 --download-parmetis=1 --with-superlu_dist=1 > --download-superlu_dist=1 --with-mumps=1 --download-blacs=1 > --download-scalapack=1 --download-mumps=1 COPTFLAGS="-O3 -march=p4 > -mtune=p4 -ffast-math -malign-double -funroll-loops -pipe > -fomit-frame-pointer -finline-functions -msse2" CXXOPTFLAGS="-O3 > -march=p4 -mtune=p4 -ffast-math -malign-double -funroll-loops -pipe > -fomit-frame-pointer -finline-functions -msse2" --with-shared=0 For one - when debugging - you should not use optimization flags "-O3 etc.." Can you send the corresponding configure.log to petsc-maint at mcs.anl.gov? Also - what do you have for 'hg status' Satish From balay at mcs.anl.gov Wed Apr 23 15:31:51 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 23 Apr 2008 15:31:51 -0500 (CDT) Subject: Multilevel solver In-Reply-To: References: Message-ID: On Wed, 23 Apr 2008, Matthew Knepley wrote: > On Wed, Apr 23, 2008 at 3:05 PM, wrote: > > Here is my make log. > > When you clone petsc-dev, you need to run > > make allfortranstubs > > before 'make'. The dev docs will be fixed, Matt, Its not fortranstubs issue. For one configure should regenerate them. However > > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-src/fsrc > > make[8]: *** No rule to make target `libf'. Stop. > > libfast in: /home/amit/programs/ParEM/petsc-dev/src/sys/f90-mod > > make[7]: *** No rule to make target `petscmod.o'. Stop. > > make[6]: *** [buildmod] Error 2 > > make[5]: [libfast] Error 2 (ignored) The locations are not ftn-auto [which would correspond to the stubs]. 
Satish From balay at mcs.anl.gov Wed Apr 23 15:33:31 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Wed, 23 Apr 2008 15:33:31 -0500 (CDT) Subject: Multilevel solver In-Reply-To: References: Message-ID: On Wed, 23 Apr 2008, Satish Balay wrote: > On Wed, 23 Apr 2008, Amit.Itagi at seagate.com wrote: > > > Using configure Options: --PETSC_ARCH=linux-gnu-c-debug > > --with-scalar-type=complex --with-debugging=yes --with-clanguage=cxx > > --with-mpi=1 --download-mpich=1 --with-metis=1 --download-metis=1 > > --with-parmetis=1 --download-parmetis=1 --with-superlu_dist=1 > > --download-superlu_dist=1 --with-mumps=1 --download-blacs=1 > > --download-scalapack=1 --download-mumps=1 COPTFLAGS="-O3 -march=p4 > > -mtune=p4 -ffast-math -malign-double -funroll-loops -pipe > > -fomit-frame-pointer -finline-functions -msse2" CXXOPTFLAGS="-O3 > > -march=p4 -mtune=p4 -ffast-math -malign-double -funroll-loops -pipe > > -fomit-frame-pointer -finline-functions -msse2" --with-shared=0 > > > For one - when debugging - you should not use optimization flags "-O3 etc.." > > Can you send the corresponding configure.log to petsc-maint at mcs.anl.gov? > > Also - what do you have for 'hg status' Ah - I think you pulled petsc-dev - but not BuildSystem. If you are using petsc-dev - you should be pulling both. http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html Satish From nliu at fit.edu Wed Apr 23 22:04:01 2008 From: nliu at fit.edu (Ningyu Liu) Date: Wed, 23 Apr 2008 23:04:01 -0400 Subject: Question on TS_EULER Message-ID: <7091BC9B-1920-4E22-A703-354D87BEA388@fit.edu> Hello, Is there a way by which the timestep of the explicit forward Euler method can be modified during the iterations? Looking at the source code of the method, the timestep dt is set when entering the function TSStep_Euler(). The iteration proceeds with this fixed timestep even calling TSSetTimeStep() in a monitoring function. I personally find it's a bit confusing. The actual solution is obtained with fixed timestep. However, the time returned from calling TSGetTime() takes into account any modifications made by the user. Thanks. Regards, Ningyu -------------- next part -------------- An HTML attachment was scrubbed... URL: From zonexo at gmail.com Thu Apr 24 04:11:25 2008 From: zonexo at gmail.com (Ben Tay) Date: Thu, 24 Apr 2008 17:11:25 +0800 Subject: Using PETSc libraries with MS Compute cluster and MS MPI Message-ID: <48104EBD.3080104@gmail.com> Hi, I'm trying to run my mpi code on the MS Compute cluster which my school just installed. Unfortunately, it failed without giving any error msg. I am using just a test example ex2f. I read in the MS website that there is no need to use MS MPI to compile the code or library. Anyway, I also tried to compile PETSc with MS MPI but I'm not able to get pass ./configure. It always complains that there is something wrong with the MS MPI. Is there anyone who has experience in these? Thank you very much. Regards. From knepley at gmail.com Thu Apr 24 08:54:07 2008 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 24 Apr 2008 08:54:07 -0500 Subject: Question on TS_EULER In-Reply-To: <7091BC9B-1920-4E22-A703-354D87BEA388@fit.edu> References: <7091BC9B-1920-4E22-A703-354D87BEA388@fit.edu> Message-ID: That is a bug. I have just fixed it in petsc-dev. 
You can easily fix it in your copy (if you are using the release) by changing line 41 to ierr = VecAXPY(sol,ts->time_step,update);CHKERRQ(ierr); Matt On Wed, Apr 23, 2008 at 10:04 PM, Ningyu Liu wrote: > Hello, > > Is there a way by which the timestep of the explicit forward Euler method > can be modified during the iterations? Looking at the source code of the > method, the timestep dt is set when entering the function TSStep_Euler(). > The iteration proceeds with this fixed timestep even calling TSSetTimeStep() > in a monitoring function. I personally find it's a bit confusing. The actual > solution is obtained with fixed timestep. However, the time returned from > calling TSGetTime() takes into account any modifications made by the user. > Thanks. > > Regards, > > Ningyu -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From Amit.Itagi at seagate.com Thu Apr 24 08:58:48 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Thu, 24 Apr 2008 09:58:48 -0400 Subject: Multilevel solver In-Reply-To: Message-ID: Barry, I have been trying out the PCFIELDSPLIT. I have not yet gotten it to work. I have some follow up questions which might help solve my problem. Consider the simple case of a 4x4 matrix equation being solved on two processes. I have vector elements 0 and 1 belonging to rank 0, and elements 2 and 3 belonging to rank 1. 1) For my example, can the index sets have staggered indices i.e. is1-> 0,2 and is2->1,3 (each is spans across ranks) ? 2) When I provide the -field_split__pc_type option on the command line, is the index in the same order that the PCFieldSplitSetIS function called in ? So if I have PCFieldSplitSetIS(pc,is2) before PCFieldSplitSetIS(pc,is1), will -field_split_0_... correspond to is2 and -field_split_1_... to is1 ? 3) Since I want to set PC type to lu for field 0, and I want to use MUMPS for parallel LU, where do I set the submatrix type to MATAIJMUMPS ? In this case, will a second copy of the submatrix be generated - one of type MUMPS for the PC and the other of the original MATAIJ type for the KSP ? 4) How is the PC applied when I do PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE ? Thanks Rgds, Amit Barry Smith To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: Multilevel solver 04/22/2008 10:08 PM Please respond to petsc-users at mcs.a nl.gov Amit, Using a a PCSHELL should be fine (it can be used with GMRES), my guess is there is a memory corruption error somewhere that is causing the crash. This could be tracked down with www.valgrind.com Another way to you could implement this is with some very recent additions I made to PCFIELDSPLIT that are in petsc-dev (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) With this you would chose PCSetType(pc,PCFIELDSPLIT PCFieldSplitSetIS(pc,is1 PCFieldSplitSetIS(pc,is2 PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE to use LU on A11 use the command line options -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly and SOR on A22 -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - fieldsplit_1_pc_sor_lits where is the number of iterations you want to use block A22 is1 is the IS that contains the indices for all the vector entries in the 1 block while is2 is all indices in the vector for the 2 block. You can use ISCreateGeneral() to create these. Probably it is easiest just to try this out. 
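Written out in full, the sequence of calls in the recipe quoted above is roughly the sketch below. It follows the petsc-dev interface as described in this thread and assumes the PC object pc and the two index sets is1 and is2 are created elsewhere by the calling code.

   PetscErrorCode ierr;
   ierr = PCSetType(pc,PCFIELDSPLIT);CHKERRQ(ierr);
   ierr = PCFieldSplitSetIS(pc,is1);CHKERRQ(ierr);   /* first call registers the "0" split */
   ierr = PCFieldSplitSetIS(pc,is2);CHKERRQ(ierr);   /* second call registers the "1" split */
   ierr = PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE);CHKERRQ(ierr);

The -fieldsplit_0_* and -fieldsplit_1_* options listed above (preonly with lu on the first block, preonly with sor on the second) then pick the solver applied to each block.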
Barry On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > Hi, > > I am trying to implement a multilevel method for an EM problem. The > reference is : "Comparison of hierarchical basis functions for > efficient > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > IET > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > Here is the summary: > > The matrix equation Ax=b is solved using GMRES with a multilevel > pre-conditioner. A has a block structure. > > A11 A12 * x1 = b1 > A21 A22 x2 b2 > > A11 is mxm and A33 is nxn, where m is not equal to n. > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > superLU or > MUMPS) > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > a SOR > solver or a parallel LU) > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > This gives the approximate solution to > > A11 A12 * e1 = b1 > A21 A22 e2 b2 > > and is used as the pre-conditioner for the GMRES. > > > Which PetSc method can implement this pre-conditioner ? I tried a > PCSHELL > type PC. With Hong's help, I also got the parallel LU to work > withSuperLU/MUMPS. My program runs successfully on multiple > processes on a > single machine. But when I submit the program over multiple > machines, I get > a crash in the PCApply routine after several GMRES iterations. I > think this > has to do with using PCSHELL with GMRES (which is not a good idea). Is > there a different way to implement this ? Does this resemble the usage > pattern of one of the AMG preconditioners ? > > > Thanks > > Rgds, > Amit > From knepley at gmail.com Thu Apr 24 09:19:47 2008 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 24 Apr 2008 09:19:47 -0500 Subject: Multilevel solver In-Reply-To: References: Message-ID: On Thu, Apr 24, 2008 at 8:58 AM, wrote: > Barry, > > I have been trying out the PCFIELDSPLIT. I have not yet gotten it to work. > I have some follow up questions which might help solve my problem. > > Consider the simple case of a 4x4 matrix equation being solved on two > processes. I have vector elements 0 and 1 belonging to rank 0, and elements > 2 and 3 belonging to rank 1. > > 1) For my example, can the index sets have staggered indices i.e. is1-> 0,2 > and is2->1,3 (each is spans across ranks) ? Yes. > 2) When I provide the -field_split__pc_type option on the command line, > is the index in the same order that the PCFieldSplitSetIS function > called in ? > So if I have PCFieldSplitSetIS(pc,is2) before > PCFieldSplitSetIS(pc,is1), will -field_split_0_... correspond to is2 and > -field_split_1_... to is1 ? Yes. > 3) Since I want to set PC type to lu for field 0, and I want to use MUMPS > for parallel LU, where do I set the submatrix type to MATAIJMUMPS ? In this > case, will a second copy of the submatrix be generated - one of type MUMPS > for the PC and the other of the original MATAIJ type for the KSP ? I will have to check. However if we are consistent, then it should be -field_split_0_mat_type aijmumps > 4) How is the PC applied when I do PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE ? It is just the composition of the preconditioners, which is what you want here. 
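For the 4x4 layout in the questions above (vector rows 0 and 1 on rank 0, rows 2 and 3 on rank 1, with the staggered split is1 = {0,2} and is2 = {1,3}), a minimal sketch of the index-set creation could look like the following. It is only an illustration of the answers above, not code from the thread: rank is assumed to hold the MPI rank, and having each process pass just the split entries it owns is an assumption, not something settled here.

   PetscErrorCode ierr;
   PetscInt       idx1[1],idx2[1];
   IS             is1,is2;                       /* is1 = {0,2}, is2 = {1,3} globally */
   if (rank == 0) { idx1[0] = 0; idx2[0] = 1; }  /* rank 0 owns vector rows 0 and 1 */
   else           { idx1[0] = 2; idx2[0] = 3; }  /* rank 1 owns vector rows 2 and 3 */
   ierr = ISCreateGeneral(PETSC_COMM_WORLD,1,idx1,&is1);CHKERRQ(ierr);
   ierr = ISCreateGeneral(PETSC_COMM_WORLD,1,idx2,&is2);CHKERRQ(ierr);

Whichever index set is handed to PCFieldSplitSetIS() first becomes the fieldsplit_0 block; the one registered second becomes fieldsplit_1.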
Matt > Thanks > > Rgds, > Amit > > > > > Barry Smith > ov> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: Multilevel solver > > > 04/22/2008 10:08 > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > Amit, > > Using a a PCSHELL should be fine (it can be used with GMRES), > my guess is there is a memory corruption error somewhere that is > causing the crash. This could be tracked down with www.valgrind.com > > Another way to you could implement this is with some very recent > additions I made to PCFIELDSPLIT that are in petsc-dev > (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) > With this you would chose > PCSetType(pc,PCFIELDSPLIT > PCFieldSplitSetIS(pc,is1 > PCFieldSplitSetIS(pc,is2 > PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > to use LU on A11 use the command line options > -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly > and SOR on A22 > -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - > fieldsplit_1_pc_sor_lits where > is the number of iterations you want to use block A22 > > is1 is the IS that contains the indices for all the vector entries in > the 1 block while is2 is all indices in the > vector for the 2 block. You can use ISCreateGeneral() to create these. > > Probably it is easiest just to try this out. > > Barry > > > On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > > > > Hi, > > > > I am trying to implement a multilevel method for an EM problem. The > > reference is : "Comparison of hierarchical basis functions for > > efficient > > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > > IET > > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > > > Here is the summary: > > > > The matrix equation Ax=b is solved using GMRES with a multilevel > > pre-conditioner. A has a block structure. > > > > A11 A12 * x1 = b1 > > A21 A22 x2 b2 > > > > A11 is mxm and A33 is nxn, where m is not equal to n. > > > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > > superLU or > > MUMPS) > > > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > > a SOR > > solver or a parallel LU) > > > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > > > This gives the approximate solution to > > > > A11 A12 * e1 = b1 > > A21 A22 e2 b2 > > > > and is used as the pre-conditioner for the GMRES. > > > > > > Which PetSc method can implement this pre-conditioner ? I tried a > > PCSHELL > > type PC. With Hong's help, I also got the parallel LU to work > > withSuperLU/MUMPS. My program runs successfully on multiple > > processes on a > > single machine. But when I submit the program over multiple > > machines, I get > > a crash in the PCApply routine after several GMRES iterations. I > > think this > > has to do with using PCSHELL with GMRES (which is not a good idea). Is > > there a different way to implement this ? Does this resemble the usage > > pattern of one of the AMG preconditioners ? > > > > > > Thanks > > > > Rgds, > > Amit > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener From Amit.Itagi at seagate.com Thu Apr 24 11:07:08 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Thu, 24 Apr 2008 12:07:08 -0400 Subject: Multilevel solver In-Reply-To: Message-ID: "Matthew Knepley" To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: Multilevel solver 04/24/2008 10:19 AM Please respond to petsc-users at mcs.a nl.gov > 3) Since I want to set PC type to lu for field 0, and I want to use MUMPS > for parallel LU, where do I set the submatrix type to MATAIJMUMPS ? In this > case, will a second copy of the submatrix be generated - one of type MUMPS > for the PC and the other of the original MATAIJ type for the KSP ? I will have to check. However if we are consistent, then it should be -field_split_0_mat_type aijmumps Can these options be set inside the code, instead of on the command line ? > 4) How is the PC applied when I do PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE ? It is just the composition of the preconditioners, which is what you want here. Matt > Thanks > > Rgds, > Amit > > > > > Barry Smith > ov> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: Multilevel solver > > > 04/22/2008 10:08 > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > Amit, > > Using a a PCSHELL should be fine (it can be used with GMRES), > my guess is there is a memory corruption error somewhere that is > causing the crash. This could be tracked down with www.valgrind.com > > Another way to you could implement this is with some very recent > additions I made to PCFIELDSPLIT that are in petsc-dev > (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) > With this you would chose > PCSetType(pc,PCFIELDSPLIT > PCFieldSplitSetIS(pc,is1 > PCFieldSplitSetIS(pc,is2 > PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > to use LU on A11 use the command line options > -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly > and SOR on A22 > -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - > fieldsplit_1_pc_sor_lits where > is the number of iterations you want to use block A22 > > is1 is the IS that contains the indices for all the vector entries in > the 1 block while is2 is all indices in the > vector for the 2 block. You can use ISCreateGeneral() to create these. > > Probably it is easiest just to try this out. > > Barry > > > On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > > > > Hi, > > > > I am trying to implement a multilevel method for an EM problem. The > > reference is : "Comparison of hierarchical basis functions for > > efficient > > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > > IET > > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > > > Here is the summary: > > > > The matrix equation Ax=b is solved using GMRES with a multilevel > > pre-conditioner. A has a block structure. > > > > A11 A12 * x1 = b1 > > A21 A22 x2 b2 > > > > A11 is mxm and A33 is nxn, where m is not equal to n. > > > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > > superLU or > > MUMPS) > > > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > > a SOR > > solver or a parallel LU) > > > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > > > This gives the approximate solution to > > > > A11 A12 * e1 = b1 > > A21 A22 e2 b2 > > > > and is used as the pre-conditioner for the GMRES. > > > > > > Which PetSc method can implement this pre-conditioner ? 
I tried a > > PCSHELL > > type PC. With Hong's help, I also got the parallel LU to work > > withSuperLU/MUMPS. My program runs successfully on multiple > > processes on a > > single machine. But when I submit the program over multiple > > machines, I get > > a crash in the PCApply routine after several GMRES iterations. I > > think this > > has to do with using PCSHELL with GMRES (which is not a good idea). Is > > there a different way to implement this ? Does this resemble the usage > > pattern of one of the AMG preconditioners ? > > > > > > Thanks > > > > Rgds, > > Amit > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From knepley at gmail.com Thu Apr 24 11:28:42 2008 From: knepley at gmail.com (Matthew Knepley) Date: Thu, 24 Apr 2008 11:28:42 -0500 Subject: Multilevel solver In-Reply-To: References: Message-ID: On Thu, Apr 24, 2008 at 11:07 AM, wrote: > > "Matthew Knepley" > > m> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: Multilevel solver > > > 04/24/2008 10:19 > AM > > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > > > > > > 3) Since I want to set PC type to lu for field 0, and I want to use > MUMPS > > for parallel LU, where do I set the submatrix type to MATAIJMUMPS ? In > this > > case, will a second copy of the submatrix be generated - one of type > MUMPS > > for the PC and the other of the original MATAIJ type for the KSP ? > > I will have to check. However if we are consistent, then it should be > > -field_split_0_mat_type aijmumps > > > Can these options be set inside the code, instead of on the command line ? Yes, however it becomes more complicated. You must pull these objects out of the FieldSplitPC (not that hard), but you must also be careful to set the values after options are read in, but before higher level things are initialized (like the outer solver), so it can be somewhat delicate. I would not advise hardcoding things which you are likely to change based upon architecture, problem, etc. However, if you want to, the easiest way is to use PetscSetOption(). Matt > > 4) How is the PC applied when I do PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > ? > > It is just the composition of the preconditioners, which is what you want > here. > > Matt > > > Thanks > > > > Rgds, > > Amit > > > > > > > > > > Barry Smith > > > ov> > To > > Sent by: petsc-users at mcs.anl.gov > > owner-petsc-users > cc > > @mcs.anl.gov > > No Phone Info > Subject > > Available Re: Multilevel solver > > > > > > 04/22/2008 10:08 > > PM > > > > > > Please respond to > > petsc-users at mcs.a > > nl.gov > > > > > > > > > > > > > > Amit, > > > > Using a a PCSHELL should be fine (it can be used with GMRES), > > my guess is there is a memory corruption error somewhere that is > > causing the crash. 
This could be tracked down with www.valgrind.com > > > > Another way to you could implement this is with some very recent > > additions I made to PCFIELDSPLIT that are in petsc-dev > > (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) > > With this you would chose > > PCSetType(pc,PCFIELDSPLIT > > PCFieldSplitSetIS(pc,is1 > > PCFieldSplitSetIS(pc,is2 > > PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > > to use LU on A11 use the command line options > > -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly > > and SOR on A22 > > -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - > > fieldsplit_1_pc_sor_lits where > > is the number of iterations you want to use block A22 > > > > is1 is the IS that contains the indices for all the vector entries in > > the 1 block while is2 is all indices in the > > vector for the 2 block. You can use ISCreateGeneral() to create these. > > > > Probably it is easiest just to try this out. > > > > Barry > > > > > > On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > > > > > > > Hi, > > > > > > I am trying to implement a multilevel method for an EM problem. The > > > reference is : "Comparison of hierarchical basis functions for > > > efficient > > > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > > > IET > > > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > > > > > Here is the summary: > > > > > > The matrix equation Ax=b is solved using GMRES with a multilevel > > > pre-conditioner. A has a block structure. > > > > > > A11 A12 * x1 = b1 > > > A21 A22 x2 b2 > > > > > > A11 is mxm and A33 is nxn, where m is not equal to n. > > > > > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > > > superLU or > > > MUMPS) > > > > > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > > > a SOR > > > solver or a parallel LU) > > > > > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > > > > > This gives the approximate solution to > > > > > > A11 A12 * e1 = b1 > > > A21 A22 e2 b2 > > > > > > and is used as the pre-conditioner for the GMRES. > > > > > > > > > Which PetSc method can implement this pre-conditioner ? I tried a > > > PCSHELL > > > type PC. With Hong's help, I also got the parallel LU to work > > > withSuperLU/MUMPS. My program runs successfully on multiple > > > processes on a > > > single machine. But when I submit the program over multiple > > > machines, I get > > > a crash in the PCApply routine after several GMRES iterations. I > > > think this > > > has to do with using PCSHELL with GMRES (which is not a good idea). Is > > > there a different way to implement this ? Does this resemble the usage > > > pattern of one of the AMG preconditioners ? > > > > > > > > > Thanks > > > > > > Rgds, > > > Amit > > > > > > > > > > > > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From Amit.Itagi at seagate.com Thu Apr 24 12:01:44 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Thu, 24 Apr 2008 13:01:44 -0400 Subject: Multilevel solver In-Reply-To: Message-ID: Matt, So putting eveything together, I wrote this simple 4x4 example (to be run on two processes). 
========================================================================================= #include #include #include #include "petsc.h" #include "petscmat.h" #include "petscvec.h" #include "petscksp.h" using namespace std; int main( int argc, char *argv[] ) { int rank, size; Mat A; PetscErrorCode ierr; int nrow, ncol, loc; Vec x, b; KSP solver; PC prec; IS is1, is2; PetscScalar val; // Matrix dimensions nrow=4; ncol=4; // Number of non-zeros in each row int d_nnz1[2], d_nnz2[2], o_nnz1[2],o_nnz2[2]; d_nnz1[0]=2; o_nnz1[0]=2; d_nnz1[1]=2; o_nnz1[1]=2; d_nnz2[0]=2; o_nnz2[0]=2; d_nnz2[1]=2; o_nnz2[1]=2; ierr=PetscInitialize(&argc,&argv,PETSC_NULL,PETSC_NULL); CHKERRQ(ierr); ierr=MPI_Comm_size(PETSC_COMM_WORLD,&size); CHKERRQ(ierr); ierr=MPI_Comm_rank(PETSC_COMM_WORLD,&rank); CHKERRQ(ierr); // Matrix assembly if(rank==0) { MatCreateMPIAIJ(PETSC_COMM_WORLD,2,2,4,4,0,d_nnz1,0,o_nnz1,&A); val=complex(2.0,3.0); ierr=MatSetValue(A,0,0,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(5.0,-1.0); ierr=MatSetValue(A,0,1,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,2.0); ierr=MatSetValue(A,0,2,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,-1.0); ierr=MatSetValue(A,0,3,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(5.0,-1.0); ierr=MatSetValue(A,1,0,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(2.0,0.0); ierr=MatSetValue(A,1,1,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(3.0,0.0); ierr=MatSetValue(A,1,2,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,0.0); ierr=MatSetValue(A,1,3,val,INSERT_VALUES);CHKERRQ(ierr); } else if(rank==1) { MatCreateMPIAIJ(PETSC_COMM_WORLD,2,2,4,4,0,d_nnz2,0,o_nnz2,&A); val=complex(1.0,2.0); ierr=MatSetValue(A,2,0,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(3.0,0.0); ierr=MatSetValue(A,2,1,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(0.0,2.0); ierr=MatSetValue(A,2,2,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,0.0); ierr=MatSetValue(A,2,3,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,-1.0); ierr=MatSetValue(A,3,0,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,0.0); ierr=MatSetValue(A,3,1,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,0.0); ierr=MatSetValue(A,3,2,val,INSERT_VALUES);CHKERRQ(ierr); val=complex(2.0,0.0); ierr=MatSetValue(A,3,3,val,INSERT_VALUES);CHKERRQ(ierr); } else { MatCreateMPIAIJ(PETSC_COMM_WORLD,0,0,4,4,0,PETSC_NULL,0,PETSC_NULL,&A); } ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY); CHKERRQ(ierr); ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY); CHKERRQ(ierr); ierr=PetscSynchronizedPrintf(PETSC_COMM_WORLD,"Defined matrix\n"); CHKERRQ(ierr); ierr=PetscSynchronizedFlush(PETSC_COMM_WORLD); CHKERRQ(ierr); ierr=MatView(A,PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr); // Vector assembly // Allocate memory for the vectors if(rank==0) { ierr=VecCreateMPI(PETSC_COMM_WORLD,2,4,&x); CHKERRQ(ierr); val=complex(1.0,0.0); loc=0; ierr=VecSetValues(x,1,&loc,&val,INSERT_VALUES);CHKERRQ(ierr); val=complex(-1.0,0.0); loc=1; ierr=VecSetValues(x,1,&loc,&val,INSERT_VALUES);CHKERRQ(ierr); } else if(rank==1) { ierr=VecCreateMPI(PETSC_COMM_WORLD,2,4,&x); CHKERRQ(ierr); val=complex(1.0,1.0); loc=2; ierr=VecSetValues(x,1,&loc,&val,INSERT_VALUES);CHKERRQ(ierr); val=complex(1.0,0.0); loc=3; ierr=VecSetValues(x,1,&loc,&val,INSERT_VALUES);CHKERRQ(ierr); } else { ierr=VecCreateMPI(PETSC_COMM_WORLD,0,4,&x); CHKERRQ(ierr); } ierr=VecAssemblyBegin(x); CHKERRQ(ierr); ierr=VecAssemblyEnd(x); CHKERRQ(ierr); ierr=PetscSynchronizedPrintf(PETSC_COMM_WORLD,"Defined vector\n"); CHKERRQ(ierr); ierr=PetscSynchronizedFlush(PETSC_COMM_WORLD); CHKERRQ(ierr); 
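// (Remainder of the example: view the vector x, build the two index sets, set up
//  GMRES with a PCFIELDSPLIT preconditioner, solve, view the result, and clean up.)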
ierr=VecView(x,PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr); // Storage for the solution VecDuplicate(x,&b); // Create the Field Split index sets PetscInt idxA[2], idxB[2]; idxA[0]=0; idxA[1]=1; idxB[0]=2; idxB[1]=3; ierr=ISCreateGeneral(PETSC_COMM_WORLD,2,idxA,&is1); CHKERRQ(ierr); ierr=ISCreateGeneral(PETSC_COMM_WORLD,2,idxB,&is2); CHKERRQ(ierr); // Krylov Solver ierr=KSPCreate(PETSC_COMM_WORLD,&solver); CHKERRQ(ierr); ierr=KSPSetOperators(solver,A,A,SAME_NONZERO_PATTERN); CHKERRQ(ierr); ierr=KSPSetType(solver,KSPGMRES); CHKERRQ(ierr); // Pre-conditioner ierr=KSPGetPC(solver,&prec); CHKERRQ(ierr); ierr=PCSetType(prec,PCFIELDSPLIT); CHKERRQ(ierr); ierr=PCFieldSplitSetIS(prec,is1); CHKERRQ(ierr); ierr=PCFieldSplitSetIS(prec,is2); CHKERRQ(ierr); ierr=PCFieldSplitSetType(prec,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE); CHKERRQ(ierr); ierr=PCSetFromOptions(prec); CHKERRQ(ierr); ierr=KSPSetFromOptions(solver); CHKERRQ(ierr); ierr=PetscSynchronizedPrintf(PETSC_COMM_WORLD,"Set KSP/PC options\n"); CHKERRQ(ierr); ierr=PetscSynchronizedFlush(PETSC_COMM_WORLD); CHKERRQ(ierr); ierr=PetscSynchronizedPrintf(PETSC_COMM_WORLD,"Solving the equation\n"); CHKERRQ(ierr); ierr=PetscSynchronizedFlush(PETSC_COMM_WORLD); CHKERRQ(ierr); ierr=KSPSolve(solver,x,b); CHKERRQ(ierr); ierr=PetscSynchronizedPrintf(PETSC_COMM_WORLD,"Solving over\n"); CHKERRQ(ierr); ierr=PetscSynchronizedFlush(PETSC_COMM_WORLD); CHKERRQ(ierr); ierr=PetscSynchronizedPrintf(PETSC_COMM_WORLD,"\nThe solution\n"); CHKERRQ(ierr); ierr=PetscSynchronizedFlush(PETSC_COMM_WORLD); CHKERRQ(ierr); ierr=VecView(b,PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr); // Clean up ierr=VecDestroy(x); CHKERRQ(ierr); ierr=VecDestroy(b); CHKERRQ(ierr); ierr=MatDestroy(A); CHKERRQ(ierr); ierr=KSPDestroy(solver); CHKERRQ(ierr); ierr=ISDestroy(is1); CHKERRQ(ierr); ierr=ISDestroy(is2); CHKERRQ(ierr); ierr=PetscFinalize(); CHKERRQ(ierr); // Finalize return 0; } ============================================================================================== I run the program with mpiexec -np 2 ./main -fieldsplit_0_pc_type sor -fieldsplit_0_ksp_type_preonly -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type_preonly > &err In the KSPSolve step, I get an error. Here is the output of my run. ================================================================================================= Defined matrix Defined matrix row 0: (0, 2 + 3 i) (1, 5 - 1 i) (2, 1 + 2 i) (3, 1 - 1 i) row 1: (0, 5 - 1 i) (1, 2) (2, 3) (3, 1) row 2: (0, 1 + 2 i) (1, 3) (2, 0 + 2 i) (3, 1) row 3: (0, 1 - 1 i) (1, 1) (2, 1) (3, 2) Defined vector Defined vector Process [0] 1 -1 Process [1] 1 + 1 i 1 Set KSP/PC options Set KSP/PC options Solving the equation Solving the equation [1]PETSC ERROR: --------------------- Error Message ------------------------------------ [1]PETSC ERROR: Nonconforming object sizes! [1]PETSC ERROR: Local column sizes 0 do not add up to total number of columns 4! [1]PETSC ERROR: ------------------------------------------------------------------------ [1]PETSC ERROR: Petsc Development Version 2.3.3, Patch 12, unknown HG revision: unknown [1]PETSC ERROR: See docs/changes/index.html for recent updates. [1]PETSC ERROR: See docs/faq.html for hints about trouble shooting. [1]PETSC ERROR: See docs/index.html for manual pages. 
[1]PETSC ERROR: ------------------------------------------------------------------------ [1]PETSC ERROR: ./main on a linux-gnu named tabla by amit Thu Apr 24 13:04:57 2008 [1]PETSC ERROR: Libraries linked from /home/amit/programs/ParEM/petsc-dev/lib [1]PETSC ERROR: Configure run at Wed Apr 23 22:02:21 2008 [1]PETSC ERROR: Configure options --with-scalar-type=complex --with-debugging=yes --with-clanguage=cxx --with-mpi=1 --download-mpich=1 --with-metis=1 --download-metis=1 --with-parmetis=1 --download-parmetis=1 --with-superlu_dist=1 --download-superlu_dist=1 --with-mumps=1 --download-blacs=1 --download-scalapack=1 --download-mumps=1 --with-shared=0 [1]PETSC ERROR: ------------------------------------------------------------------------ [1]PETSC ERROR: MatGetSubMatrix_MPIAIJ() line 2974 in src/mat/impls/aij/mpi/mpiaij.c [1]PETSC ERROR: MatGetSubMatrix() line 5956 in src/mat/interface/matrix.c [1]PETSC ERROR: PCSetUp_FieldSplit() line 177 in src/ksp/pc/impls/fieldsplit/fieldsplit.c [1]PETSC ERROR: PCSetUp() line 788 in src/ksp/pc/interface/precon.c [1]PETSC ERROR: KSPSetUp() line 234 in src/ksp/ksp/interface/itfunc.c [1]PETSC ERROR: KSPSolve() line 350 in src/ksp/ksp/interface/itfunc.c [1]PETSC ERROR: User provided function() line 175 in main.cpp [0]PETSC ERROR: ------------------------------------------------------------------------ [0]PETSC ERROR: Caught signal number 13 Broken Pipe: Likely while reading or writing to a socket [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal[0]PETSC ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to find memory corruption errors [0]PETSC ERROR: likely location of problem given in stack below [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------ [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available, [0]PETSC ERROR: INSTEAD the line number of the start of the function [0]PETSC ERROR: is given. [0]PETSC ERROR: [0] MatSetType line 45 src/mat/interface/matreg.c [0]PETSC ERROR: [0] PCSetUp line 765 src/ksp/pc/interface/precon.c [0]PETSC ERROR: [0] KSPSetUp line 183 src/ksp/ksp/interface/itfunc.c [0]PETSC ERROR: [0] KSPSolve line 305 src/ksp/ksp/interface/itfunc.c [0]PETSC ERROR: --------------------- Error Message ------------------------------------ [0]PETSC ERROR: Signal received! [0]PETSC ERROR: ------------------------------------------------------------------------ [0]PETSC ERROR: Petsc Development Version 2.3.3, Patch 12, unknown HG revision: unknown [0]PETSC ERROR: See docs/changes/index.html for recent updates. [0]PETSC ERROR: See docs/faq.html for hints about trouble shooting. [0]PETSC ERROR: See docs/index.html for manual pages. 
[0]PETSC ERROR: ------------------------------------------------------------------------ [0]PETSC ERROR: ./main on a linux-gnu named tabla by amit Thu Apr 24 13:04:57 2008 [0]PETSC ERROR: Libraries linked from /home/amit/programs/ParEM/petsc-dev/lib [0]PETSC ERROR: Configure run at Wed Apr 23 22:02:21 2008 [0]PETSC ERROR: Configure options --with-scalar-type=complex --with-debugging=yes --with-clanguage=cxx --with-mpi=1 --download-mpich=1 --with-metis=1 --download-metis=1 --with-parmetis=1 --download-parmetis=1 --with-superlu_dist=1 --download-superlu_dist=1 --with-mumps=1 --download-blacs=1 --download-scalapack=1 --download-mumps=1 --with-shared=0 [0]PETSC ERROR: ------------------------------------------------------------------------ [0]PETSC ERROR: User provided function() line 0 in unknown directory unknown file application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0[cli_0]: aborting job: application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0 What am I doing wrong ? Thanks Rgds, Amit "Matthew Knepley" To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: Multilevel solver 04/24/2008 12:28 PM Please respond to petsc-users at mcs.a nl.gov On Thu, Apr 24, 2008 at 11:07 AM, wrote: > > "Matthew Knepley" > > m> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc-users cc > @mcs.anl.gov > No Phone Info Subject > Available Re: Multilevel solver > > > 04/24/2008 10:19 > AM > > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > > > > > > 3) Since I want to set PC type to lu for field 0, and I want to use > MUMPS > > for parallel LU, where do I set the submatrix type to MATAIJMUMPS ? In > this > > case, will a second copy of the submatrix be generated - one of type > MUMPS > > for the PC and the other of the original MATAIJ type for the KSP ? > > I will have to check. However if we are consistent, then it should be > > -field_split_0_mat_type aijmumps > > > Can these options be set inside the code, instead of on the command line ? Yes, however it becomes more complicated. You must pull these objects out of the FieldSplitPC (not that hard), but you must also be careful to set the values after options are read in, but before higher level things are initialized (like the outer solver), so it can be somewhat delicate. I would not advise hardcoding things which you are likely to change based upon architecture, problem, etc. However, if you want to, the easiest way is to use PetscSetOption(). Matt > > 4) How is the PC applied when I do PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > ? > > It is just the composition of the preconditioners, which is what you want > here. > > Matt > > > Thanks > > > > Rgds, > > Amit > > > > > > > > > > Barry Smith > > > ov> > To > > Sent by: petsc-users at mcs.anl.gov > > owner-petsc-users > cc > > @mcs.anl.gov > > No Phone Info > Subject > > Available Re: Multilevel solver > > > > > > 04/22/2008 10:08 > > PM > > > > > > Please respond to > > petsc-users at mcs.a > > nl.gov > > > > > > > > > > > > > > Amit, > > > > Using a a PCSHELL should be fine (it can be used with GMRES), > > my guess is there is a memory corruption error somewhere that is > > causing the crash. 
This could be tracked down with www.valgrind.com > > > > Another way to you could implement this is with some very recent > > additions I made to PCFIELDSPLIT that are in petsc-dev > > (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) > > With this you would chose > > PCSetType(pc,PCFIELDSPLIT > > PCFieldSplitSetIS(pc,is1 > > PCFieldSplitSetIS(pc,is2 > > PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > > to use LU on A11 use the command line options > > -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly > > and SOR on A22 > > -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - > > fieldsplit_1_pc_sor_lits where > > is the number of iterations you want to use block A22 > > > > is1 is the IS that contains the indices for all the vector entries in > > the 1 block while is2 is all indices in the > > vector for the 2 block. You can use ISCreateGeneral() to create these. > > > > Probably it is easiest just to try this out. > > > > Barry > > > > > > On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > > > > > > > > Hi, > > > > > > I am trying to implement a multilevel method for an EM problem. The > > > reference is : "Comparison of hierarchical basis functions for > > > efficient > > > multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, > > > IET > > > Sci. Meas. Technol. 2007, 1(1), pp 48-52. > > > > > > Here is the summary: > > > > > > The matrix equation Ax=b is solved using GMRES with a multilevel > > > pre-conditioner. A has a block structure. > > > > > > A11 A12 * x1 = b1 > > > A21 A22 x2 b2 > > > > > > A11 is mxm and A33 is nxn, where m is not equal to n. > > > > > > Step 1 : Solve A11 * e1 = b1 (parallel LU using > > > superLU or > > > MUMPS) > > > > > > Step 2: Solve A22 * e2 =b2-A21*e1 (might either user > > > a SOR > > > solver or a parallel LU) > > > > > > Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) > > > > > > This gives the approximate solution to > > > > > > A11 A12 * e1 = b1 > > > A21 A22 e2 b2 > > > > > > and is used as the pre-conditioner for the GMRES. > > > > > > > > > Which PetSc method can implement this pre-conditioner ? I tried a > > > PCSHELL > > > type PC. With Hong's help, I also got the parallel LU to work > > > withSuperLU/MUMPS. My program runs successfully on multiple > > > processes on a > > > single machine. But when I submit the program over multiple > > > machines, I get > > > a crash in the PCApply routine after several GMRES iterations. I > > > think this > > > has to do with using PCSHELL with GMRES (which is not a good idea). Is > > > there a different way to implement this ? Does this resemble the usage > > > pattern of one of the AMG preconditioners ? > > > > > > > > > Thanks > > > > > > Rgds, > > > Amit > > > > > > > > > > > > > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which > their experiments lead. > -- Norbert Wiener > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From bsmith at mcs.anl.gov Thu Apr 24 12:13:28 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Thu, 24 Apr 2008 12:13:28 -0500 Subject: Multilevel solver In-Reply-To: References: Message-ID: On Apr 24, 2008, at 8:58 AM, Amit.Itagi at seagate.com wrote: > Barry, > > I have been trying out the PCFIELDSPLIT. I have not yet gotten it to > work. 
> I have some follow up questions which might help solve my problem. > > Consider the simple case of a 4x4 matrix equation being solved on two > processes. I have vector elements 0 and 1 belonging to rank 0, and > elements > 2 and 3 belonging to rank 1. > > 1) For my example, can the index sets have staggered indices i.e. > is1-> 0,2 > and is2->1,3 (each is spans across ranks) ? > > 2) When I provide the -field_split__pc_type option on the command > line, ^^^^^ There is no underscore here because the PC name is fieldsplit and we never split the names into pieces. > > is the index in the same order that the PCFieldSplitSetIS function > called in ? > So if I have PCFieldSplitSetIS(pc,is2) before > PCFieldSplitSetIS(pc,is1), will -field_split_0_... correspond to is2 > and > -field_split_1_... to is1 ? You can list them in any order you want but the order determines how the multiplicative versions are applied. They are applied always started from zero (the first one you put in). > > > 3) Since I want to set PC type to lu for field 0, and I want to use > MUMPS > for parallel LU, where do I set the submatrix type to MATAIJMUMPS ? > In this > case, will a second copy of the submatrix be generated - one of type > MUMPS > for the PC and the other of the original MATAIJ type for the KSP ? This issue will be fixed in a few days. I think you need to start with an entire matrix that is aijmumps and then the subs will also be. Barry > > > 4) How is the PC applied when I do > PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE ? > > Thanks > > Rgds, > Amit > > > > > Barry Smith > > ov> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc- > users cc > @mcs.anl.gov > No Phone Info > Subject > Available Re: Multilevel solver > > > 04/22/2008 10:08 > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > Amit, > > Using a a PCSHELL should be fine (it can be used with GMRES), > my guess is there is a memory corruption error somewhere that is > causing the crash. This could be tracked down with www.valgrind.com > > Another way to you could implement this is with some very recent > additions I made to PCFIELDSPLIT that are in petsc-dev > (http://www-unix.mcs.anl.gov/petsc/petsc-as/developers/index.html) > With this you would chose > PCSetType(pc,PCFIELDSPLIT > PCFieldSplitSetIS(pc,is1 > PCFieldSplitSetIS(pc,is2 > PCFieldSplitSetType(pc,PC_COMPOSITE_SYMMETRIC_MULTIPLICATIVE > to use LU on A11 use the command line options > -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_type preonly > and SOR on A22 > -fieldsplit_1_pc_type sor -fieldsplit_1_ksp_type preonly - > fieldsplit_1_pc_sor_lits where > is the number of iterations you want to use block A22 > > is1 is the IS that contains the indices for all the vector entries in > the 1 block while is2 is all indices in the > vector for the 2 block. You can use ISCreateGeneral() to create these. > > Probably it is easiest just to try this out. > > Barry > > > On Apr 22, 2008, at 8:45 PM, Amit.Itagi at seagate.com wrote: > >> >> Hi, >> >> I am trying to implement a multilevel method for an EM problem. The >> reference is : "Comparison of hierarchical basis functions for >> efficient >> multilevel solvers", P. Ingelstrom, V. Hill and R. Dyczij-Edlinger, >> IET >> Sci. Meas. Technol. 2007, 1(1), pp 48-52. >> >> Here is the summary: >> >> The matrix equation Ax=b is solved using GMRES with a multilevel >> pre-conditioner. A has a block structure. >> >> A11 A12 * x1 = b1 >> A21 A22 x2 b2 >> >> A11 is mxm and A33 is nxn, where m is not equal to n. 
>> >> Step 1 : Solve A11 * e1 = b1 (parallel LU using >> superLU or >> MUMPS) >> >> Step 2: Solve A22 * e2 =b2-A21*e1 (might either user >> a SOR >> solver or a parallel LU) >> >> Step 3: Solve A11* e1 = b1-A12*e2 (parallel LU) >> >> This gives the approximate solution to >> >> A11 A12 * e1 = b1 >> A21 A22 e2 b2 >> >> and is used as the pre-conditioner for the GMRES. >> >> >> Which PetSc method can implement this pre-conditioner ? I tried a >> PCSHELL >> type PC. With Hong's help, I also got the parallel LU to work >> withSuperLU/MUMPS. My program runs successfully on multiple >> processes on a >> single machine. But when I submit the program over multiple >> machines, I get >> a crash in the PCApply routine after several GMRES iterations. I >> think this >> has to do with using PCSHELL with GMRES (which is not a good idea). >> Is >> there a different way to implement this ? Does this resemble the >> usage >> pattern of one of the AMG preconditioners ? >> >> >> Thanks >> >> Rgds, >> Amit >> > > > From bsmith at mcs.anl.gov Thu Apr 24 12:19:58 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Thu, 24 Apr 2008 12:19:58 -0500 Subject: Using PETSc libraries with MS Compute cluster and MS MPI In-Reply-To: <48104EBD.3080104@gmail.com> References: <48104EBD.3080104@gmail.com> Message-ID: Ben, This is an error in our config/configure.py model (and autoconf's as well) that does not properly do the library checks under certain uncommon circumstances, it is not so easy for us to fix since we do not have access to the Microsoft cluster environment. Barry On Apr 24, 2008, at 4:11 AM, Ben Tay wrote: > Hi, > > I'm trying to run my mpi code on the MS Compute cluster which my > school just installed. Unfortunately, it failed without giving any > error msg. I am using just a test example ex2f. > > I read in the MS website that there is no need to use MS MPI to > compile the code or library. > > Anyway, I also tried to compile PETSc with MS MPI but I'm not able > to get pass ./configure. It always complains that there is something > wrong with the MS MPI. > > Is there anyone who has experience in these? > > Thank you very much. > > Regards. > From tribur at vision.ee.ethz.ch Thu Apr 24 16:32:08 2008 From: tribur at vision.ee.ethz.ch (tribur at vision.ee.ethz.ch) Date: Thu, 24 Apr 2008 23:32:08 +0200 Subject: Schur system + MatShell Message-ID: <20080424233208.9m3yc35qg40s0kgk@email.ee.ethz.ch> Dear, > On Tue, 22 Apr 2008, Matthew Knepley wrote: >> Did you verify that the Schur complement matrix was properly >> preallocated before >> assembly? This is the likely source of time. You can run with -info >> and search >> for "malloc" in the output. Preallocation doesn't make sense in case of MATDENSE, does it? > Isn't this using MATDENSE? If that the case - then I think the problem > is due to wrong partitioning - causing communiation during > MatAssembly(). > > -info should clearly show the communication part aswell. > > The fix would be to specify the local partition sizes for this matrix > - and not use PETSC_DECIDE. > > Satish Hm, I think communication during MatAssembly() is necessary, because the global Schur complement is obtained by summing up elements of the local ones. This also means that the sum of the sizes of the local complements is greater than the size of the global Schur complement. 
Therefore, I can not specify the local partition sizes according to the real sizes of the local Schur complements, otherwise the global size was an unrealistic number (in PETSc the global size is ALWAYS the sum of the local ones, isn't it?). Do you know what I mean? Is there another possibility of partitioning? Anyway, I got the thing in MATSHELL-format running, and it's really much faster: In an unstructured mesh of 321493 nodes, partitioned into 7 subdomains with 25577 interface nodes (= size of global Schur complement), e.g., the solving of the Schur complement takes now 3 min instead of 38 min for the assembling+solving using MATDENSE. Thank you again for your help and attention, Kathrin From mossaiby at yahoo.com Thu Apr 24 16:29:01 2008 From: mossaiby at yahoo.com (Farshid Mossaiby) Date: Thu, 24 Apr 2008 14:29:01 -0700 (PDT) Subject: Using PETSc libraries with MS Compute cluster and MS MPI In-Reply-To: Message-ID: <292717.24603.qm@web52209.mail.re2.yahoo.com> Ben, Have you used headers and libraries from Compute Cluster SDK for your configure? I doubt you can run programs compiled with MPICH, for example, on WCCS. I am going to try this in the next step of my work, so please share your findings. Regards, Farshid Mossaiby --- Barry Smith wrote: > > Ben, > > This is an error in our config/configure.py > model (and > autoconf's as well) that does not properly > do the library checks under certain uncommon > circumstances, it is not > so easy for us to fix since we > do not have access to the Microsoft cluster > environment. > > Barry > > On Apr 24, 2008, at 4:11 AM, Ben Tay wrote: > > > Hi, > > > > I'm trying to run my mpi code on the MS Compute > cluster which my > > school just installed. Unfortunately, it failed > without giving any > > error msg. I am using just a test example ex2f. > > > > I read in the MS website that there is no need to > use MS MPI to > > compile the code or library. > > > > Anyway, I also tried to compile PETSc with MS MPI > but I'm not able > > to get pass ./configure. It always complains that > there is something > > wrong with the MS MPI. > > > > Is there anyone who has experience in these? > > > > Thank you very much. > > > > Regards. > > > > ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From recrusader at gmail.com Sun Apr 27 17:19:56 2008 From: recrusader at gmail.com (Yujie) Date: Sun, 27 Apr 2008 15:19:56 -0700 Subject: how to get the explicit matrix of the preconditioner Message-ID: <7ff0ee010804271519p1413701ej2509c14c8143228d@mail.gmail.com> Hi, everyone How to get the matrix "M_{L}^{-1}" in "M_{L}^{-1}Ax=M_{L}^{-1}b". I have checked the function "PCGetOperators()". It only gets the matrix "pmat" used to obtain the preconditioning matrix. thanks a lot. Regards, Yujie -------------- next part -------------- An HTML attachment was scrubbed... URL: From mossaiby at yahoo.com Sun Apr 27 05:01:10 2008 From: mossaiby at yahoo.com (Farshid Mossaiby) Date: Sun, 27 Apr 2008 03:01:10 -0700 (PDT) Subject: Compiling PETSc with Visual Studio 2008 In-Reply-To: Message-ID: <30962.99968.qm@web52206.mail.re2.yahoo.com> Hi, Configure says it cannot make ParMetis with the option --download-parmetis. Is this related to Visual Studio 2008 compiler I use, or something else is wrong? 
Here is the message: Error running make on ParMetis: Could not execute 'cd /home/Administrator/petsc-2.3.3-p12/externalpackages/ParMetis-dev; make clean; make lib; make minstall; make clean': make: *** No rule to make target `clean'. Stop. make: *** No rule to make target `lib'. Stop. make: *** No rule to make target `minstall'. Stop. make: *** No rule to make target `clean'. Stop. ********************************************************************************* Regards, Farshid Mossaiby P.S. I hope this is appropriate place to ask this. ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From dave.mayhem23 at gmail.com Mon Apr 28 02:11:48 2008 From: dave.mayhem23 at gmail.com (Dave May) Date: Mon, 28 Apr 2008 17:11:48 +1000 Subject: how to get the explicit matrix of the preconditioner In-Reply-To: <7ff0ee010804271519p1413701ej2509c14c8143228d@mail.gmail.com> References: <7ff0ee010804271519p1413701ej2509c14c8143228d@mail.gmail.com> Message-ID: <956373f0804280011g251ce489l94b6a51fec88d14e@mail.gmail.com> Hi, You can use PetscErrorCode PCComputeExplicitOperator(PC pc,Mat *mat) See, http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/PC/PCComputeExplicitOperator.html On Mon, Apr 28, 2008 at 8:19 AM, Yujie wrote: > Hi, everyone > > How to get the matrix "M_{L}^{-1}" in "M_{L}^{-1}Ax=M_{L}^{-1}b". I have > checked the function "PCGetOperators()". It only gets the matrix "pmat" used > to obtain the preconditioning matrix. thanks a lot. > > Regards, > Yujie > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Mon Apr 28 07:39:23 2008 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 28 Apr 2008 07:39:23 -0500 Subject: Compiling PETSc with Visual Studio 2008 In-Reply-To: <30962.99968.qm@web52206.mail.re2.yahoo.com> References: <30962.99968.qm@web52206.mail.re2.yahoo.com> Message-ID: On Sun, Apr 27, 2008 at 5:01 AM, Farshid Mossaiby wrote: > Hi, > > Configure says it cannot make ParMetis with the option > --download-parmetis. Is this related to Visual Studio > 2008 compiler I use, or something else is wrong? Here > is the message: > > Error running make on ParMetis: Could not execute 'cd > /home/Administrator/petsc-2.3.3-p12/externalpackages/ParMetis-dev; > make clean; make lib; make minstall; make clean': > make: *** No rule to make target `clean'. Stop. > make: *** No rule to make target `lib'. Stop. > make: *** No rule to make target `minstall'. Stop. > make: *** No rule to make target `clean'. Stop. > ********************************************************************************* > > Regards, > Farshid Mossaiby > > P.S. I hope this is appropriate place to ask this. 1) No, this belongs on petsc-maint 2) We cannot tell anything without the configure.log file Matt > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener From recrusader at gmail.com Mon Apr 28 10:48:05 2008 From: recrusader at gmail.com (Yujie) Date: Mon, 28 Apr 2008 08:48:05 -0700 Subject: how to get the explicit matrix of the preconditioner In-Reply-To: <956373f0804280011g251ce489l94b6a51fec88d14e@mail.gmail.com> References: <7ff0ee010804271519p1413701ej2509c14c8143228d@mail.gmail.com> <956373f0804280011g251ce489l94b6a51fec88d14e@mail.gmail.com> Message-ID: <7ff0ee010804280848n39539f8axeab203197d9bb2d4@mail.gmail.com> Thank you, Dave. I am wondering whether this function is ok to external packages, such as Hypre? thanks a lot. Regards, Yujie On 4/28/08, Dave May wrote: > > Hi, > You can use PetscErrorCode PCComputeExplicitOperator(PC pc,Mat *mat) > > See, > > http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/PC/PCComputeExplicitOperator.html > > > > On Mon, Apr 28, 2008 at 8:19 AM, Yujie wrote: > > > Hi, everyone > > > > How to get the matrix "M_{L}^{-1}" in "M_{L}^{-1}Ax=M_{L}^{-1}b". I have > > checked the function "PCGetOperators()". It only gets the matrix "pmat" used > > to obtain the preconditioning matrix. thanks a lot. > > > > Regards, > > Yujie > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From knepley at gmail.com Mon Apr 28 11:09:25 2008 From: knepley at gmail.com (Matthew Knepley) Date: Mon, 28 Apr 2008 11:09:25 -0500 Subject: how to get the explicit matrix of the preconditioner In-Reply-To: <7ff0ee010804280848n39539f8axeab203197d9bb2d4@mail.gmail.com> References: <7ff0ee010804271519p1413701ej2509c14c8143228d@mail.gmail.com> <956373f0804280011g251ce489l94b6a51fec88d14e@mail.gmail.com> <7ff0ee010804280848n39539f8axeab203197d9bb2d4@mail.gmail.com> Message-ID: On Mon, Apr 28, 2008 at 10:48 AM, Yujie wrote: > Thank you, Dave. I am wondering whether this function is ok to external > packages, such as Hypre? thanks a lot. Yes, since it just calls PCApply() for each basis vector. Matt > Regards, > Yujie > > On 4/28/08, Dave May wrote: > > Hi, > > You can use PetscErrorCode PCComputeExplicitOperator(PC pc,Mat *mat) > > > > See, > > > http://www-unix.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/PC/PCComputeExplicitOperator.html > > > > > > > > > > > > On Mon, Apr 28, 2008 at 8:19 AM, Yujie wrote: > > > > > Hi, everyone > > > > > > How to get the matrix "M_{L}^{-1}" in "M_{L}^{-1}Ax=M_{L}^{-1}b". I have > checked the function "PCGetOperators()". It only gets the matrix "pmat" used > to obtain the preconditioning matrix. thanks a lot. > > > > > > Regards, > > > Yujie > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From Amit.Itagi at seagate.com Mon Apr 28 13:15:34 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Mon, 28 Apr 2008 14:15:34 -0400 Subject: Shared libraries Message-ID: Hi, I am trying to recompile my PetSc installation to have "--with-shared=1". Also, I am specifying "--with-blas-lapack-dir=...". The library building goes through ok. However, in the last step of generating the shared libraries, I get an error about the lapack library. This step is looking for liblapack in /usr/local/lib and that lib was not compiled with -fPIC. I want the program to use the one that I specified in "--with-blas-lapack-dir=...". Which option (in which file) do I need to tweak ? This is in petsc-2.3.3-p8. 
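For reference, the configure invocation is essentially the following (the
blas/lapack directory shown here is only a placeholder for the actual
install location):

    ./config/configure.py --with-shared=1 \
        --with-blas-lapack-dir=/path/to/blas-lapack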
Thanks Rgds, Amit From balay at mcs.anl.gov Mon Apr 28 13:34:23 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 28 Apr 2008 13:34:23 -0500 (CDT) Subject: Shared libraries In-Reply-To: References: Message-ID: On Mon, 28 Apr 2008, Amit.Itagi at seagate.com wrote: > > Hi, > > I am trying to recompile my PetSc installation to have "--with-shared=1". > Also, I am specifying "--with-blas-lapack-dir=...". The library building > goes through ok. However, in the last step of generating the shared > libraries, I get an error about the lapack library. This step is looking > for liblapack in /usr/local/lib and that lib was not compiled with -fPIC. I > want the program to use the one that I specified in > "--with-blas-lapack-dir=...". Which option (in which file) do I need to > tweak ? So the primary question is: you are specifying --with-blas-lapack-dir - but that version of blas-lapack is not picked up by configure. However its going ahead and using blas from /usr/local/lib - which is not what you want? [because its not compiled with -fPIC] Please send the corresponding configure.log to petsc-maint at mcs.anl.gov Satish From balay at mcs.anl.gov Mon Apr 28 13:48:59 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 28 Apr 2008 13:48:59 -0500 (CDT) Subject: Compiling PETSc with Visual Studio 2008 In-Reply-To: <30962.99968.qm@web52206.mail.re2.yahoo.com> References: <30962.99968.qm@web52206.mail.re2.yahoo.com> Message-ID: On Sun, 27 Apr 2008, Farshid Mossaiby wrote: > Hi, > > Configure says it cannot make ParMetis with the option > --download-parmetis. Is this related to Visual Studio > 2008 compiler I use, or something else is wrong? Here > is the message: Most externalpackages are never tested by their original authors with MS compilers. And we have not tried porting them to them. So most of them won't compile - hence --download-packagename might not work. BTW: Currently my test windows box is down - so I can't check if this is supporsed to work with MS compilers. > Error running make on ParMetis: Could not execute 'cd > /home/Administrator/petsc-2.3.3-p12/externalpackages/ParMetis-dev; > make clean; make lib; make minstall; make clean': > make: *** No rule to make target `clean'. Stop. > make: *** No rule to make target `lib'. Stop. > make: *** No rule to make target `minstall'. Stop. > make: *** No rule to make target `clean'. Stop. > ********************************************************************************* > > Regards, > Farshid Mossaiby > > P.S. I hope this is appropriate place to ask this. The appropriate thing is to send us the relavent log files [configure.log etc] - and these can't be sent to petsc-user list [we don't want to flood petsc-users subscribers with multi-megabyte logfiles]. So the appropriate place is petsc-maint - with the relavent logfiles. Satish From balay at mcs.anl.gov Mon Apr 28 22:13:02 2008 From: balay at mcs.anl.gov (Satish Balay) Date: Mon, 28 Apr 2008 22:13:02 -0500 (CDT) Subject: Compiling PETSc with Visual Studio 2008 In-Reply-To: References: <30962.99968.qm@web52206.mail.re2.yahoo.com> Message-ID: On Mon, 28 Apr 2008, Satish Balay wrote: > > On Sun, 27 Apr 2008, Farshid Mossaiby wrote: > > Configure says it cannot make ParMetis with the option > > --download-parmetis. Is this related to Visual Studio 2008 > > compiler I use, or something else is wrong? Here is the message: > > Most externalpackages are never tested by their original authors with > MS compilers. And we have not tried porting them to them. 
> > So most of them won't compile - hence --download-packagename might not > work. > > BTW: Currently my test windows box is down - so I can't check if this > is supporsed to work with MS compilers. Looks like parmetis does compile with MS compilers. Please try try the attached patch. cd petsc-2.3.3 patch -Np1 < parmetis-win.patch rm -rf externalpackage/ParMetis* ./config/configure.py ..... This fix will be in petsc-dev. Satish -------------- next part -------------- diff -r 4041e3152979 python/PETSc/packages/ParMetis.py --- a/python/PETSc/packages/ParMetis.py Mon Apr 21 11:20:42 2008 -0500 +++ b/python/PETSc/packages/ParMetis.py Mon Apr 28 21:15:01 2008 -0500 @@ -5,7 +5,7 @@ class Configure(PETSc.package.Package): def __init__(self, framework): PETSc.package.Package.__init__(self, framework) - self.download = ['hg://petsc.cs.iit.edu/petsc/ParMetis-dev','ftp://ftp.mcs.anl.gov/pub/petsc/externalpackages/ParMetis-dev-p1.tar.gz'] + self.download = ['hg://petsc.cs.iit.edu/petsc/ParMetis-dev','ftp://ftp.mcs.anl.gov/pub/petsc/externalpackages/ParMetis-dev-p2.tar.gz'] self.functions = ['ParMETIS_V3_PartKway'] self.includes = ['parmetis.h'] self.liblist = [['libparmetis.a','libmetis.a']] @@ -27,8 +27,9 @@ installDir = os.path.join(parmetisDir, self.arch.arch) makeinc = os.path.join(parmetisDir,'make.inc') installmakeinc = os.path.join(installDir,'make.inc') - configheader = os.path.join(parmetisDir,'ParMETISLib','configureheader.h') - + metisconfigheader = os.path.join(parmetisDir,'METISLib','configureheader.h') + parmetisconfigheader = os.path.join(parmetisDir,'ParMETISLib','configureheader.h') + # Configure ParMetis if os.path.isfile(makeinc): os.unlink(makeinc) @@ -63,7 +64,8 @@ if not os.path.isfile(installmakeinc) or not (self.getChecksum(installmakeinc) == self.getChecksum(makeinc)): self.framework.log.write('Have to rebuild ParMetis, make.inc != '+installmakeinc+'\n') - self.framework.outputHeader(configheader) + self.framework.outputHeader(metisconfigheader) + self.framework.outputHeader(parmetisconfigheader) try: self.logPrintBox('Compiling & installing Parmetis; this may take several minutes') output = config.base.Configure.executeShellCommand('cd '+parmetisDir+'; make clean; make lib; make minstall; make clean', timeout=2500, log = self.framework.log)[0] From mossaiby at yahoo.com Tue Apr 29 01:59:57 2008 From: mossaiby at yahoo.com (Farshid Mossaiby) Date: Mon, 28 Apr 2008 23:59:57 -0700 (PDT) Subject: Compiling PETSc with Visual Studio 2008 In-Reply-To: Message-ID: <30861.23159.qm@web52208.mail.re2.yahoo.com> Sorry I saw your email after I sent my log. Thanks for your help. Will check and report the results. Best regards, Farshid Mossaiby --- Satish Balay wrote: > On Mon, 28 Apr 2008, Satish Balay wrote: > > > > > On Sun, 27 Apr 2008, Farshid Mossaiby wrote: > > > > Configure says it cannot make ParMetis with the > option > > > --download-parmetis. Is this related to Visual > Studio 2008 > > > compiler I use, or something else is wrong? Here > is the message: > > > > Most externalpackages are never tested by their > original authors with > > MS compilers. And we have not tried porting them > to them. > > > > So most of them won't compile - hence > --download-packagename might not > > work. > > > > BTW: Currently my test windows box is down - so I > can't check if this > > is supporsed to work with MS compilers. > > Looks like parmetis does compile with MS compilers. > Please try try the > attached patch. 
> > cd petsc-2.3.3 > patch -Np1 < parmetis-win.patch > rm -rf externalpackage/ParMetis* > ./config/configure.py ..... > > This fix will be in petsc-dev. > > Satish> diff -r 4041e3152979 > python/PETSc/packages/ParMetis.py > --- a/python/PETSc/packages/ParMetis.py Mon Apr 21 > 11:20:42 2008 -0500 > +++ b/python/PETSc/packages/ParMetis.py Mon Apr 28 > 21:15:01 2008 -0500 > @@ -5,7 +5,7 @@ > class Configure(PETSc.package.Package): > def __init__(self, framework): > PETSc.package.Package.__init__(self, framework) > - self.download = > ['hg://petsc.cs.iit.edu/petsc/ParMetis-dev','ftp://ftp.mcs.anl.gov/pub/petsc/externalpackages/ParMetis-dev-p1.tar.gz'] > + self.download = > ['hg://petsc.cs.iit.edu/petsc/ParMetis-dev','ftp://ftp.mcs.anl.gov/pub/petsc/externalpackages/ParMetis-dev-p2.tar.gz'] > self.functions = ['ParMETIS_V3_PartKway'] > self.includes = ['parmetis.h'] > self.liblist = > [['libparmetis.a','libmetis.a']] > @@ -27,8 +27,9 @@ > installDir = os.path.join(parmetisDir, > self.arch.arch) > makeinc = > os.path.join(parmetisDir,'make.inc') > installmakeinc = > os.path.join(installDir,'make.inc') > - configheader = > os.path.join(parmetisDir,'ParMETISLib','configureheader.h') > - > + metisconfigheader = > os.path.join(parmetisDir,'METISLib','configureheader.h') > + parmetisconfigheader = > os.path.join(parmetisDir,'ParMETISLib','configureheader.h') > + > # Configure ParMetis > if os.path.isfile(makeinc): > os.unlink(makeinc) > @@ -63,7 +64,8 @@ > > if not os.path.isfile(installmakeinc) or not > (self.getChecksum(installmakeinc) == > self.getChecksum(makeinc)): > self.framework.log.write('Have to rebuild > ParMetis, make.inc != '+installmakeinc+'\n') > - self.framework.outputHeader(configheader) > + > self.framework.outputHeader(metisconfigheader) > + > self.framework.outputHeader(parmetisconfigheader) > try: > self.logPrintBox('Compiling & installing > Parmetis; this may take several minutes') > output = > config.base.Configure.executeShellCommand('cd > '+parmetisDir+'; make clean; make lib; make > minstall; make clean', timeout=2500, log = > self.framework.log)[0] > ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From Amit.Itagi at seagate.com Tue Apr 29 08:54:24 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Tue, 29 Apr 2008 09:54:24 -0400 Subject: DA question In-Reply-To: <47FD2297.1010602@gmail.com> Message-ID: Hi, I spent some more time understanding DA's, and how DA's should serve my purpose. Since in the time domain calculation, I will have to scatter from the global vector to the local vector and vice-versa at every iteration step, I have some follow-up questions. 1) Does the scattering involve copying the part stored on the local node as well (i.e. part of the local vector other than the ghost values), or is the local part just accessed by reference ? In the first scenario, this would involve allocating twice the storage for the local part. Also, does the scattering of the local part give a big hit in terms of CPU time ? 2) In the manual, it says "In most cases, several different vectors can share the same communication information (or, in other words, can share a given DA)" and "PETSc currently provides no container for multiple arrays sharing the same distributed array communication; note, however, that the dof parameter handles many cases of interest". I am a bit confused. 
Suppose I have two arrays having the same layout on the regular grid, can I store the first array data on one vector, and the second array data on the second vector (and have a DA with dof=1, instead of a DA with dof=2), and be able to scatter and update the first vector without scattering/updating the second vector ? Thanks Rgds, Amit owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: > Hi Amit, > > Why do you need two staggered grids? I do EM finite difference frequency > domain modeling on a staggered grid using just one DA. Works perfectly fine. > There are some grid points that are not used, but you just set them to zero > and put a 1 on the diagonal of the coefficient matrix. > > > Randy > > > Amit.Itagi at seagate.com wrote: > > Hi Berend, > > > > A detailed explanation of the finite difference scheme is given here : > > > > http://en.wikipedia.org/wiki/Finite-difference_time-domain_method > > > > > > Thanks > > > > Rgds, > > Amit > > > > > > > > > > Berend van Wachem > > > se> To > > Sent by: petsc-users at mcs.anl.gov > > owner-petsc-users cc > > @mcs.anl.gov > > No Phone Info Subject > > Available Re: DA question > > > > > > 04/09/2008 02:59 > > PM > > > > > > Please respond to > > petsc-users at mcs.a > > nl.gov > > > > > > > > > > > > > > Dear Amit, > > > > Could you explain how the two grids are attached? > > I am using multiple DA's for multiple structured grids glued together. > > I've done the gluing with setting up various IS objects. From the > > multiple DA's, one global variable vector is formed. Is that what you > > are looking for? > > > > Best regards, > > > > Berend. > > > > > > Amit.Itagi at seagate.com wrote: > >> Hi, > >> > >> Is it possible to use DA to perform finite differences on two staggered > >> regular grids (as in the electromagnetic finite difference time domain > >> method) ? Surrounding nodes from one grid are used to update the value in > >> the dual grid. In addition, local manipulations need to be done on the > >> nodal values. > >> > >> Thanks > >> > >> Rgds, > >> Amit > >> > > > > > > > From knepley at gmail.com Tue Apr 29 10:54:32 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 29 Apr 2008 10:54:32 -0500 Subject: DA question In-Reply-To: References: <47FD2297.1010602@gmail.com> Message-ID: On Tue, Apr 29, 2008 at 8:54 AM, wrote: > Hi, > > I spent some more time understanding DA's, and how DA's should serve my > purpose. Since in the time domain calculation, I will have to scatter from > the global vector to the local vector and vice-versa at every iteration > step, I have some follow-up questions. > > 1) Does the scattering involve copying the part stored on the local node as > well (i.e. part of the local vector other than the ghost values), or is the > local part just accessed by reference ? In the first scenario, this would No, you get a separate local vector since we reorder to give contiguous access. > involve allocating twice the storage for the local part. Also, does the Yes, however unless you run an explicit code at the limit of memory, this really does not matter. > scattering of the local part give a big hit in terms of CPU time ? Not for these cartesian topologies with small overlap. This is easy to prove. 
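For concreteness, the usual update cycle is just (a sketch; da, the global
vector gvec and the ghosted local vector lvec are assumed to have been
created from the same DA, and error checking is omitted):

    DAGlobalToLocalBegin(da,gvec,INSERT_VALUES,lvec);
    DAGlobalToLocalEnd(da,gvec,INSERT_VALUES,lvec);
    /* stencil work on the ghosted lvec, e.g. through DAVecGetArray() */
    DALocalToGlobal(da,lvec,INSERT_VALUES,gvec);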
> 2) In the manual, it says "In most cases, several different vectors can > share the same communication information (or, in other words, can share a > given DA)" and "PETSc currently provides no container for multiple arrays > sharing the same distributed array communication; note, however, that the > dof parameter handles many cases of interest". I am a bit confused. Suppose > I have two arrays having the same layout on the regular grid, can I store > the first array data on one vector, and the second array data on the second > vector (and have a DA with dof=1, instead of a DA with dof=2), and be able > to scatter and update the first vector without scattering/updating the > second vector ? Yes. You call DAGetGlobalVector() twice, and then when you want one vector updated, call DALocalToGlobal() or DAGlobalToLocal() with that vector. Matt > Thanks > > Rgds, > Amit -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener From Amit.Itagi at seagate.com Tue Apr 29 12:10:29 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Tue, 29 Apr 2008 13:10:29 -0400 Subject: DA question In-Reply-To: Message-ID: Thanks, Matt. Rgds, Amit "Matthew Knepley" To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: DA question 04/29/2008 11:54 AM Please respond to petsc-users at mcs.a nl.gov On Tue, Apr 29, 2008 at 8:54 AM, wrote: > Hi, > > I spent some more time understanding DA's, and how DA's should serve my > purpose. Since in the time domain calculation, I will have to scatter from > the global vector to the local vector and vice-versa at every iteration > step, I have some follow-up questions. > > 1) Does the scattering involve copying the part stored on the local node as > well (i.e. part of the local vector other than the ghost values), or is the > local part just accessed by reference ? In the first scenario, this would No, you get a separate local vector since we reorder to give contiguous access. > involve allocating twice the storage for the local part. Also, does the Yes, however unless you run an explicit code at the limit of memory, this really does not matter. > scattering of the local part give a big hit in terms of CPU time ? Not for these cartesian topologies with small overlap. This is easy to prove. > 2) In the manual, it says "In most cases, several different vectors can > share the same communication information (or, in other words, can share a > given DA)" and "PETSc currently provides no container for multiple arrays > sharing the same distributed array communication; note, however, that the > dof parameter handles many cases of interest". I am a bit confused. Suppose > I have two arrays having the same layout on the regular grid, can I store > the first array data on one vector, and the second array data on the second > vector (and have a DA with dof=1, instead of a DA with dof=2), and be able > to scatter and update the first vector without scattering/updating the > second vector ? Yes. You call DAGetGlobalVector() twice, and then when you want one vector updated, call DALocalToGlobal() or DAGlobalToLocal() with that vector. Matt > Thanks > > Rgds, > Amit -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener From bsmith at mcs.anl.gov Tue Apr 29 12:28:17 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 29 Apr 2008 12:28:17 -0500 Subject: DA question In-Reply-To: References: Message-ID: If you are running a true explicit scheme then you have no need to ever have a "global representation" at each time step. In this case you can use DALocalToLocalBegin() then DALocalToLocalEnd() and pass the same vector in both locations. This will update the ghost points but WILL NOT do any copy of the local data since it is already in the correct locations. Barry On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: > Hi, > > I spent some more time understanding DA's, and how DA's should serve > my > purpose. Since in the time domain calculation, I will have to > scatter from > the global vector to the local vector and vice-versa at every > iteration > step, I have some follow-up questions. > > 1) Does the scattering involve copying the part stored on the local > node as > well (i.e. part of the local vector other than the ghost values), or > is the > local part just accessed by reference ? In the first scenario, this > would > involve allocating twice the storage for the local part. Also, does > the > scattering of the local part give a big hit in terms of CPU time ? > > 2) In the manual, it says "In most cases, several different vectors > can > share the same communication information (or, in other words, can > share a > given DA)" and "PETSc currently provides no container for multiple > arrays > sharing the same distributed array communication; note, however, > that the > dof parameter handles many cases of interest". I am a bit confused. > Suppose > I have two arrays having the same layout on the regular grid, can I > store > the first array data on one vector, and the second array data on the > second > vector (and have a DA with dof=1, instead of a DA with dof=2), and > be able > to scatter and update the first vector without scattering/updating the > second vector ? > > Thanks > > Rgds, > Amit > > owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: > >> Hi Amit, >> >> Why do you need two staggered grids? I do EM finite difference >> frequency >> domain modeling on a staggered grid using just one DA. Works >> perfectly > fine. >> There are some grid points that are not used, but you just set them >> to > zero >> and put a 1 on the diagonal of the coefficient matrix. >> >> >> Randy >> >> >> Amit.Itagi at seagate.com wrote: >>> Hi Berend, >>> >>> A detailed explanation of the finite difference scheme is given >>> here : >>> >>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method >>> >>> >>> Thanks >>> >>> Rgds, >>> Amit >>> >>> >>> >>> > >>> Berend van Wachem > >>> >>> se> > To >>> Sent by: petsc-users at mcs.anl.gov > >>> owner-petsc-users > cc >>> @mcs.anl.gov > >>> No Phone Info > Subject >>> Available Re: DA question > >>> > >>> > >>> 04/09/2008 02:59 > >>> PM > >>> > >>> > >>> Please respond to > >>> petsc-users at mcs.a > >>> nl.gov > >>> > >>> > >>> >>> >>> >>> >>> Dear Amit, >>> >>> Could you explain how the two grids are attached? >>> I am using multiple DA's for multiple structured grids glued >>> together. >>> I've done the gluing with setting up various IS objects. From the >>> multiple DA's, one global variable vector is formed. Is that what >>> you >>> are looking for? >>> >>> Best regards, >>> >>> Berend. 
>>> >>> >>> Amit.Itagi at seagate.com wrote: >>>> Hi, >>>> >>>> Is it possible to use DA to perform finite differences on two > staggered >>>> regular grids (as in the electromagnetic finite difference time >>>> domain >>>> method) ? Surrounding nodes from one grid are used to update the >>>> value > in >>>> the dual grid. In addition, local manipulations need to be done >>>> on the >>>> nodal values. >>>> >>>> Thanks >>>> >>>> Rgds, >>>> Amit >>>> >>> >>> >>> >> > From Amit.Itagi at seagate.com Tue Apr 29 14:27:53 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Tue, 29 Apr 2008 15:27:53 -0400 Subject: DA question In-Reply-To: Message-ID: Barry, Can this be achieved using SDA ? I am working with regular arrays, and doing only explicit updates. Thanks Rgds, Amit owner-petsc-users at mcs.anl.gov wrote on 04/29/2008 01:28:17 PM: > > If you are running a true explicit scheme then you have no need > to ever have a "global representation" at each time step. In this > case you can use DALocalToLocalBegin() then DALocalToLocalEnd() > and pass the same vector in both locations. This will update the ghost > points but WILL NOT do any copy of the local data since it is already > in the correct locations. > > Barry > > On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: > > > Hi, > > > > I spent some more time understanding DA's, and how DA's should serve > > my > > purpose. Since in the time domain calculation, I will have to > > scatter from > > the global vector to the local vector and vice-versa at every > > iteration > > step, I have some follow-up questions. > > > > 1) Does the scattering involve copying the part stored on the local > > node as > > well (i.e. part of the local vector other than the ghost values), or > > is the > > local part just accessed by reference ? In the first scenario, this > > would > > involve allocating twice the storage for the local part. Also, does > > the > > scattering of the local part give a big hit in terms of CPU time ? > > > > 2) In the manual, it says "In most cases, several different vectors > > can > > share the same communication information (or, in other words, can > > share a > > given DA)" and "PETSc currently provides no container for multiple > > arrays > > sharing the same distributed array communication; note, however, > > that the > > dof parameter handles many cases of interest". I am a bit confused. > > Suppose > > I have two arrays having the same layout on the regular grid, can I > > store > > the first array data on one vector, and the second array data on the > > second > > vector (and have a DA with dof=1, instead of a DA with dof=2), and > > be able > > to scatter and update the first vector without scattering/updating the > > second vector ? > > > > Thanks > > > > Rgds, > > Amit > > > > owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: > > > >> Hi Amit, > >> > >> Why do you need two staggered grids? I do EM finite difference > >> frequency > >> domain modeling on a staggered grid using just one DA. Works > >> perfectly > > fine. > >> There are some grid points that are not used, but you just set them > >> to > > zero > >> and put a 1 on the diagonal of the coefficient matrix. 
> >> > >> > >> Randy > >> > >> > >> Amit.Itagi at seagate.com wrote: > >>> Hi Berend, > >>> > >>> A detailed explanation of the finite difference scheme is given > >>> here : > >>> > >>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method > >>> > >>> > >>> Thanks > >>> > >>> Rgds, > >>> Amit > >>> > >>> > >>> > >>> > > > >>> Berend van Wachem > > > >>> > > >>> se> > > To > >>> Sent by: petsc-users at mcs.anl.gov > > > >>> owner-petsc-users > > cc > >>> @mcs.anl.gov > > > >>> No Phone Info > > Subject > >>> Available Re: DA question > > > >>> > > > >>> > > > >>> 04/09/2008 02:59 > > > >>> PM > > > >>> > > > >>> > > > >>> Please respond to > > > >>> petsc-users at mcs.a > > > >>> nl.gov > > > >>> > > > >>> > > > >>> > >>> > >>> > >>> > >>> Dear Amit, > >>> > >>> Could you explain how the two grids are attached? > >>> I am using multiple DA's for multiple structured grids glued > >>> together. > >>> I've done the gluing with setting up various IS objects. From the > >>> multiple DA's, one global variable vector is formed. Is that what > >>> you > >>> are looking for? > >>> > >>> Best regards, > >>> > >>> Berend. > >>> > >>> > >>> Amit.Itagi at seagate.com wrote: > >>>> Hi, > >>>> > >>>> Is it possible to use DA to perform finite differences on two > > staggered > >>>> regular grids (as in the electromagnetic finite difference time > >>>> domain > >>>> method) ? Surrounding nodes from one grid are used to update the > >>>> value > > in > >>>> the dual grid. In addition, local manipulations need to be done > >>>> on the > >>>> nodal values. > >>>> > >>>> Thanks > >>>> > >>>> Rgds, > >>>> Amit > >>>> > >>> > >>> > >>> > >> > > > From knepley at gmail.com Tue Apr 29 14:39:19 2008 From: knepley at gmail.com (Matthew Knepley) Date: Tue, 29 Apr 2008 14:39:19 -0500 Subject: DA question In-Reply-To: References: Message-ID: On Tue, Apr 29, 2008 at 2:27 PM, wrote: > Barry, > > Can this be achieved using SDA ? I am working with regular arrays, and > doing only explicit updates. What is SDA? Barry's point is that, if no solve is done (as in your case), no global system or global vectors need to be formed. You can use LocalToLocal calls to keep gohst points synchronized. Matt > Thanks > > Rgds, > Amit > > owner-petsc-users at mcs.anl.gov wrote on 04/29/2008 01:28:17 PM: > > > > > If you are running a true explicit scheme then you have no need > > to ever have a "global representation" at each time step. In this > > case you can use DALocalToLocalBegin() then DALocalToLocalEnd() > > and pass the same vector in both locations. This will update the ghost > > points but WILL NOT do any copy of the local data since it is already > > in the correct locations. > > > > Barry > > > > On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: > > > > > Hi, > > > > > > I spent some more time understanding DA's, and how DA's should serve > > > my > > > purpose. Since in the time domain calculation, I will have to > > > scatter from > > > the global vector to the local vector and vice-versa at every > > > iteration > > > step, I have some follow-up questions. > > > > > > 1) Does the scattering involve copying the part stored on the local > > > node as > > > well (i.e. part of the local vector other than the ghost values), or > > > is the > > > local part just accessed by reference ? In the first scenario, this > > > would > > > involve allocating twice the storage for the local part. Also, does > > > the > > > scattering of the local part give a big hit in terms of CPU time ? 
> > > > > > 2) In the manual, it says "In most cases, several different vectors > > > can > > > share the same communication information (or, in other words, can > > > share a > > > given DA)" and "PETSc currently provides no container for multiple > > > arrays > > > sharing the same distributed array communication; note, however, > > > that the > > > dof parameter handles many cases of interest". I am a bit confused. > > > Suppose > > > I have two arrays having the same layout on the regular grid, can I > > > store > > > the first array data on one vector, and the second array data on the > > > second > > > vector (and have a DA with dof=1, instead of a DA with dof=2), and > > > be able > > > to scatter and update the first vector without scattering/updating the > > > second vector ? > > > > > > Thanks > > > > > > Rgds, > > > Amit > > > > > > owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: > > > > > >> Hi Amit, > > >> > > >> Why do you need two staggered grids? I do EM finite difference > > >> frequency > > >> domain modeling on a staggered grid using just one DA. Works > > >> perfectly > > > fine. > > >> There are some grid points that are not used, but you just set them > > >> to > > > zero > > >> and put a 1 on the diagonal of the coefficient matrix. > > >> > > >> > > >> Randy > > >> > > >> > > >> Amit.Itagi at seagate.com wrote: > > >>> Hi Berend, > > >>> > > >>> A detailed explanation of the finite difference scheme is given > > >>> here : > > >>> > > >>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method > > >>> > > >>> > > >>> Thanks > > >>> > > >>> Rgds, > > >>> Amit > > >>> > > >>> > > >>> > > >>> > > > > > >>> Berend van Wachem > > > > > >>> > > > > >>> se> > > > To > > >>> Sent by: petsc-users at mcs.anl.gov > > > > > >>> owner-petsc-users > > > cc > > >>> @mcs.anl.gov > > > > > >>> No Phone Info > > > Subject > > >>> Available Re: DA question > > > > > >>> > > > > > >>> > > > > > >>> 04/09/2008 02:59 > > > > > >>> PM > > > > > >>> > > > > > >>> > > > > > >>> Please respond to > > > > > >>> petsc-users at mcs.a > > > > > >>> nl.gov > > > > > >>> > > > > > >>> > > > > > >>> > > >>> > > >>> > > >>> > > >>> Dear Amit, > > >>> > > >>> Could you explain how the two grids are attached? > > >>> I am using multiple DA's for multiple structured grids glued > > >>> together. > > >>> I've done the gluing with setting up various IS objects. From the > > >>> multiple DA's, one global variable vector is formed. Is that what > > >>> you > > >>> are looking for? > > >>> > > >>> Best regards, > > >>> > > >>> Berend. > > >>> > > >>> > > >>> Amit.Itagi at seagate.com wrote: > > >>>> Hi, > > >>>> > > >>>> Is it possible to use DA to perform finite differences on two > > > staggered > > >>>> regular grids (as in the electromagnetic finite difference time > > >>>> domain > > >>>> method) ? Surrounding nodes from one grid are used to update the > > >>>> value > > > in > > >>>> the dual grid. In addition, local manipulations need to be done > > >>>> on the > > >>>> nodal values. > > >>>> > > >>>> Thanks > > >>>> > > >>>> Rgds, > > >>>> Amit > > >>>> > > >>> > > >>> > > >>> > > >> > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. 
-- Norbert Wiener From bsmith at mcs.anl.gov Tue Apr 29 14:51:42 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Tue, 29 Apr 2008 14:51:42 -0500 Subject: DA question In-Reply-To: References: Message-ID: On Apr 29, 2008, at 2:27 PM, Amit.Itagi at seagate.com wrote: > Barry, > > Can this be achieved using SDA ? I am working with regular arrays, and > doing only explicit updates. Yes. SDA actually uses the DA, it just hides the Vec concept from the user. Barry > > > Thanks > > Rgds, > Amit > > owner-petsc-users at mcs.anl.gov wrote on 04/29/2008 01:28:17 PM: > >> >> If you are running a true explicit scheme then you have no need >> to ever have a "global representation" at each time step. In this >> case you can use DALocalToLocalBegin() then DALocalToLocalEnd() >> and pass the same vector in both locations. This will update the >> ghost >> points but WILL NOT do any copy of the local data since it is already >> in the correct locations. >> >> Barry >> >> On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: >> >>> Hi, >>> >>> I spent some more time understanding DA's, and how DA's should serve >>> my >>> purpose. Since in the time domain calculation, I will have to >>> scatter from >>> the global vector to the local vector and vice-versa at every >>> iteration >>> step, I have some follow-up questions. >>> >>> 1) Does the scattering involve copying the part stored on the local >>> node as >>> well (i.e. part of the local vector other than the ghost values), or >>> is the >>> local part just accessed by reference ? In the first scenario, this >>> would >>> involve allocating twice the storage for the local part. Also, does >>> the >>> scattering of the local part give a big hit in terms of CPU time ? >>> >>> 2) In the manual, it says "In most cases, several different vectors >>> can >>> share the same communication information (or, in other words, can >>> share a >>> given DA)" and "PETSc currently provides no container for multiple >>> arrays >>> sharing the same distributed array communication; note, however, >>> that the >>> dof parameter handles many cases of interest". I am a bit confused. >>> Suppose >>> I have two arrays having the same layout on the regular grid, can I >>> store >>> the first array data on one vector, and the second array data on the >>> second >>> vector (and have a DA with dof=1, instead of a DA with dof=2), and >>> be able >>> to scatter and update the first vector without scattering/updating >>> the >>> second vector ? >>> >>> Thanks >>> >>> Rgds, >>> Amit >>> >>> owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: >>> >>>> Hi Amit, >>>> >>>> Why do you need two staggered grids? I do EM finite difference >>>> frequency >>>> domain modeling on a staggered grid using just one DA. Works >>>> perfectly >>> fine. >>>> There are some grid points that are not used, but you just set them >>>> to >>> zero >>>> and put a 1 on the diagonal of the coefficient matrix. 
>>>> >>>> >>>> Randy >>>> >>>> >>>> Amit.Itagi at seagate.com wrote: >>>>> Hi Berend, >>>>> >>>>> A detailed explanation of the finite difference scheme is given >>>>> here : >>>>> >>>>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method >>>>> >>>>> >>>>> Thanks >>>>> >>>>> Rgds, >>>>> Amit >>>>> >>>>> >>>>> >>>>> >>> >>>>> Berend van Wachem >>> >>>>> >> >>>>> se> >>> To >>>>> Sent by: petsc-users at mcs.anl.gov >>> >>>>> owner-petsc-users >>> cc >>>>> @mcs.anl.gov >>> >>>>> No Phone Info >>> Subject >>>>> Available Re: DA question >>> >>>>> >>> >>>>> >>> >>>>> 04/09/2008 02:59 >>> >>>>> PM >>> >>>>> >>> >>>>> >>> >>>>> Please respond to >>> >>>>> petsc-users at mcs.a >>> >>>>> nl.gov >>> >>>>> >>> >>>>> >>> >>>>> >>>>> >>>>> >>>>> >>>>> Dear Amit, >>>>> >>>>> Could you explain how the two grids are attached? >>>>> I am using multiple DA's for multiple structured grids glued >>>>> together. >>>>> I've done the gluing with setting up various IS objects. From the >>>>> multiple DA's, one global variable vector is formed. Is that what >>>>> you >>>>> are looking for? >>>>> >>>>> Best regards, >>>>> >>>>> Berend. >>>>> >>>>> >>>>> Amit.Itagi at seagate.com wrote: >>>>>> Hi, >>>>>> >>>>>> Is it possible to use DA to perform finite differences on two >>> staggered >>>>>> regular grids (as in the electromagnetic finite difference time >>>>>> domain >>>>>> method) ? Surrounding nodes from one grid are used to update the >>>>>> value >>> in >>>>>> the dual grid. In addition, local manipulations need to be done >>>>>> on the >>>>>> nodal values. >>>>>> >>>>>> Thanks >>>>>> >>>>>> Rgds, >>>>>> Amit >>>>>> >>>>> >>>>> >>>>> >>>> >>> >> > From amjad11 at gmail.com Wed Apr 30 01:24:27 2008 From: amjad11 at gmail.com (amjad ali) Date: Wed, 30 Apr 2008 11:24:27 +0500 Subject: PETSC with MPI-GAMMA ?? Message-ID: <428810f20804292324q25cbedanf6cef98153493460@mail.gmail.com> Hello, I read that: Genoa Active Message MAchine (GAMMA ) is a low-latency replacement for TCP/IP on gigabit and is supported for Intel platforms on modern Linux kernels (both 32 and 64 bit). It completely bypasses the Linux network stack to produce record breaking latency figures. (Please see also http://www.opencfd.co.uk/openfoam/parallel1.4.html that tells how OPEN-FOAM is/will-be using GAMMA). Please comment on that can take benefit of GAMMA (if there is??) with PETSc? For example, installing PETSc with GAMMA? regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Amit.Itagi at seagate.com Wed Apr 30 10:33:11 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 30 Apr 2008 11:33:11 -0400 Subject: DA question In-Reply-To: Message-ID: Barry, I tried this out. This serves my purpose nicely. One question : How compatible is PetSc with Blitz++ ? Can I declare the array to be returned by DAVecGetArray to be a Blitz array ? Thanks Rgds, Amit Barry Smith To Sent by: petsc-users at mcs.anl.gov owner-petsc-users cc @mcs.anl.gov No Phone Info Subject Available Re: DA question 04/29/2008 01:28 PM Please respond to petsc-users at mcs.a nl.gov If you are running a true explicit scheme then you have no need to ever have a "global representation" at each time step. In this case you can use DALocalToLocalBegin() then DALocalToLocalEnd() and pass the same vector in both locations. This will update the ghost points but WILL NOT do any copy of the local data since it is already in the correct locations. 
Barry On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: > Hi, > > I spent some more time understanding DA's, and how DA's should serve > my > purpose. Since in the time domain calculation, I will have to > scatter from > the global vector to the local vector and vice-versa at every > iteration > step, I have some follow-up questions. > > 1) Does the scattering involve copying the part stored on the local > node as > well (i.e. part of the local vector other than the ghost values), or > is the > local part just accessed by reference ? In the first scenario, this > would > involve allocating twice the storage for the local part. Also, does > the > scattering of the local part give a big hit in terms of CPU time ? > > 2) In the manual, it says "In most cases, several different vectors > can > share the same communication information (or, in other words, can > share a > given DA)" and "PETSc currently provides no container for multiple > arrays > sharing the same distributed array communication; note, however, > that the > dof parameter handles many cases of interest". I am a bit confused. > Suppose > I have two arrays having the same layout on the regular grid, can I > store > the first array data on one vector, and the second array data on the > second > vector (and have a DA with dof=1, instead of a DA with dof=2), and > be able > to scatter and update the first vector without scattering/updating the > second vector ? > > Thanks > > Rgds, > Amit > > owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: > >> Hi Amit, >> >> Why do you need two staggered grids? I do EM finite difference >> frequency >> domain modeling on a staggered grid using just one DA. Works >> perfectly > fine. >> There are some grid points that are not used, but you just set them >> to > zero >> and put a 1 on the diagonal of the coefficient matrix. >> >> >> Randy >> >> >> Amit.Itagi at seagate.com wrote: >>> Hi Berend, >>> >>> A detailed explanation of the finite difference scheme is given >>> here : >>> >>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method >>> >>> >>> Thanks >>> >>> Rgds, >>> Amit >>> >>> >>> >>> > >>> Berend van Wachem > >>> >>> se> > To >>> Sent by: petsc-users at mcs.anl.gov > >>> owner-petsc-users > cc >>> @mcs.anl.gov > >>> No Phone Info > Subject >>> Available Re: DA question > >>> > >>> > >>> 04/09/2008 02:59 > >>> PM > >>> > >>> > >>> Please respond to > >>> petsc-users at mcs.a > >>> nl.gov > >>> > >>> > >>> >>> >>> >>> >>> Dear Amit, >>> >>> Could you explain how the two grids are attached? >>> I am using multiple DA's for multiple structured grids glued >>> together. >>> I've done the gluing with setting up various IS objects. From the >>> multiple DA's, one global variable vector is formed. Is that what >>> you >>> are looking for? >>> >>> Best regards, >>> >>> Berend. >>> >>> >>> Amit.Itagi at seagate.com wrote: >>>> Hi, >>>> >>>> Is it possible to use DA to perform finite differences on two > staggered >>>> regular grids (as in the electromagnetic finite difference time >>>> domain >>>> method) ? Surrounding nodes from one grid are used to update the >>>> value > in >>>> the dual grid. In addition, local manipulations need to be done >>>> on the >>>> nodal values. 
>>>> >>>> Thanks >>>> >>>> Rgds, >>>> Amit >>>> >>> >>> >>> >> > From bsmith at mcs.anl.gov Wed Apr 30 10:58:30 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 30 Apr 2008 10:58:30 -0500 Subject: DA question In-Reply-To: References: Message-ID: <1D003C34-5E65-4340-98EC-8274AA32BA16@mcs.anl.gov> On Apr 30, 2008, at 10:33 AM, Amit.Itagi at seagate.com wrote: > Barry, > > I tried this out. This serves my purpose nicely. > > One question : How compatible is PetSc with Blitz++ ? Can I declare > the > array to be returned by DAVecGetArray to be a Blitz array ? Likely you would need to use VecGetArray() and then somehow build the Blitz array using the pointer returned and the sizes of the local part of the DA. If you figure out how to do this then maybe we could have a DAVecGetArrayBlitz() Barry > > > Thanks > > Rgds, > Amit > > > > > Barry Smith > > ov> To > Sent by: petsc-users at mcs.anl.gov > owner-petsc- > users cc > @mcs.anl.gov > No Phone Info > Subject > Available Re: DA question > > > 04/29/2008 01:28 > PM > > > Please respond to > petsc-users at mcs.a > nl.gov > > > > > > > > If you are running a true explicit scheme then you have no need > to ever have a "global representation" at each time step. In this > case you can use DALocalToLocalBegin() then DALocalToLocalEnd() > and pass the same vector in both locations. This will update the ghost > points but WILL NOT do any copy of the local data since it is already > in the correct locations. > > Barry > > On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: > >> Hi, >> >> I spent some more time understanding DA's, and how DA's should serve >> my >> purpose. Since in the time domain calculation, I will have to >> scatter from >> the global vector to the local vector and vice-versa at every >> iteration >> step, I have some follow-up questions. >> >> 1) Does the scattering involve copying the part stored on the local >> node as >> well (i.e. part of the local vector other than the ghost values), or >> is the >> local part just accessed by reference ? In the first scenario, this >> would >> involve allocating twice the storage for the local part. Also, does >> the >> scattering of the local part give a big hit in terms of CPU time ? >> >> 2) In the manual, it says "In most cases, several different vectors >> can >> share the same communication information (or, in other words, can >> share a >> given DA)" and "PETSc currently provides no container for multiple >> arrays >> sharing the same distributed array communication; note, however, >> that the >> dof parameter handles many cases of interest". I am a bit confused. >> Suppose >> I have two arrays having the same layout on the regular grid, can I >> store >> the first array data on one vector, and the second array data on the >> second >> vector (and have a DA with dof=1, instead of a DA with dof=2), and >> be able >> to scatter and update the first vector without scattering/updating >> the >> second vector ? >> >> Thanks >> >> Rgds, >> Amit >> >> owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: >> >>> Hi Amit, >>> >>> Why do you need two staggered grids? I do EM finite difference >>> frequency >>> domain modeling on a staggered grid using just one DA. Works >>> perfectly >> fine. >>> There are some grid points that are not used, but you just set them >>> to >> zero >>> and put a 1 on the diagonal of the coefficient matrix. 
>>> >>> >>> Randy >>> >>> >>> Amit.Itagi at seagate.com wrote: >>>> Hi Berend, >>>> >>>> A detailed explanation of the finite difference scheme is given >>>> here : >>>> >>>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method >>>> >>>> >>>> Thanks >>>> >>>> Rgds, >>>> Amit >>>> >>>> >>>> >>>> >> >>>> Berend van Wachem >> >>>> > >>>> se> >> To >>>> Sent by: petsc-users at mcs.anl.gov >> >>>> owner-petsc-users >> cc >>>> @mcs.anl.gov >> >>>> No Phone Info >> Subject >>>> Available Re: DA question >> >>>> >> >>>> >> >>>> 04/09/2008 02:59 >> >>>> PM >> >>>> >> >>>> >> >>>> Please respond to >> >>>> petsc-users at mcs.a >> >>>> nl.gov >> >>>> >> >>>> >> >>>> >>>> >>>> >>>> >>>> Dear Amit, >>>> >>>> Could you explain how the two grids are attached? >>>> I am using multiple DA's for multiple structured grids glued >>>> together. >>>> I've done the gluing with setting up various IS objects. From the >>>> multiple DA's, one global variable vector is formed. Is that what >>>> you >>>> are looking for? >>>> >>>> Best regards, >>>> >>>> Berend. >>>> >>>> >>>> Amit.Itagi at seagate.com wrote: >>>>> Hi, >>>>> >>>>> Is it possible to use DA to perform finite differences on two >> staggered >>>>> regular grids (as in the electromagnetic finite difference time >>>>> domain >>>>> method) ? Surrounding nodes from one grid are used to update the >>>>> value >> in >>>>> the dual grid. In addition, local manipulations need to be done >>>>> on the >>>>> nodal values. >>>>> >>>>> Thanks >>>>> >>>>> Rgds, >>>>> Amit >>>>> >>>> >>>> >>>> >>> >> > > > From Amit.Itagi at seagate.com Wed Apr 30 15:27:57 2008 From: Amit.Itagi at seagate.com (Amit.Itagi at seagate.com) Date: Wed, 30 Apr 2008 16:27:57 -0400 Subject: DA question In-Reply-To: Message-ID: owner-petsc-users at mcs.anl.gov wrote on 04/29/2008 01:28:17 PM: > > If you are running a true explicit scheme then you have no need > to ever have a "global representation" at each time step. In this > case you can use DALocalToLocalBegin() then DALocalToLocalEnd() > and pass the same vector in both locations. This will update the ghost > points but WILL NOT do any copy of the local data since it is already > in the correct locations. Barry, I implemented an explicit scheme using your suggestion. The scheme seems to work. Now I want to output the data to a file (according to the natural ordering of a 3D array). I guess, I would need a global vector for this. Hence, I did DACreateGlobalVector DALocalToGlocal DAGetAO Thus, I have a global vector and the AO. Now, how do access the vector elements in the AO order ? Eventually, I will use PetscFPrintf for writing. Thanks Rgds, Amit > > Barry > > On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: > > > Hi, > > > > I spent some more time understanding DA's, and how DA's should serve > > my > > purpose. Since in the time domain calculation, I will have to > > scatter from > > the global vector to the local vector and vice-versa at every > > iteration > > step, I have some follow-up questions. > > > > 1) Does the scattering involve copying the part stored on the local > > node as > > well (i.e. part of the local vector other than the ghost values), or > > is the > > local part just accessed by reference ? In the first scenario, this > > would > > involve allocating twice the storage for the local part. Also, does > > the > > scattering of the local part give a big hit in terms of CPU time ? 
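
[For reference, the ghost-refresh-in-place pattern for an explicit scheme described above might be sketched as follows: one ghosted local vector, passed as both source and target. The helper name is illustrative, the DA is assumed periodic (DA_XYZPERIODIC) so every owned point has ghost neighbours, and the update formula is a placeholder rather than an actual FDTD stencil.]

    /* Sketch: time stepping on one ghosted local vector; only the ghost
       points move between processes at each step.                        */
    #include "petscda.h"

    PetscErrorCode TimeStep(DA da, Vec vl, PetscInt nsteps)
    {
      PetscScalar ***u;
      PetscInt       i, j, k, step, xs, ys, zs, xm, ym, zm;

      DAGetCorners(da, &xs, &ys, &zs, &xm, &ym, &zm);
      for (step = 0; step < nsteps; step++) {
        /* refresh ghost values only; no copy of the locally owned part */
        DALocalToLocalBegin(da, vl, INSERT_VALUES, vl);
        DALocalToLocalEnd(da, vl, INSERT_VALUES, vl);

        DAVecGetArray(da, vl, &u);
        for (k = zs; k < zs + zm; k++)
          for (j = ys; j < ys + ym; j++)
            for (i = xs; i < xs + xm; i++)
              u[k][j][i] += 0.1*(u[k][j][i+1] - u[k][j][i-1]); /* placeholder */
        DAVecRestoreArray(da, vl, &u);
      }
      return 0;
    }
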
> > > > 2) In the manual, it says "In most cases, several different vectors > > can > > share the same communication information (or, in other words, can > > share a > > given DA)" and "PETSc currently provides no container for multiple > > arrays > > sharing the same distributed array communication; note, however, > > that the > > dof parameter handles many cases of interest". I am a bit confused. > > Suppose > > I have two arrays having the same layout on the regular grid, can I > > store > > the first array data on one vector, and the second array data on the > > second > > vector (and have a DA with dof=1, instead of a DA with dof=2), and > > be able > > to scatter and update the first vector without scattering/updating the > > second vector ? > > > > Thanks > > > > Rgds, > > Amit > > > > owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: > > > >> Hi Amit, > >> > >> Why do you need two staggered grids? I do EM finite difference > >> frequency > >> domain modeling on a staggered grid using just one DA. Works > >> perfectly > > fine. > >> There are some grid points that are not used, but you just set them > >> to > > zero > >> and put a 1 on the diagonal of the coefficient matrix. > >> > >> > >> Randy > >> > >> > >> Amit.Itagi at seagate.com wrote: > >>> Hi Berend, > >>> > >>> A detailed explanation of the finite difference scheme is given > >>> here : > >>> > >>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method > >>> > >>> > >>> Thanks > >>> > >>> Rgds, > >>> Amit > >>> > >>> > >>> > >>> > > > >>> Berend van Wachem > > > >>> > > >>> se> > > To > >>> Sent by: petsc-users at mcs.anl.gov > > > >>> owner-petsc-users > > cc > >>> @mcs.anl.gov > > > >>> No Phone Info > > Subject > >>> Available Re: DA question > > > >>> > > > >>> > > > >>> 04/09/2008 02:59 > > > >>> PM > > > >>> > > > >>> > > > >>> Please respond to > > > >>> petsc-users at mcs.a > > > >>> nl.gov > > > >>> > > > >>> > > > >>> > >>> > >>> > >>> > >>> Dear Amit, > >>> > >>> Could you explain how the two grids are attached? > >>> I am using multiple DA's for multiple structured grids glued > >>> together. > >>> I've done the gluing with setting up various IS objects. From the > >>> multiple DA's, one global variable vector is formed. Is that what > >>> you > >>> are looking for? > >>> > >>> Best regards, > >>> > >>> Berend. > >>> > >>> > >>> Amit.Itagi at seagate.com wrote: > >>>> Hi, > >>>> > >>>> Is it possible to use DA to perform finite differences on two > > staggered > >>>> regular grids (as in the electromagnetic finite difference time > >>>> domain > >>>> method) ? Surrounding nodes from one grid are used to update the > >>>> value > > in > >>>> the dual grid. In addition, local manipulations need to be done > >>>> on the > >>>> nodal values. > >>>> > >>>> Thanks > >>>> > >>>> Rgds, > >>>> Amit > >>>> > >>> > >>> > >>> > >> > > > From bsmith at mcs.anl.gov Wed Apr 30 16:41:55 2008 From: bsmith at mcs.anl.gov (Barry Smith) Date: Wed, 30 Apr 2008 16:41:55 -0500 Subject: DA question In-Reply-To: References: Message-ID: <205D1872-F1BC-4B1A-BF0E-4A6FE057AA79@mcs.anl.gov> On Apr 30, 2008, at 3:27 PM, Amit.Itagi at seagate.com wrote: > > owner-petsc-users at mcs.anl.gov wrote on 04/29/2008 01:28:17 PM: > >> >> If you are running a true explicit scheme then you have no need >> to ever have a "global representation" at each time step. In this >> case you can use DALocalToLocalBegin() then DALocalToLocalEnd() >> and pass the same vector in both locations. 
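
[As background to the AO part of the question above: DAGetAO() returns the mapping between the natural ordering (x fastest, then y, then z) and PETSc's per-process ordering. A sketch of looking up where one natural index lands is below; dof = 1 is assumed, and the helper name and global sizes mx, my are made up for illustration.]

    /* Sketch: map one natural-ordering index to its PETSc-ordering index. */
    #include "petscda.h"

    PetscInt NaturalToPetscIndex(DA da, PetscInt mx, PetscInt my,
                                 PetscInt i, PetscInt j, PetscInt k)
    {
      AO       ao;
      PetscInt idx = i + j*mx + k*mx*my;   /* natural: x fastest, then y, z */

      DAGetAO(da, &ao);                    /* owned by the DA; do not destroy */
      AOApplicationToPetsc(ao, 1, &idx);   /* idx is now the PETSc index      */
      return idx;
    }

[The returned index can only be read with VecGetValues() on the process that owns it, which is why the VecView() route in the reply that follows is the easier way to dump the whole field.]
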
This will update the >> ghost >> points but WILL NOT do any copy of the local data since it is already >> in the correct locations. > > Barry, > > I implemented an explicit scheme using your suggestion. The scheme > seems to > work. Now I want to output the data to a file (according to the > natural > ordering of a 3D array). I guess, I would need a global vector for > this. > Hence, I did > > DACreateGlobalVector > DALocalToGlocal > DAGetAO > > Thus, I have a global vector and the AO. > > Now, how do access the vector elements in the AO order ? Eventually, > I will > use PetscFPrintf for writing. > It would be insane to use PetscFPrintf() to print/save the vector entries; it doesn't scale in larger problems (even moderate sized problems). You can use VecView() on the global vector; it will automatically map the vector entries to the natural ordering so you don't need to worry about the AO. VecView() as various options for presenting the values, as ASCII if you want, or binary or even HDF5 format. Barry > Thanks > > Rgds, > Amit > > > >> >> Barry >> >> On Apr 29, 2008, at 8:54 AM, Amit.Itagi at seagate.com wrote: >> >>> Hi, >>> >>> I spent some more time understanding DA's, and how DA's should serve >>> my >>> purpose. Since in the time domain calculation, I will have to >>> scatter from >>> the global vector to the local vector and vice-versa at every >>> iteration >>> step, I have some follow-up questions. >>> >>> 1) Does the scattering involve copying the part stored on the local >>> node as >>> well (i.e. part of the local vector other than the ghost values), or >>> is the >>> local part just accessed by reference ? In the first scenario, this >>> would >>> involve allocating twice the storage for the local part. Also, does >>> the >>> scattering of the local part give a big hit in terms of CPU time ? >>> >>> 2) In the manual, it says "In most cases, several different vectors >>> can >>> share the same communication information (or, in other words, can >>> share a >>> given DA)" and "PETSc currently provides no container for multiple >>> arrays >>> sharing the same distributed array communication; note, however, >>> that the >>> dof parameter handles many cases of interest". I am a bit confused. >>> Suppose >>> I have two arrays having the same layout on the regular grid, can I >>> store >>> the first array data on one vector, and the second array data on the >>> second >>> vector (and have a DA with dof=1, instead of a DA with dof=2), and >>> be able >>> to scatter and update the first vector without scattering/updating >>> the >>> second vector ? >>> >>> Thanks >>> >>> Rgds, >>> Amit >>> >>> owner-petsc-users at mcs.anl.gov wrote on 04/09/2008 04:09:59 PM: >>> >>>> Hi Amit, >>>> >>>> Why do you need two staggered grids? I do EM finite difference >>>> frequency >>>> domain modeling on a staggered grid using just one DA. Works >>>> perfectly >>> fine. >>>> There are some grid points that are not used, but you just set them >>>> to >>> zero >>>> and put a 1 on the diagonal of the coefficient matrix. 
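
[A sketch of the VecView() route described above: view the DA's global vector through a binary (or ASCII) viewer and the entries are written in natural ordering. The helper name and filename are arbitrary; note that the PETSc release current at the time of this thread passes the viewer itself to PetscViewerDestroy(), while later releases take its address.]

    /* Sketch: write a DA global vector with VecView(); CHKERRQ omitted. */
    #include "petscda.h"

    PetscErrorCode DumpField(DA da, Vec g)
    {
      PetscViewer viewer;

      /* if the data currently lives in a ghosted local vector vl:
         DALocalToGlobal(da, vl, INSERT_VALUES, g);                */

      PetscViewerBinaryOpen(PETSC_COMM_WORLD, "field.bin",
                            FILE_MODE_WRITE, &viewer);
      VecView(g, viewer);              /* stored in natural ordering */
      PetscViewerDestroy(viewer);      /* &viewer in newer releases  */
      return 0;
    }

[For a human-readable dump, PetscViewerASCIIOpen() can be used in place of the binary viewer, though it will not scale to large grids.]
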
>>>> >>>> >>>> Randy >>>> >>>> >>>> Amit.Itagi at seagate.com wrote: >>>>> Hi Berend, >>>>> >>>>> A detailed explanation of the finite difference scheme is given >>>>> here : >>>>> >>>>> http://en.wikipedia.org/wiki/Finite-difference_time-domain_method >>>>> >>>>> >>>>> Thanks >>>>> >>>>> Rgds, >>>>> Amit >>>>> >>>>> >>>>> >>>>> >>> >>>>> Berend van Wachem >>> >>>>> >> >>>>> se> >>> To >>>>> Sent by: petsc-users at mcs.anl.gov >>> >>>>> owner-petsc-users >>> cc >>>>> @mcs.anl.gov >>> >>>>> No Phone Info >>> Subject >>>>> Available Re: DA question >>> >>>>> >>> >>>>> >>> >>>>> 04/09/2008 02:59 >>> >>>>> PM >>> >>>>> >>> >>>>> >>> >>>>> Please respond to >>> >>>>> petsc-users at mcs.a >>> >>>>> nl.gov >>> >>>>> >>> >>>>> >>> >>>>> >>>>> >>>>> >>>>> >>>>> Dear Amit, >>>>> >>>>> Could you explain how the two grids are attached? >>>>> I am using multiple DA's for multiple structured grids glued >>>>> together. >>>>> I've done the gluing with setting up various IS objects. From the >>>>> multiple DA's, one global variable vector is formed. Is that what >>>>> you >>>>> are looking for? >>>>> >>>>> Best regards, >>>>> >>>>> Berend. >>>>> >>>>> >>>>> Amit.Itagi at seagate.com wrote: >>>>>> Hi, >>>>>> >>>>>> Is it possible to use DA to perform finite differences on two >>> staggered >>>>>> regular grids (as in the electromagnetic finite difference time >>>>>> domain >>>>>> method) ? Surrounding nodes from one grid are used to update the >>>>>> value >>> in >>>>>> the dual grid. In addition, local manipulations need to be done >>>>>> on the >>>>>> nodal values. >>>>>> >>>>>> Thanks >>>>>> >>>>>> Rgds, >>>>>> Amit >>>>>> >>>>> >>>>> >>>>> >>>> >>> >> >
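
[Finally, the suggestion quoted repeatedly in this thread (keep the unused points of a single staggered grid and give them identity rows, with a zero right-hand side) could look roughly like the sketch below. The helper name is illustrative, the matrix A is assumed to have been created with DAGetMatrix(), and the "unused point" test is a placeholder since which points are unused is entirely problem dependent.]

    /* Sketch: identity rows for grid points the staggered scheme never uses. */
    #include "petscda.h"

    PetscErrorCode MarkUnusedPoints(DA da, Mat A)
    {
      MatStencil  row;
      PetscScalar one = 1.0;
      PetscInt    i, j, k, xs, ys, zs, xm, ym, zm;

      DAGetCorners(da, &xs, &ys, &zs, &xm, &ym, &zm);
      for (k = zs; k < zs + zm; k++)
        for (j = ys; j < ys + ym; j++)
          for (i = xs; i < xs + xm; i++) {
            if ((i + j + k) % 2 == 0) continue;       /* placeholder test */
            row.i = i; row.j = j; row.k = k; row.c = 0;
            MatSetValuesStencil(A, 1, &row, 1, &row, &one, INSERT_VALUES);
          }
      MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
      MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
      return 0;
    }
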