[petsc-users] Irritating behavior of MUMPS with PETSc

Barry Smith bsmith at mcs.anl.gov
Thu Jun 26 12:37:28 CDT 2014


   -display <xwindowsdisplay> -start_in_debugger -debugger_nodes 1  

     for example, to have the debugger run only on node one. The tricky part is setting up X windows so that the compute node can open the xterm back on your machine.
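
     For example, the full launch line might look something like this (mymachine:0.0 is a placeholder for your own workstation's X display):

        mpirun -np 6 ./ex2 -pc_type lu -pc_factor_mat_solver_package mumps -ksp_type preonly \
           -m 100 -n 100 -start_in_debugger -debugger_nodes 1 -display mymachine:0.0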

    Then, when the program “hangs”, hit control-C in the debugger window, type "where", and look around to see why it might be hanging.

   Barry



On Jun 26, 2014, at 7:37 AM, Gunnar Jansen <jansen.gunnar at gmail.com> wrote:

> Ok, I tried superlu_dist as well. Unfortunately the system seems to hang at more or less the same position. 
> 
> Sadly I cannot check another version of openmpi, since only this version is installed on the cluster at the moment (which it needs to be, because other programmers rely on it for CUDA).
> 
> The -info option told me that the processes were successfully started on both nodes. In the GMRES case this also leads to a clean run of the program.
> 
> The -log_trace output tells me that the problem occurs within the numeric factorization of the matrix (the MatLUFactorNum events begin on every rank but never end):
> 
>     [5] 0.00311184 Event begin: MatLUFactorSym
>     [1] 0.0049789 Event begin: MatLUFactorSym
>     [3] 0.00316596 Event begin: MatLUFactorSym
>     [4] 0.00345397 Event begin: MatLUFactorSym
>     [0] 0.00546789 Event end: MatLUFactorSym
>     [0] 0.0054841 Event begin: MatLUFactorNum
>     [2] 0.00545907 Event end: MatLUFactorSym
>     [2] 0.005476 Event begin: MatLUFactorNum
>     [1] 0.00542402 Event end: MatLUFactorSym
>     [1] 0.00544 Event begin: MatLUFactorNum
>     [4] 0.00369906 Event end: MatLUFactorSym
>     [4] 0.00372505 Event begin: MatLUFactorNum
>     [3] 0.00371909 Event end: MatLUFactorSym
>     [3] 0.00374603 Event begin: MatLUFactorNum
>     [5] 0.00367594 Event end: MatLUFactorSym
>     [5] 0.00370193 Event begin: MatLUFactorNum
> 
> Any hints?
> 
> 
> 
> 
> 2014-06-25 17:17 GMT+02:00 Satish Balay <balay at mcs.anl.gov>:
> Suggest running the non-mumps case with -log_summary [to confirm that
> '-np 6' is actually used in both cases]
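> 
> For example, the same launch line without the mumps options, e.g.:
> 
>    mpirun -np 6 ./ex2 -m 100 -n 100 -log_summary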
> 
> Secondly - you can try a 'release' version of openmpi or mpich and see
> if that works. [I don't see a mention of openmpi-1.9a on the website]
> 
> Also you can try -log_trace to see where it's hanging [or figure out how
> to run code in a debugger on this cluster]. But that might not help in
> figuring out the solution to the hang.
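> 
> If you can ssh to the compute nodes, one manual approach is to attach
> gdb to a stuck rank directly (node name and pid are placeholders):
> 
>    ssh node02                  # one of the two compute nodes
>    gdb -p <pid of hung ex2>    # attach to the hung process
>    (gdb) bt                    # backtrace shows where it is blocked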
> 
> Satish
> 
> On Wed, 25 Jun 2014, Matthew Knepley wrote:
> 
> > On Wed, Jun 25, 2014 at 7:09 AM, Gunnar Jansen <jansen.gunnar at gmail.com>
> > wrote:
> >
> > > You are right about the queuing system. The job is submitted with a PBS
> > > script specifying the number of nodes/processors. On the cluster petsc is
> > > configured in a module environment which sets the appropriate flags for
> > > compilers/rules etc.
> > >
> > > The same exact job script on the same exact nodes with a standard krylov
> > > method does not give any trouble but executes nicely on all processors (and
> > > also give the correct result).
> > >
> > > Therefore my suspicion is a missing flag in the mumps interface. Is this
> > > maybe rather a topic for the mumps-dev team?
> > >
> >
> > I doubt this. The whole point of MPI is to shield code from these details.
> >
> > Can you first try this system with SuperLU_dist?
> 
> >
> >   Thanks,
> >
> >      Matt
> >
> >
> > > Best, Gunnar
> > >
> > >
> > >
> > > 2014-06-25 15:52 GMT+02:00 Dave May <dave.mayhem23 at gmail.com>:
> > >
> > >> This sounds weird.
> > >>
> > >> The launch line you provided doesn't include any information regarding
> > >> how many processors to use (nodes / cores per node). I presume you are
> > >> using a queuing system. My guess is that there could be an issue with
> > >> either (i) your job script, (ii) the configuration of the job scheduler
> > >> on the machine, or (iii) the MPI installation on the machine.
> > >>
> > >> Have you been able to successfully run other petsc (or any mpi) codes
> > >> with the same launch options (2 nodes, 3 procs per node)?
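> > >>
> > >> For reference, a minimal PBS script for that layout might look something
> > >> like this (assuming ex2 sits in the submission directory):
> > >>
> > >>    #PBS -l nodes=2:ppn=3
> > >>    cd $PBS_O_WORKDIR
> > >>    mpirun -np 6 ./ex2 -pc_type lu -pc_factor_mat_solver_package mumps \
> > >>       -ksp_type preonly -m 100 -n 100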
> > >>
> > >> Cheers.
> > >>   Dave
> > >>
> > >>
> > >>
> > >>
> > >> On 25 June 2014 15:44, Gunnar Jansen <jansen.gunnar at gmail.com> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I am trying to solve a problem in parallel with MUMPS as the direct
> > >>> solver. As long as I run the program on only 1 node with 6 processors,
> > >>> everything works fine! But using 2 nodes with 3 processors each gets
> > >>> MUMPS stuck in the factorization.
> > >>>
> > >>> For the purpose of testing I run ex2.c on a resolution of 100x100
> > >>> (which is of course way too small for a direct solver in parallel).
> > >>>
> > >>> The code is run with :
> > >>> mpirun ./ex2 -on_error_abort -pc_type lu -pc_factor_mat_solver_package
> > >>> mumps -ksp_type preonly -log_summary -options_left -m 100 -n 100
> > >>> -mat_mumps_icntl_4 3
> > >>>
> > >>> The petsc-configuration I used is:
> > >>> --prefix=/opt/Petsc/3.4.4.extended --with-mpi=yes
> > >>> --with-mpi-dir=/opt/Openmpi/1.9a/ --with-debugging=no --download-mumps
> > >>>  --download-scalapack --download-parmetis --download-metis
> > >>>
> > >>> Is this common behavior? Or is there an error in the petsc configuration
> > >>> I am using here?
> > >>>
> > >>> Best,
> > >>> Gunnar
> > >>>
> > >>
> > >>
> > >
> >
> >
> >
> 
> 


