[petsc-users] one process is lost but others are still running

Hong Zhang hzhang at mcs.anl.gov
Thu Apr 5 20:01:36 CDT 2012


Wen:
Are you using superlu_dist as the parallel direct solver?
I suggest also installing mumps
(this needs an F90 compiler; configure petsc with
'--download-blacs --download-scalapack --download-mumps').
When superlu_dist fails, switch to mumps
(use the runtime option '-pc_factor_mat_solver_package mumps').
If both solvers fail, something might be wrong with your model or code.
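
A minimal sketch of that fallback, in case it helps. This is an illustration
only, not something from the original exchange: the executable name ./myapp,
the mpiexec launcher, the one-hour timeout, and both function names are
hypothetical placeholders. It assumes the application reads its solver
settings from the PETSc options database, and it adds '-ksp_type preonly
-pc_type lu' so that the LU factorization is used as a direct solver.

```python
import subprocess

# Hypothetical placeholders, not taken from the original thread.
EXECUTABLE = "./myapp"
NPROCS = 16  # 4 nodes x 4 processes, as described in the report


def solver_command(package, nprocs=NPROCS, exe=EXECUTABLE):
    """Build an mpiexec command line that selects one direct solver package."""
    return [
        "mpiexec", "-n", str(nprocs), exe,
        "-ksp_type", "preonly",   # apply the factorization only, no Krylov iterations
        "-pc_type", "lu",         # LU factorization as the "preconditioner"
        # Option name as given in this thread; newer PETSc releases
        # call it -pc_factor_mat_solver_type instead.
        "-pc_factor_mat_solver_package", package,
    ]


def run_with_fallback(packages=("superlu_dist", "mumps"), timeout=3600,
                      runner=subprocess.run):
    """Try each package in order; return the first that exits cleanly, else None."""
    for package in packages:
        try:
            result = runner(solver_command(package), timeout=timeout)
        except subprocess.TimeoutExpired:
            # Treat a run that exceeds the timeout as a hang and try the next package.
            continue
        if result.returncode == 0:
            return package
    return None
```

Calling run_with_fallback() returns the name of the first package whose run
finished with exit code 0, or None if every run failed or timed out, which
corresponds to the "both solvers fail" case.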

Hong

>
>
> I reported this several days ago, and I have found that my code hangs inside
> the SuperLU_DIST solve. For testing purposes, I let my code solve the same
> linear system many times. It still hangs at the solve step, but not at the
> same stage every time. My code was distributed over 4 nodes, and each node
> had 4 processes (16 processes in total). Before it gets stuck, one process
> disappears, meaning that I can no longer see it with the top command. The
> other 15 processes are still running. I think those processes might not
> know that one has been lost and just keep waiting for it. It looks like the
> cluster system kills that process without giving me any error information.
> I am pretty sure that the memory is large enough for my calculation (each
> core has 6GB), so I cannot figure out what causes this. I have very little
> knowledge of the cluster system. Could you give me any hints on this
> issue? Is the problem with PETSc, SuperLU_DIST, or the cluster? Thanks.
>
> Regards,
>
> Wen
>