[petsc-users] one process is lost but others are still running

Wen Jiang jiangwen84 at gmail.com
Thu Apr 5 19:40:16 CDT 2012


Hi,

I reported this several days ago, and I have since found that my code hangs
inside the SuperLU_DIST solve. For testing purposes, I let my code solve the
same linear system many times. It still hangs at the solve step, but
not at the same stage every time. My code was distributed over 4 nodes with
4 processes on each node (16 processes in total). Before it gets stuck, one
process disappears: I can no longer see it with the top
command. The other 15 processes are still running. I think those processes
might not know that one has been lost and just keep waiting for it. It
looks like the cluster system kills that process without giving me any
error information. I am pretty sure that there is enough memory for
my calculation (each core has 6GB), so I cannot figure out the cause. I
have very little knowledge of the cluster system. Could you give me
any hints on this issue? Is this a problem with PETSc, SuperLU, or the
cluster? Thanks.
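
One way to test the guess that the system killed the process might be to
search the kernel log on the affected node for out-of-memory (OOM) killer
messages. The sketch below is only an illustration under assumptions: the
function name find_oom_kills and the sample log lines are hypothetical, and
the pattern assumes a Linux kernel that logs lines of the form "Out of
memory: Kill process ..." or "Killed process ...", whose exact wording
varies between kernel versions.

```python
import re

# Pattern assumed from typical Linux OOM-killer log lines;
# the exact wording differs between kernel versions.
OOM_PATTERN = re.compile(
    r"(?:Out of memory: )?Kill(?:ed)? process (\d+) \(([^)]+)\)"
)


def find_oom_kills(log_text):
    """Return unique (pid, command) pairs reported by the OOM killer."""
    kills = []
    for line in log_text.splitlines():
        match = OOM_PATTERN.search(line)
        if match:
            pair = (int(match.group(1)), match.group(2))
            if pair not in kills:  # the kernel often logs the same kill twice
                kills.append(pair)
    return kills


# Hypothetical sample log text, not taken from the actual cluster.
sample = """\
[12345.678] Out of memory: Kill process 4321 (my_solver) score 900 or sacrifice child
[12345.679] Killed process 4321 (my_solver) total-vm:7000000kB, anon-rss:6000000kB, file-rss:0kB
[12400.000] eth0: link up
"""

print(find_oom_kills(sample))
```

On a real node the text would have to come from the kernel log (for
example the output of dmesg), which may require help from the cluster
administrators. An empty result would not rule out other causes, such as
a batch-system limit or a crash inside the solver.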

Regards,

Wen

