[mpich-discuss] How to specify the "--save-all" option when using blcr to checkpoint apps in mpich2-1.4.1p1?

Pavan Balaji balaji at mcs.anl.gov
Tue Nov 29 22:29:19 CST 2011


Cc'ing mpich-discuss

Maybe the checkpointing is taking too long, or the image is not being 
sync'ed on the file system. I'll let Darius answer this part, since he 
wrote the code that creates the actual images.
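
In the meantime, one rough way to test the second possibility is to
check, from the node you restart on, that each image file is visible
and has stopped growing. Below is a minimal sketch in Python. The
function names and the polling parameters are made up for
illustration and are not part of mpich2 or BLCR. The image paths have
to be passed in explicitly, since how the files are named is not
covered in this thread. A stable size is only a heuristic: it does
not prove the image is complete, and file system attribute caching
can hide recent writes.

```python
import os
import time


def image_is_stable(path, interval=1.0, checks=3):
    """Return True if `path` exists, is non-empty, and keeps the same
    size across `checks` consecutive polls `interval` seconds apart.

    Hypothetical helper for illustration; not an mpich2 or BLCR API.
    """
    try:
        last = os.stat(path).st_size
    except OSError:
        return False  # file is not visible from this node
    if last == 0:
        return False  # file exists but nothing has been written yet
    for _ in range(checks):
        time.sleep(interval)
        try:
            size = os.stat(path).st_size
        except OSError:
            return False  # file disappeared between polls
        if size != last:
            return False  # size changed: still being written or synced
        last = size
    return True


def unstable_images(paths, interval=1.0, checks=3):
    """Return the subset of `paths` that is missing, empty, or changing."""
    return [p for p in paths if not image_is_stable(p, interval, checks)]
```

Running unstable_images() on the restart node with the list of image
paths should return an empty list before a restart is attempted. A
non-empty result would point at the file system rather than at the
restart itself.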

But please do create a ticket for this. Sounds like a bug to me.

  -- Pavan

On 11/29/2011 11:01 PM, Wei Jiang wrote:
> Yes. I am writing the checkpoint images to a shared file system (under
> my home directory on a cluster).
>
> I will do that, but the program didn't crash; it just hangs there, and
> I have no idea what is happening.
>
> On Mon, Nov 28, 2011 at 11:28 PM, Pavan Balaji <balaji at mcs.anl.gov>
> wrote:
>
>     Hi Wei,
>
>     Are you writing the checkpoint images to some shared file system? If
>     you are seeing a problem in this case, can you file a bug report?
>
>     https://trac.mcs.anl.gov/projects/mpich2/newticket
>
>       -- Pavan
>
>
>     On 11/28/2011 11:08 PM, Wei Jiang wrote:
>
>         Hi,
>
>         I am using the BLCR library with mpich2 to checkpoint/restart my
>         applications. It works well when I restart the apps on the same
>         set of nodes.
>         But when I use a different node (or set of nodes) to restart, the
>         restart process just hangs.
>
>         I looked at the BLCR documentation, and it mentions that the
>         "--save-all" flag should be specified when using a different
>         node (or set of nodes) to re-run the saved apps.
>
>         So I was wondering whether mpich2 provides such a "--save-all"
>         option for its blcr calls when I use mpiexec. If so, how should
>         I specify it?
>
>         Thanks very much!
>
>         Let me know if you need more information.
>
>         Thanks~
>
>         --
>         -- Wei
>
>
>     --
>     Pavan Balaji
>     http://www.mcs.anl.gov/~balaji
>
>
>
>
> --
> -- Wei
>

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

