[mpich-discuss] BLCR library - Restart Problem

Darius Buntinas buntinas at mcs.anl.gov
Mon Sep 24 15:06:39 CDT 2012


Assuming the variables (like CKPT_PATH) are right, this looks correct.  

What's NUM_NODES set to? How many nodes are listed in $hosts? What's CK_NUM set to?

-d
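
One way to gather the values asked about above is to print them from the same script just before each mpiexec line. The following is only a minimal sketch: the function names count_hosts and report_restart_vars are hypothetical, and it assumes the file named by $hosts has one host entry per line, with blank lines and lines starting with '#' not counting as hosts.

```shell
#!/bin/sh
# Hypothetical helper (not part of MPICH): count the entries in a host file.
# Assumption: one host per line; blank lines and '#' comment lines are skipped.
count_hosts() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | wc -l | tr -d ' '
}

# Hypothetical helper: print the variables used by the run and restart commands.
report_restart_vars() {
    echo "CKPT_PATH=${CKPT_PATH:-<unset>}"
    echo "CK_NUM=${CK_NUM:-<unset>}"
    echo "NUM_NODES=${NUM_NODES:-<unset>}"
    if [ -n "${hosts:-}" ] && [ -r "$hosts" ]; then
        echo "hosts=$hosts ($(count_hosts "$hosts") entries)"
    else
        echo "hosts=${hosts:-<unset>} (not readable)"
    fi
}
```

Calling report_restart_vars once before the initial run and once before the restart makes it easy to compare the two and to spot a value that is unset or differs between them.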


On Sep 24, 2012, at 2:59 PM, Mehmet Kurt wrote:

> Hi,
> 
> Is there anything that might cause this behavior?
> 
> Thank you,
> 
> Mehmet 
> ________________________________________
> From: Mehmet Kurt
> Sent: Thursday, September 20, 2012 2:46 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: RE: [mpich-discuss] BLCR library - Restart Problem
> 
> I use a script file for both the initial run and the restart.
> 
> This is the command I use for the initial run:
> 
> /home/kurt/mpich2-blcr-install/bin/mpiexec -env MPICH_ASYNC_PROGRESS 1 -disable-auto-cleanup -ckpointlib blcr -ckpoint-prefix ${CKPT_PATH} -f $hosts -np ${NUM_NODES}  ./md ${FREQUENCY} ${FAILING_IT}
> 
> And this is the one for the restart:
> 
> /home/kurt/mpich2-blcr-install/bin/mpiexec -env MPICH_ASYNC_PROGRESS 1 -disable-auto-cleanup -ckpointlib blcr -ckpoint-prefix ${CKPT_PATH} -ckpoint-num ${CK_NUM} -f $hosts -np ${NUM_NODES}
> 
> Mehmet
> ______________________
> From: mpich-discuss-bounces at mcs.anl.gov [mpich-discuss-bounces at mcs.anl.gov] on behalf of Darius Buntinas [buntinas at mcs.anl.gov]
> Sent: Thursday, September 20, 2012 2:18 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] BLCR library - Restart Problem
> 
> Can you send us the command lines you used for the initial run and when restarting?
> 
> Thanks,
> -d
> 
> On Sep 20, 2012, at 12:36 PM, Mehmet Kurt wrote:
> 
>> Hello,
>> 
>> I'm using the BLCR checkpointing library with mpich2-1.4.1p1.
>> 
>> Checkpointing my application works without problems, but when I try to restart it on the same set of nodes, nothing happens; it just hangs.
>> From another terminal, I logged in to the node on which mpiexec restarts the application. Running the "top" command there showed
>> a <DEFUNCT> process for my executable.
>> 
>> Any ideas about what can cause this behavior?
>> 
>> Thank you,
>> 
>> Mehmet Can Kurt
>> -----------------------------
>> Graduate Student
>> Computer Engineering Department
>> Ohio State University
>> _______________________________________________
>> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
>> To manage subscription options or unsubscribe:
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 


