[mpich-discuss] How to get checkpoint-file

Bagus Jati Santoso bagus.jati at gmail.com
Fri May 21 11:47:25 CDT 2010


I tried to restart with:
mpiexec -n 11 ckpoint-prefix /tmp/app.ckpoint

and got this: ..
system msg for write_line failure : Broken pipe
Error: mpid_nem_ckpt.c:92 "ckpt_restart failed"
Other MPI error, error stack:
ckpt_restart(168)........:
MPIDI_PG_SetConnInfo(632): PMI_KVS_Put returned -1
[cli_0]: write_line error; fd=12 buf=:cmd=put kvsname=kvs_3354_0
key=P0-businesscard
value=description#ndsl1$port#42965$ifname#192.168.1.1$
....
system msg for write_line failure : Broken pipe
Error: mpid_nem_ckpt.c:92 "ckpt_restart failed"
Other MPI error, error stack:
ckpt_restart(168)........:
MPIDI_PG_SetConnInfo(632): PMI_KVS_Put returned -1
[cli_0]: write_line error; fd=12 buf=:cmd=put kvsname=kvs_3354_0
key=P0-businesscard
value=description#ndsl1$port#42965$ifname#192.168.1.11$

Sorry Mr. Darius,
Suppose I put the checkpoint prefix in a directory on the shared FS.
For example:
mpiexec -n 11 -ckpoint-interval 5 -ckpoint-prefix /mirror/app.ckpoint ./cg
(my shared FS is located at /mirror)

Will the checkpoint files be written successfully? All 11 nodes/CPUs will
write their checkpoint files using the same filename, context, on the
shared FS.

Thank you for your answers.

Best regards,
Bagus