I tried:<br>mpiexec -n 11 -ckpoint-prefix /tmp/app.ckpoint<br>to restart,<br><br>and got this: ..<br>system msg for write_line failure : Broken pipe<br>Error: mpid_nem_ckpt.c:92 "ckpt_restart failed"<br>Other MPI error, error stack:<br>
ckpt_restart(168)........:<br>MPIDI_PG_SetConnInfo(632): PMI_KVS_Put returned -1<br>[cli_0]: write_line error; fd=12 buf=:cmd=put kvsname=kvs_3354_0 key=P0-businesscard value=description#ndsl1$port#42965$ifname#192.168.1.1$<br>
....<br>system msg for write_line failure : Broken pipe<br>
Error: mpid_nem_ckpt.c:92 "ckpt_restart failed"<br>
Other MPI error, error stack:<br>
ckpt_restart(168)........:<br>
MPIDI_PG_SetConnInfo(632): PMI_KVS_Put returned -1<br>
[cli_0]: write_line error; fd=12 buf=:cmd=put kvsname=kvs_3354_0 key=P0-businesscard value=description#ndsl1$port#42965$ifname#192.168.1.11$<br>
<br>Sorry Mr. Darius,<br>Suppose I put the checkpoint prefix in a directory on the shared FS, for example:<br><i>mpiexec -n 11 -ckpoint-interval 5 -ckpoint-prefix /mirror/app.ckpoint ./cg</i><br>(my shared FS is located in /mirror)<br>
<br>Will the checkpoint files be written successfully, given that all 11 nodes/CPUs will write their checkpoint files under the same filename, <i>context</i>, on the shared FS?<br><br>Thank you for your answers.<br><br>Best regards,<br>
Bagus<br>