<div dir="ltr">Hi,<div><br></div><div>Thanks for the help with codes-storage-server. </div><div>I was trying to run the client-mul-wklds test on the cluster here using the instructions in the wiki. First I tried the exact command from the wiki:</div><div><p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures">./client-mul-wklds --sync=1 </span></p><p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures">--workload-conf-file=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/codes-storage-server/checkpoint-study/workload-files/workload-512.conf </span></p><p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures">--rank-alloc-file=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/codes-storage-server/checkpoint-study/config-files/ --codes-config=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/codes-storage-server/checkpoint-study/config-files/test-checkpoint-dfly-1T-adap.conf </span></p><p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures">--lp-io-dir=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/test-dir </span></p><p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures">--lp-io-use-suffix=1</span></p><p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures"><br></span></p>But the wiki instructions pass a directory as the rank-alloc-file, so, as expected, I got an error.</div><div><br></div><div>Then I tried passing the rank-alloc file checkpoint-study/workload-files/allocations/contiguous/cont-alloc-8832-2048.conf, but I got this error:</div><div><br></div><div><p
style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures">512 instances of workload checkpoint node: 0: error: ../src/networks/model-net/dragonfly-custom.C:630: intra-group file not found </span></p><p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures">Rank 0 [Mon Jul 10 13:58:50 2017] [c6-1c0s6n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0</span></p><p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures">_pmiu_daemon(SIGCHLD): [NID 02348] [c6-1c0s6n2] [Mon Jul 10 13:58:50 2017] PE RANK 0 exit signal Aborted</span></p><p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures">Application 15031648 exit codes: 134</span></p><p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures"><br></span></p>I have set the $HOME_CODES environment variable correctly as well.</div><div>Also, the wiki instructions specify the codes-config file as "test-checkpoint-dfly-1T.conf", but there is no such file there, so I tried both "test-checkpoint-dfly-1T-adap.conf" and "test-checkpoint-dfly-1T-min.conf", with no luck.</div><div><br></div><div><br></div><div>Thanks again for all the help.</div><div><br></div><div>Thanks,</div><div>Harsh</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jul 7, 2017 at 1:58 PM, Mubarak, Misbah <span dir="ltr">&lt;<a href="mailto:mmubarak@anl.gov" target="_blank">mmubarak@anl.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="word-wrap:break-word;color:rgb(0,0,0);font-size:14px;font-family:Calibri,sans-serif">
<div>Hi Harsh,</div>
<div><br>
</div>
<div>You can refer to the instructions at the wiki for running the simulation:</div>
<div><br>
</div>
<div><a href="https://xgitlab.cels.anl.gov/codes/codes-storage-server/wikis/checkpoint-study" target="_blank">https://xgitlab.cels.anl.gov/<wbr>codes/codes-storage-server/<wbr>wikis/checkpoint-study</a></div>
<div><br>
</div>
<div>The right config file to use is test-checkpoint-dfly-1T.conf. The one you mentioned is outdated, so I have removed it to avoid confusion. For a test run, you can reduce the checkpoint_sz parameter in the config (it is currently set to 1 TiB, which takes some time to complete). </div>
<div><br>
</div>
<div>For using the Darshan workload generator, please use Darshan versions 2.x (Darshan 3.0 and up are not supported right now). </div>
<div><br>
</div>
<div>Thanks,</div>
<div>Misbah</div>
<span id="m_-8790270060582505694OLK_SRC_BODY_SECTION">
<div style="font-family:Calibri;font-size:11pt;text-align:left;color:black;BORDER-BOTTOM:medium none;BORDER-LEFT:medium none;PADDING-BOTTOM:0in;PADDING-LEFT:0in;PADDING-RIGHT:0in;BORDER-TOP:#b5c4df 1pt solid;BORDER-RIGHT:medium none;PADDING-TOP:3pt">
<span style="font-weight:bold">From: </span>Harsh Khetawat &lt;<a href="mailto:hkhetaw@ncsu.edu" target="_blank">hkhetaw@ncsu.edu</a>&gt;<br>
<span style="font-weight:bold">Date: </span>Friday, July 7, 2017 at 1:36 PM<br>
<span style="font-weight:bold">To: </span>Misbah Mubarak &lt;<a href="mailto:mmubarak@anl.gov" target="_blank">mmubarak@anl.gov</a>&gt;<br>
<span style="font-weight:bold">Cc: </span>"<a href="mailto:codes-ross-users@lists.mcs.anl.gov" target="_blank">codes-ross-users@lists.mcs.anl.gov</a>" &lt;<a href="mailto:codes-ross-users@lists.mcs.anl.gov" target="_blank">codes-ross-users@lists.mcs.anl.gov</a>&gt;<br>
<span style="font-weight:bold">Subject: </span>Re: [codes-ross-users] Help with Storage Simulation<br>
</div><div><div class="h5">
<div><br>
</div>
<div>
<div>
<div dir="ltr">Hi,
<div><br>
</div>
<div>I have been trying to run the tests/test-checkpoint.c simulation as you suggested, with the intention of building on top of it with Darshan workloads. But when I try to run the simulation (without any changes) with:</div>
<div><br>
</div>
<div>
<p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures">aprun -n4 ./test-checkpoint --sync=3 --codes-config=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/codes-storage-server/tests/conf/test-checkpoint.conf
--lp-io-dir=test-checkpoint-output-ser</span></p>
<p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures"><br>
</span></p>
I get the following error:</div>
<div><br>
</div>
<div>
<p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures">node: 0: error: ../src/util/codes_mapping.c:487: could not find LP with type name "dragonfly_router", did you forget to register the LP?</span></p>
<p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures">node: 1: error: ../src/util/codes_mapping.c:487: could not find LP with type name "dragonfly_router", did you forget to register the LP?</span></p>
<p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo;min-height:13px">
<span style="font-variant-ligatures:no-common-ligatures"></span><br>
</p>
<p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo;min-height:13px">
<span style="font-variant-ligatures:no-common-ligatures"></span><br>
</p>
<p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures">node: 2: error: ../src/util/codes_mapping.c:487: could not find LP with type name "dragonfly_router", did you forget to register the LP?</span></p>
<p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo;min-height:13px">
<span style="font-variant-ligatures:no-common-ligatures"></span><br>
</p>
<p style="margin:0px;font-size:11px;line-height:normal;font-family:Menlo"><span style="font-variant-ligatures:no-common-ligatures">Rank 0 [Fri Jul 7 13:12:24 2017] [c2-1c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0</span></p>
</div>
<div><span style="font-variant-ligatures:no-common-ligatures"><br>
</span></div>
<div>I checked the source code for the test, and the LP seems to be registered properly. Any idea why this could be happening?</div>
<div>Also, would it be the right approach to use the darshan-workload-generator to issue I/O requests to the storage model, instead of the checkpoint I/O that the test case issues right now?</div>
<div><br>
</div>
<div>Thanks,</div>
<div>Harsh Khetawat</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Fri, Jul 7, 2017 at 9:40 AM, Harsh Khetawat <span dir="ltr">
&lt;<a href="mailto:hkhetaw@ncsu.edu" target="_blank">hkhetaw@ncsu.edu</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">Hi Misbah,
<div><br>
</div>
<div>Thank you for the information. This seems perfect for what I want to do. I'll try to set up a simulation that replays Darshan logs and see how it goes.</div>
<div>Thanks again.</div>
<div><br>
</div>
<div>Regards,</div>
<div>Harsh Khetawat</div>
</div>
<div class="m_-8790270060582505694HOEnZb">
<div class="m_-8790270060582505694h5">
<div class="gmail_extra"><br>
<div class="gmail_quote">On Thu, Jul 6, 2017 at 4:02 PM, Mubarak, Misbah <span dir="ltr">
&lt;<a href="mailto:mmubarak@anl.gov" target="_blank">mmubarak@anl.gov</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="word-wrap:break-word;color:rgb(0,0,0);font-size:14px;font-family:Calibri,sans-serif">
<div>Hi Harsh,</div>
<div><br>
</div>
<div>Have you looked at the codes-storage-server model? There is a wiki at:</div>
<div><br>
</div>
<div><a href="https://xgitlab.cels.anl.gov/codes/codes-storage-server/wikis/home" target="_blank">https://xgitlab.cels.anl.gov/c<wbr>odes/codes-storage-server/wiki<wbr>s/home</a></div>
<div><br>
</div>
<div>The wiki has an example case study that models storage nodes on the dragonfly network. The clients (compute nodes) interact with the storage nodes via an API that does RDMA-style communication. You can find an example of the API usage in tests/test-client-checkpoint.c
[look for the function send_req_to_store]. The API itself can be found in codes-store-lp.h.</div>
<div><br>
</div>
<div>The repo can be cloned from:</div>
<div><a href="https://xgitlab.cels.anl.gov/codes/codes-storage-server" target="_blank">https://xgitlab.cels.anl.gov/c<wbr>odes/codes-storage-server</a></div>
<div><br>
</div>
<div>The CODES storage model makes use of the local-storage-model in CODES. Attached is a diagram showing the sequence of steps for a write operation (the read operation is also implemented; you have to specify a flag indicating whether a request is a read or a write).</div>
<div><br>
</div>
<div>Hope that helps. </div>
<div><br>
</div>
<div>Regards,</div>
<div>Misbah</div>
<span id="m_-8790270060582505694m_4082410634883130483m_-5506049454143521579OLK_SRC_BODY_SECTION">
<div style="font-family:Calibri;font-size:11pt;text-align:left;color:black;BORDER-BOTTOM:medium none;BORDER-LEFT:medium none;PADDING-BOTTOM:0in;PADDING-LEFT:0in;PADDING-RIGHT:0in;BORDER-TOP:#b5c4df 1pt solid;BORDER-RIGHT:medium none;PADDING-TOP:3pt">
<span style="font-weight:bold">From: </span>&lt;<a href="mailto:codes-ross-users-bounces@lists.mcs.anl.gov" target="_blank">codes-ross-users-bounces@lists.mcs.anl.gov</a>&gt; on behalf of Harsh Khetawat &lt;<a href="mailto:hkhetaw@ncsu.edu" target="_blank">hkhetaw@ncsu.edu</a>&gt;<br>
<span style="font-weight:bold">Date: </span>Thursday, July 6, 2017 at 3:37 PM<br>
<span style="font-weight:bold">To: </span>"<a href="mailto:codes-ross-users@lists.mcs.anl.gov" target="_blank">codes-ross-users@lists.mcs.anl.gov</a>" &lt;<a href="mailto:codes-ross-users@lists.mcs.anl.gov" target="_blank">codes-ross-users@lists.mcs.anl.gov</a>&gt;<br>
<span style="font-weight:bold">Cc: </span>Harsh Khetawat &lt;<a href="mailto:hkhetaw@ncsu.edu" target="_blank">hkhetaw@ncsu.edu</a>&gt;<br>
<span style="font-weight:bold">Subject: </span>[codes-ross-users] Help with Storage Simulation<br>
</div>
<div>
<div class="m_-8790270060582505694m_4082410634883130483h5">
<div><br>
</div>
<div>
<div>
<div dir="ltr">Hi,
<div><br>
</div>
<div>I have set up CODES on my cluster and would like to use it to simulate different storage strategies. </div>
<div><br>
</div>
<div>I want to use the Darshan logs I have collected, along with the local-storage-model in CODES, to simulate the I/O behavior of the application. </div>
<div><br>
</div>
<div>My simulation would not have just one global storage LP, but instead: (i) one storage LP per terminal, or (ii) one storage LP per Modelnet_group, or (iii) other strategies. </div>
<div>Would the local-storage-model be the correct choice for this?</div>
<div>Should I write my own storage model for each scenario?</div>
<div>How would I go about incorporating this with the fattree or dragonfly network models?</div>
<div><br>
</div>
<div>Thanks for the help.</div>
<div><br>
</div>
<div>Regards,</div>
<div>Harsh Khetawat</div>
<div><br>
</div>
</div>
</div>
</div>
</div>
</div>
</span></div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</div></div></span>
</div>
</blockquote></div><br></div>