[codes-ross-users] Help with Storage Simulation

Mubarak, Misbah mmubarak at anl.gov
Fri Jul 7 12:58:27 CDT 2017


Hi Harsh,

You can refer to the instructions at the wiki for running the simulation:

https://xgitlab.cels.anl.gov/codes/codes-storage-server/wikis/checkpoint-study

The right config file to use is test-checkpoint-dfly-1T.conf. The one you mentioned is outdated, so I have removed it to avoid confusion. For a test run, you can reduce the checkpoint_sz parameter in the config (it is currently set to 1 TiB, which takes some time to complete).
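
If you would rather script the change than edit the file by hand, here is a minimal sketch. It assumes the config stores parameters as key="value"; assignments; the helper name set_config_param and the replacement value "0.01" are placeholders of mine, so please check the wiki for the value and unit that checkpoint_sz actually expects.

```python
import re


def set_config_param(text: str, key: str, value: str) -> str:
    """Return `text` with every `key = "old";` assignment set to `value`.

    Assumption: parameters are written as key="value"; with the value
    in double quotes. This helper is hypothetical, not part of CODES.
    """
    pattern = re.compile(r'(\b' + re.escape(key) + r'\s*=\s*")[^"]*(")')
    new_text, count = pattern.subn(
        lambda m: m.group(1) + value + m.group(2), text
    )
    if count == 0:
        # Fail loudly rather than silently running the full-size test.
        raise KeyError(f"{key} not found in config text")
    return new_text


if __name__ == "__main__":
    path = "test-checkpoint-dfly-1T.conf"  # adjust to the path in your checkout
    with open(path) as f:
        original = f.read()
    # "0.01" is a placeholder value, not a recommendation.
    with open("test-checkpoint-small.conf", "w") as f:
        f.write(set_config_param(original, "checkpoint_sz", "0.01"))
```

Writing the result to a separate file keeps the original config untouched, so the full-size run can still be reproduced later.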

To use the Darshan workload generator, please use Darshan 2.x (Darshan 3.0 and later are not currently supported).

Thanks,
Misbah

From: Harsh Khetawat <hkhetaw at ncsu.edu>
Date: Friday, July 7, 2017 at 1:36 PM
To: Misbah Mubarak <mmubarak at anl.gov>
Cc: "codes-ross-users at lists.mcs.anl.gov" <codes-ross-users at lists.mcs.anl.gov>
Subject: Re: [codes-ross-users] Help with Storage Simulation

Hi,

I have been trying to run the tests/test-checkpoint.c simulation as you suggested, with the intention of building on top of it with Darshan workloads. But when I try to run the simulation (without any changes) with:


aprun -n4 ./test-checkpoint --sync=3 --codes-config=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/codes-storage-server/tests/conf/test-checkpoint.conf --lp-io-dir=test-checkpoint-output-ser


I get the following error:


node: 0: error: ../src/util/codes_mapping.c:487: could not find LP with type name "dragonfly_router", did you forget to register the LP?
node: 1: error: ../src/util/codes_mapping.c:487: could not find LP with type name "dragonfly_router", did you forget to register the LP?
node: 2: error: ../src/util/codes_mapping.c:487: could not find LP with type name "dragonfly_router", did you forget to register the LP?
Rank 0 [Fri Jul  7 13:12:24 2017] [c2-1c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

I checked the source code for the test, and the LP seems to be registered properly. Any idea why this could be happening?
Also, would it be the right approach to use the darshan-workload-generator to issue I/O requests to the storage model, instead of the checkpoint I/O that the test case issues right now?
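
One way to narrow down an error like this is to compare the LP type names the config declares against the names the binary registers. The sketch below is only a diagnostic aid under stated assumptions: it assumes the config has an LPGROUPS section whose groups list LP types as name="count"; entries alongside a repetitions key, and the registered names passed to missing_lp_types are ones you collect yourself from the source. Both function names are hypothetical and not part of CODES.

```python
import re


def lp_type_names(config_text: str) -> set:
    """Collect LP type names declared inside the LPGROUPS section.

    Assumption: groups list LP types as name="count"; entries, with
    `repetitions` being the only non-LP key in a group.
    """
    start = config_text.find("LPGROUPS")
    open_idx = config_text.find("{", start) if start >= 0 else -1
    if open_idx < 0:
        return set()
    # Walk the braces to find where the LPGROUPS block closes.
    depth = 0
    end = len(config_text)
    for i in range(open_idx, len(config_text)):
        if config_text[i] == "{":
            depth += 1
        elif config_text[i] == "}":
            depth -= 1
            if depth == 0:
                end = i
                break
    body = config_text[open_idx:end]
    pairs = re.findall(r'(\w+)\s*=\s*"([^"]*)"\s*;', body)
    return {key for key, _ in pairs if key != "repetitions"}


def missing_lp_types(config_text: str, registered) -> set:
    """Names the config asks for that are absent from `registered`."""
    return lp_type_names(config_text) - set(registered)
```

Any name this reports is declared in the config but missing from the list you supplied, which would be consistent with the "could not find LP with type name" message; it does not by itself prove that is the cause here.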

Thanks,
Harsh Khetawat

On Fri, Jul 7, 2017 at 9:40 AM, Harsh Khetawat <hkhetaw at ncsu.edu> wrote:
Hi Misbah,

Thank you for the information. This seems perfect for what I want to do. I'll try to set up a simulation that replays Darshan logs and see how it goes.
Thanks again.

Regards,
Harsh Khetawat

On Thu, Jul 6, 2017 at 4:02 PM, Mubarak, Misbah <mmubarak at anl.gov> wrote:
Hi Harsh,

Have you looked at the codes-storage-server model? There is a wiki at:

https://xgitlab.cels.anl.gov/codes/codes-storage-server/wikis/home

The wiki has an example case study that models storage nodes on the dragonfly network. The clients (compute nodes) interact with the storage nodes via an API that does RDMA-style communication. An example of the API usage is in tests/test-client-checkpoint.c (look for the function send_req_to_store). The API itself can be found in codes-store-lp.h.

The repo can be cloned from:
https://xgitlab.cels.anl.gov/codes/codes-storage-server

The CODES storage model makes use of the local-storage-model in CODES. Attached is a diagram showing the sequence of steps for a write operation. The read operation is also implemented; you have to specify a flag indicating whether a request is a read or a write.

Hope that helps.

Regards,
Misbah

From: <codes-ross-users-bounces at lists.mcs.anl.gov> on behalf of Harsh Khetawat <hkhetaw at ncsu.edu>
Date: Thursday, July 6, 2017 at 3:37 PM
To: "codes-ross-users at lists.mcs.anl.gov" <codes-ross-users at lists.mcs.anl.gov>
Cc: Harsh Khetawat <hkhetaw at ncsu.edu>
Subject: [codes-ross-users] Help with Storage Simulation

Hi,

I have set up CODES on my cluster and would like to use it to simulate different storage strategies.

I want to use the Darshan logs I have collected, along with the local-storage-model in CODES, to simulate the I/O behavior of the application.

My simulation would not have just one global storage LP. Instead, it would have: (i) one storage LP per terminal, or (ii) one storage LP per Modelnet_group, or (iii) other strategies.
Would the local-storage-model be the correct choice for this?
Should I write my own storage model for each scenario?
How would I go about incorporating this with the fattree or dragonfly network models?

Thanks for the help.

Regards,
Harsh Khetawat