[codes-ross-users] Help with Storage Simulation

Harsh Khetawat hkhetaw at ncsu.edu
Mon Jul 10 14:07:38 CDT 2017


Hi,

Thanks for the help with codes-storage-server.
I was trying to run the client-mul-wklds test on the cluster here using the
instructions in the wiki. First I tried with the exact command from the
wiki:

./client-mul-wklds --sync=1 \
    --workload-conf-file=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/codes-storage-server/checkpoint-study/workload-files/workload-512.conf \
    --rank-alloc-file=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/codes-storage-server/checkpoint-study/config-files/ \
    --codes-config=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/codes-storage-server/checkpoint-study/config-files/test-checkpoint-dfly-1T-adap.conf \
    --lp-io-dir=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/test-dir \
    --lp-io-use-suffix=1


But the rank-alloc-file argument in the wiki instructions is a directory,
so, as expected, I got an error.
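
A small pre-flight check can catch this kind of mistake before the job is submitted. The sketch below is only an illustration: the helper name check_file_args is hypothetical, and the assumption that these three options must each name a regular file is inferred from the command above, not taken from the CODES documentation.

```python
import os

# Options from the command above that are assumed to name regular files.
FILE_OPTIONS = ("--workload-conf-file", "--rank-alloc-file", "--codes-config")


def check_file_args(argv):
    """Return (option, path, problem) for file options that do not name a regular file."""
    problems = []
    for arg in argv:
        name, sep, value = arg.partition("=")
        if not sep or name not in FILE_OPTIONS:
            continue  # not an --option=value pair we care about
        if os.path.isdir(value):
            problems.append((name, value, "is a directory"))
        elif not os.path.isfile(value):
            problems.append((name, value, "does not exist"))
    return problems
```

Passing the argument list of the command above to check_file_args would flag --rank-alloc-file as a directory.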

Then I tried passing the rank allocation file
checkpoint-study/workload-files/allocations/contiguous/cont-alloc-8832-2048.conf
instead, but I got this error:

512 instances of workload checkpoint
node: 0: error: ../src/networks/model-net/dragonfly-custom.C:630: intra-group file not found
Rank 0 [Mon Jul 10 13:58:50 2017] [c6-1c0s6n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
_pmiu_daemon(SIGCHLD): [NID 02348] [c6-1c0s6n2] [Mon Jul 10 13:58:50 2017] PE RANK 0 exit signal Aborted
Application 15031648 exit codes: 134
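
The "intra-group file not found" message suggests that the network config refers to another file that could not be opened. One way to narrow this down is to list every path-like value in the config and check whether it exists. The sketch below is an assumption-laden illustration, not part of CODES: the helper name missing_paths is hypothetical, the key="value"; entry syntax is assumed, any value containing "/" is treated as a path, and relative paths are resolved against a base_dir that you choose (for example the directory the simulation is launched from).

```python
import os
import re

# Assumed entry syntax: key="value"; (an assumption about the config layout).
ENTRY = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"\s*;')


def missing_paths(config_text, base_dir="."):
    """Return (key, resolved_path) for path-like values that are not existing files."""
    missing = []
    for key, value in ENTRY.findall(config_text):
        if "/" not in value:
            continue  # heuristic: only values containing "/" are treated as paths
        path = value if os.path.isabs(value) else os.path.join(base_dir, value)
        if not os.path.isfile(path):
            missing.append((key, os.path.normpath(path)))
    return missing
```

Reading the config file into a string and calling missing_paths(text, base_dir) with different base directories shows which referenced files cannot be found from where.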


I have also set the $HOME_CODES environment variable correctly.
Also, the wiki instructions specify the codes-config file as
"test-checkpoint-dfly-1T.conf", but no such file exists there, so I
tried both "test-checkpoint-dfly-1T-adap.conf" and
"test-checkpoint-dfly-1T-min.conf", without success.


Thanks again for all the help.

Thanks,
Harsh

On Fri, Jul 7, 2017 at 1:58 PM, Mubarak, Misbah <mmubarak at anl.gov> wrote:

> Hi Harsh,
>
> You can refer to the instructions at the wiki for running the simulation:
>
> https://xgitlab.cels.anl.gov/codes/codes-storage-server/wikis/checkpoint-study
>
> The right config file to use is test-checkpoint-dfly-1T.conf. The one you
> mentioned is outdated, so I have removed it to avoid confusion. For a test
> run, you can reduce the checkpoint_sz parameter in the config (it is
> currently set to 1 TiB, which takes some time to complete).
>
> For using the Darshan workload generator, please use Darshan versions 2.x
> (Darshan 3.0 and up are not supported right now).
>
> Thanks,
> Misbah
> From: Harsh Khetawat <hkhetaw at ncsu.edu>
> Date: Friday, July 7, 2017 at 1:36 PM
> To: Misbah Mubarak <mmubarak at anl.gov>
> Cc: "codes-ross-users at lists.mcs.anl.gov" <codes-ross-users at lists.mcs.anl.gov>
> Subject: Re: [codes-ross-users] Help with Storage Simulation
>
> Hi,
>
> I have been trying to run the tests/test-checkpoint.c simulation as you
> had suggested, with the intention of building on top of it with darshan
> workloads. But when I try to run the simulation (without any changes) with:
>
> aprun -n4 ./test-checkpoint --sync=3 \
>     --codes-config=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/codes-storage-server/tests/conf/test-checkpoint.conf \
>     --lp-io-dir=test-checkpoint-output-ser
>
>
> I get the following error:
>
> node: 0: error: ../src/util/codes_mapping.c:487: could not find LP with type name "dragonfly_router", did you forget to register the LP?
> node: 1: error: ../src/util/codes_mapping.c:487: could not find LP with type name "dragonfly_router", did you forget to register the LP?
> node: 2: error: ../src/util/codes_mapping.c:487: could not find LP with type name "dragonfly_router", did you forget to register the LP?
>
> Rank 0 [Fri Jul  7 13:12:24 2017] [c2-1c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>
> I checked the source code for the test and the LP seems to be
> registered properly. Any idea why this could be happening?
> Also, would it be the right approach to use the
> darshan-workload-generator to issue I/O requests to the storage-model,
> instead of the checkpointing I/O that the test case issues right now?
>
> Thanks,
> Harsh Khetawat
>
> On Fri, Jul 7, 2017 at 9:40 AM, Harsh Khetawat <hkhetaw at ncsu.edu> wrote:
>
>> Hi Misbah,
>>
>> Thank you for the information. This seems perfect for what I want to do.
>> I'll try to set up a simulation with replays of darshan logs and see how it
>> goes.
>> Thanks again.
>>
>> Regards,
>> Harsh Khetawat
>>
>> On Thu, Jul 6, 2017 at 4:02 PM, Mubarak, Misbah <mmubarak at anl.gov> wrote:
>>
>>> Hi Harsh,
>>>
>>> Have you looked at the codes-storage-server model? There is a wiki at:
>>>
>>> https://xgitlab.cels.anl.gov/codes/codes-storage-server/wikis/home
>>>
>>> The wiki has an example case study that models storage nodes on the
>>> dragonfly network. The clients (compute nodes) interact with the storage
>>> nodes via an API that does RDMA-style communication. You can find an
>>> example of the API usage in tests/test-client-checkpoint.c (look for the
>>> function send_req_to_store). The API itself can be found in
>>> codes-store-lp.h.
>>>
>>> The repo can be cloned from:
>>> https://xgitlab.cels.anl.gov/codes/codes-storage-server
>>>
>>> The CODES storage model makes use of the local-storage-model in CODES.
>>> Attached is a diagram showing the sequence of steps for a write operation
>>> (the read operation is also implemented; you specify a flag indicating
>>> whether the request is a read or a write).
>>>
>>> Hope that helps.
>>>
>>> Regards,
>>> Misbah
>>> From: <codes-ross-users-bounces at lists.mcs.anl.gov> on behalf of Harsh
>>> Khetawat <hkhetaw at ncsu.edu>
>>> Date: Thursday, July 6, 2017 at 3:37 PM
>>> To: "codes-ross-users at lists.mcs.anl.gov" <codes-ross-users at lists.mcs.anl.gov>
>>> Cc: Harsh Khetawat <hkhetaw at ncsu.edu>
>>> Subject: [codes-ross-users] Help with Storage Simulation
>>>
>>> Hi,
>>>
>>> I have set up CODES on my cluster and would like to use it to simulate
>>> different storage strategies.
>>>
>>> I wanted to use the darshan logs I have collected along with the
>>> local-storage-model in CODES to simulate I/O behavior of the application.
>>>
>>> Rather than a single global storage LP, my simulation would have:
>>> (i) one storage LP per terminal, (ii) one storage LP per
>>> Modelnet_group, or (iii) other strategies.
>>> Would the local-storage-model be the correct choice for this?
>>> Should I write my own storage model for each scenario?
>>> How would I go about incorporating this with the fattree or dragonfly
>>> network models?
>>>
>>> Thanks for the help.
>>>
>>> Regards,
>>> Harsh Khetawat
>>>
>>>
>>
>