[codes-ross-users] Help with Storage Simulation

Harsh Khetawat hkhetaw at ncsu.edu
Fri Jul 28 08:40:19 CDT 2017


Hi,

I have been working on the codes-storage-server simulation, replacing the
checkpoint workload with a darshan workload. Things seem to be working
fine and I am getting the results I expect, but the simulation takes a
long time to complete even for moderately sized darshan workloads.

In fact, the sequential simulation seems to run faster than the parallel
conservative one. With sync=1, the simulation makes much faster progress
than it does with sync=2 on 64 or more nodes.

I have tried other configurations as well but there's no observable
speed-up. Any ideas on why that could be happening?
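
To put numbers on the comparison, both modes can be timed with identical inputs. The following is only a minimal Python sketch: the binary name ./my-darshan-sim and the placeholder paths are hypothetical, the launcher tuple should be whatever the cluster uses (aprun here), and only the --sync, --codes-config and --lp-io-dir flags are taken from the commands in this thread.

```python
import subprocess
import sys
import time


def build_command(binary, sync, config, lp_io_dir, launcher=()):
    """Assemble the argument list for one run.

    launcher is an optional prefix such as ("aprun", "-n", "64").
    """
    return [
        *launcher,
        binary,
        f"--sync={sync}",
        f"--codes-config={config}",
        f"--lp-io-dir={lp_io_dir}",
    ]


def time_command(argv):
    """Run argv to completion; return (wall-clock seconds, exit code)."""
    start = time.perf_counter()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start, result.returncode


def speedup(sequential_seconds, parallel_seconds):
    """Ratio above 1.0 means the parallel run was faster."""
    return sequential_seconds / parallel_seconds


if __name__ == "__main__":
    # Hypothetical binary and paths, shown only to illustrate the call shape.
    seq = build_command("./my-darshan-sim", 1, "/path/to/sim.conf", "out-seq")
    par = build_command("./my-darshan-sim", 2, "/path/to/sim.conf", "out-par",
                        launcher=("aprun", "-n", "64"))
    print(" ".join(seq))
    print(" ".join(par))
    # Sanity check of the timer itself on a trivial command.
    elapsed, code = time_command([sys.executable, "-c", "pass"])
    print(f"timer self-check: exit code {code}")
```

Passing the two measured times to speedup() gives a single ratio that can be compared across node counts.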

Thanks,
Harsh

On Tue, Jul 11, 2017 at 9:32 AM, Mubarak, Misbah <mmubarak at anl.gov> wrote:

> Great, thanks for the heads up!
>
> From: Harsh Khetawat <hkhetaw at ncsu.edu>
> Date: Monday, July 10, 2017 at 4:50 PM
> To: Misbah Mubarak <mmubarak at anl.gov>
>
> Subject: Re: [codes-ross-users] Help with Storage Simulation
>
> Yes, it works now. Thank you.
> I realized what the problem was: the inter- and intra-group connectivity
> file paths needed to be absolute in the conf file in my case. It seemed to
> have a problem picking them up from the environment variable for some reason.
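
A relative path of this kind can be caught before launching by scanning the conf file. This is only a minimal Python sketch: it assumes the conf file uses key="value"; assignments, and the key names intra-group-connections and inter-group-connections are assumptions for illustration, so substitute whatever keys your conf file actually uses.

```python
import os
import re

# Matches lines of the form  key="value";  (this layout is an assumption
# about the conf syntax, not something confirmed in the thread).
_ASSIGNMENT = re.compile(
    r'^\s*([A-Za-z0-9_\-]+)\s*=\s*"([^"]*)"\s*;', re.MULTILINE
)


def find_relative_paths(conf_text, keys):
    """Return (key, value) pairs for the given keys whose value is not absolute."""
    problems = []
    for key, value in _ASSIGNMENT.findall(conf_text):
        if key in keys and not os.path.isabs(value):
            problems.append((key, value))
    return problems


if __name__ == "__main__":
    # Hypothetical key names and paths, for illustration only.
    sample = (
        'intra-group-connections="conf/dragonfly-custom/intra-file";\n'
        'inter-group-connections="/abs/path/to/inter-file";\n'
    )
    wanted = {"intra-group-connections", "inter-group-connections"}
    for key, value in find_relative_paths(sample, wanted):
        print(f"{key} is not an absolute path: {value}")
```

The check only looks at the path text; it does not expand environment variables or test whether the file exists.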
>
> Thanks,
> Harsh
>
> On Mon, Jul 10, 2017 at 4:26 PM, Mubarak, Misbah <mmubarak at anl.gov> wrote:
>
>> FYI, I have updated the instructions to point to the right allocation
>> file. I think the problem was that, in your case, there was a mismatch
>> between the allocation file and the workloads file. You can give it a try and
>> let me know if that works.
>>
>> Thanks,
>> Misbah
>> From: Misbah Mubarak <mmubarak at anl.gov>
>> Date: Monday, July 10, 2017 at 2:25 PM
>> To: Harsh Khetawat <hkhetaw at ncsu.edu>
>>
>> Subject: Re: [codes-ross-users] Help with Storage Simulation
>>
>> Hi Harsh,
>>
>> The test-checkpoint-dfly-1T config file is actually in the repo:
>>
>> https://xgitlab.cels.anl.gov/codes/codes-storage-server/blob/master/tests/conf/test-checkpoint-dfly-1T.conf
>>
>> You will have to adjust the paths for the input network configuration
>> files. The network config files are located in codes at the following path:
>>
>> codes/src/network-workloads/conf/dragonfly-custom/intra-9K-custom
>>
>> Thanks,
>> Misbah
>> From: Harsh Khetawat <hkhetaw at ncsu.edu>
>> Date: Monday, July 10, 2017 at 3:07 PM
>> To: Misbah Mubarak <mmubarak at anl.gov>
>> Cc: "codes-ross-users at lists.mcs.anl.gov" <codes-ross-users at lists.mcs.anl.gov>
>> Subject: Re: [codes-ross-users] Help with Storage Simulation
>>
>> Hi,
>>
>> Thanks for the help with codes-storage-server.
>> I was trying to run the client-mul-wklds test on the cluster here using
>> the instructions in the wiki. First I tried with the exact command from the
>> wiki:
>>
>> ./client-mul-wklds --sync=1 \
>>   --workload-conf-file=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/codes-storage-server/checkpoint-study/workload-files/workload-512.conf \
>>   --rank-alloc-file=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/codes-storage-server/checkpoint-study/config-files/ \
>>   --codes-config=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/codes-storage-server/checkpoint-study/config-files/test-checkpoint-dfly-1T-adap.conf \
>>   --lp-io-dir=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/test-dir \
>>   --lp-io-use-suffix=1
>>
>>
>> But the rank-alloc-file is a directory in the wiki instructions, so, as
>> expected, I got an error.
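
A small pre-flight check on the command line can flag this kind of mistake before the job is submitted. This is a minimal Python sketch, assuming only that the three flags listed below (taken from the command above) must each point to a regular file; the example paths are hypothetical.

```python
import os

# Flags from the command above whose value should be a regular file.
FILE_FLAGS = ("--workload-conf-file", "--rank-alloc-file", "--codes-config")


def check_file_args(argv):
    """Return one message per file flag whose value is a directory or missing."""
    errors = []
    for arg in argv:
        flag, sep, value = arg.partition("=")
        if not sep or flag not in FILE_FLAGS:
            continue  # not a --flag=value pair that we validate
        if os.path.isdir(value):
            errors.append(f"{flag} points to a directory: {value}")
        elif not os.path.isfile(value):
            errors.append(f"{flag} is not an existing file: {value}")
    return errors


if __name__ == "__main__":
    # Hypothetical arguments: the current directory stands in for a
    # directory passed by mistake.
    args = [
        "--sync=1",
        f"--rank-alloc-file={os.getcwd()}",
        "--codes-config=/no/such/file.conf",
    ]
    for message in check_file_args(args):
        print(message)
```

Flags that are not in FILE_FLAGS, such as --sync or --lp-io-dir, are ignored, so the full argument list can be passed in unchanged.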
>>
>> Then I tried passing the rank alloc file
>> checkpoint-study/workload-files/allocations/contiguous/cont-alloc-8832-2048.conf,
>> but I got this error:
>>
>> 512 instances of workload checkpoint
>> node: 0: error: ../src/networks/model-net/dragonfly-custom.C:630: intra-group file not found
>>
>> Rank 0 [Mon Jul 10 13:58:50 2017] [c6-1c0s6n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>
>> _pmiu_daemon(SIGCHLD): [NID 02348] [c6-1c0s6n2] [Mon Jul 10 13:58:50 2017] PE RANK 0 exit signal Aborted
>>
>> Application 15031648 exit codes: 134
>>
>>
>> I have set the $HOME_CODES environment variable correctly as well.
>> Also, the instructions in the wiki specify the codes-config file as
>> "test-checkpoint-dfly-1T.conf", but there isn't any such file there, so I've
>> tried both "test-checkpoint-dfly-1T-adap.conf" and
>> "test-checkpoint-dfly-1T-min.conf", with no luck.
>>
>>
>> Thanks again for all the help.
>>
>> Thanks,
>> Harsh
>>
>> On Fri, Jul 7, 2017 at 1:58 PM, Mubarak, Misbah <mmubarak at anl.gov> wrote:
>>
>>> Hi Harsh,
>>>
>>> You can refer to the instructions at the wiki for running the simulation:
>>>
>>> https://xgitlab.cels.anl.gov/codes/codes-storage-server/wikis/checkpoint-study
>>>
>>> The right config file to use is test-checkpoint-dfly-1T.conf. The one
>>> you mentioned is outdated, so I have removed it to avoid confusion. For a
>>> test run, you can reduce the parameter checkpoint_sz in the config (it is
>>> currently set to 1 TiB, which takes some time to complete).
>>>
>>> For using the Darshan workload generator, please use Darshan versions
>>> 2.x (Darshan 3.0 and up are not supported right now).
>>>
>>> Thanks,
>>> Misbah
>>> From: Harsh Khetawat <hkhetaw at ncsu.edu>
>>> Date: Friday, July 7, 2017 at 1:36 PM
>>> To: Misbah Mubarak <mmubarak at anl.gov>
>>> Cc: "codes-ross-users at lists.mcs.anl.gov" <codes-ross-users at lists.mcs.anl.gov>
>>> Subject: Re: [codes-ross-users] Help with Storage Simulation
>>>
>>> Hi,
>>>
>>> I have been trying to run the tests/test-checkpoint.c simulation as you
>>> had suggested, with the intention of building on top of it with darshan
>>> workloads. But when I try to run the simulation (without any changes) with:
>>>
>>> aprun -n4 ./test-checkpoint --sync=3 \
>>>   --codes-config=/lustre/atlas/scratch/hkhetaw/gen008/CODES-darshan/codes-storage-server/tests/conf/test-checkpoint.conf \
>>>   --lp-io-dir=test-checkpoint-output-ser
>>>
>>>
>>> I get the following error:
>>>
>>> node: 0: error: ../src/util/codes_mapping.c:487: could not find LP with type name "dragonfly_router", did you forget to register the LP?
>>>
>>> node: 1: error: ../src/util/codes_mapping.c:487: could not find LP with type name "dragonfly_router", did you forget to register the LP?
>>>
>>> node: 2: error: ../src/util/codes_mapping.c:487: could not find LP with type name "dragonfly_router", did you forget to register the LP?
>>>
>>> Rank 0 [Fri Jul  7 13:12:24 2017] [c2-1c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>>
>>> I checked the source code for the test and the LP seems to be
>>> registered properly. Any idea why this could be happening?
>>> Also, would it be the right approach to use the
>>> darshan-workload-generator to issue I/O requests to the storage-model,
>>> in place of the checkpoint I/O that the test case issues right now?
>>>
>>> Thanks,
>>> Harsh Khetawat
>>>
>>> On Fri, Jul 7, 2017 at 9:40 AM, Harsh Khetawat <hkhetaw at ncsu.edu> wrote:
>>>
>>>> Hi Misbah,
>>>>
>>>> Thank you for the information. This seems perfect for what I want to
>>>> do. I'll try to set up a simulation with replays of darshan logs and see
>>>> how it goes.
>>>> Thanks again.
>>>>
>>>> Regards,
>>>> Harsh Khetawat
>>>>
>>>> On Thu, Jul 6, 2017 at 4:02 PM, Mubarak, Misbah <mmubarak at anl.gov>
>>>> wrote:
>>>>
>>>>> Hi Harsh,
>>>>>
>>>>> Have you looked at the codes-storage-server model? There is a wiki at:
>>>>>
>>>>> https://xgitlab.cels.anl.gov/codes/codes-storage-server/wikis/home
>>>>>
>>>>> The wiki has an example case study that models storage nodes on
>>>>> the dragonfly network. The clients (compute nodes) interact with the
>>>>> storage nodes via an API that does RDMA-style communication. You can find
>>>>> an example of the API usage in tests/test-client-checkpoint.c (look for
>>>>> the function send_req_to_store). The API itself can be found in
>>>>> codes-store-lp.h.
>>>>>
>>>>> The repo can be cloned from:
>>>>> https://xgitlab.cels.anl.gov/codes/codes-storage-server
>>>>>
>>>>> The CODES storage model makes use of the local-storage-model in CODES.
>>>>> Attached is a diagram showing the sequence of steps for a write operation
>>>>> (the read operation is also implemented; you have to specify a flag
>>>>> indicating whether it is a read or a write).
>>>>>
>>>>> Hope that helps.
>>>>>
>>>>> Regards,
>>>>> Misbah
>>>>> From: <codes-ross-users-bounces at lists.mcs.anl.gov> on behalf of Harsh
>>>>> Khetawat <hkhetaw at ncsu.edu>
>>>>> Date: Thursday, July 6, 2017 at 3:37 PM
>>>>> To: "codes-ross-users at lists.mcs.anl.gov" <
>>>>> codes-ross-users at lists.mcs.anl.gov>
>>>>> Cc: Harsh Khetawat <hkhetaw at ncsu.edu>
>>>>> Subject: [codes-ross-users] Help with Storage Simulation
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have set up CODES on my cluster and would like to use it to simulate
>>>>> different storage strategies.
>>>>>
>>>>> I wanted to use the darshan logs I have collected along with the
>>>>> local-storage-model in CODES to simulate I/O behavior of the application.
>>>>>
>>>>> My simulation involves not having just one global storage LP, but
>>>>> having: (i) one storage LP per terminal, or (ii) one storage LP per
>>>>> Modelnet_group, or (iii) other strategies.
>>>>> Would the local-storage-model be the correct choice for this?
>>>>> Should I write my own storage model for each scenario?
>>>>> How would I go about incorporating this with the fattree or dragonfly
>>>>> network models?
>>>>>
>>>>> Thanks for the help.
>>>>>
>>>>> Regards,
>>>>> Harsh Khetawat
>>>>>
>>>>>
>>>>
>>>
>>
>