[petsc-dev] [Radev, Martin] Re: Adding a new encoding for FP data

Smith, Barry F. bsmith at mcs.anl.gov
Thu Jul 11 10:56:32 CDT 2019


  Sorry, I wasn't clear. Just meant something simpler. Compress the matrix to copy it to the GPU for faster transfers (and uncompress it appropriately on the GPU).

   Barry


> On Jul 11, 2019, at 10:49 AM, Jed Brown <jed at jedbrown.org> wrote:
> 
> I don't know anything about zstd (or competitive compression) for GPU,
> but doubt it works at the desired granularity.  I think SpMV on late-gen
> CPUs can be accelerated by zstd column index compression, especially for
> semi-structured problems, but likely also for unstructured problems
> numbered by breadth-first search or similar.  But we'd need to demo that
> use specifically.
> 
> "Smith, Barry F." <bsmith at mcs.anl.gov> writes:
> 
>>  CPU to GPU? Especially matrices?
>> 
>>> On Jul 11, 2019, at 9:05 AM, Jed Brown via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>> 
>>> Zstd is a remarkably good compressor.  I've experimented with it for
>>> compressing column indices for sparse matrices on structured grids and
>>> (after a simple transform: subtracting the row number) gotten
>>> decompression speed in the neighborhood of 10 GB/s (i.e., faster per
>>> core than DRAM).  I've been meaning to follow up.  The transformation
>>> described below (splitting the bytes) yields decompression speed
>>> around 1 GB/s (see the link below), which isn't competitive for things
>>> like MatMult, but could be useful for things like trajectory
>>> checkpointing.
>>> 
>>> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
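>>> 
>>> To make the column-index transform concrete, here is a rough sketch assuming ordinary CSR arrays (the names are illustrative, not actual PETSc code); the delta array is what would then be handed to zstd:
>>> 
>>>   #include <stddef.h>
>>> 
>>>   /* Delta-encode CSR column indices before compression: for
>>>      (semi-)structured problems the column index stays close to the
>>>      row number, so the deltas are small and highly repetitive. */
>>>   static void csr_colidx_delta_encode(size_t m, const size_t *rowptr,
>>>                                       const int *colidx, int *delta)
>>>   {
>>>     for (size_t i = 0; i < m; i++)
>>>       for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
>>>         delta[k] = colidx[k] - (int)i;
>>>   }
>>> 
>>>   /* Decoding inverts this after decompression: colidx[k] = delta[k] + (int)i. */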
>>> 
>>> 
>>> From: "Radev, Martin" <martin.radev at tum.de>
>>> Subject: Re: Adding a new encoding for FP data
>>> Date: July 11, 2019 at 4:55:03 AM CDT
>>> To: "dev at arrow.apache.org" <dev at arrow.apache.org>
>>> Cc: "Raoofy, Amir" <amir.raoofy at tum.de>, "Karlstetter, Roman" <roman.karlstetter at tum.de>
>>> Reply-To: <dev at arrow.apache.org>
>>> 
>>> 
>>> Hello Liya Fan,
>>> 
>>> 
>>> this explains the technique but for a more complex case:
>>> 
>>> https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/
>>> 
>>> For FP data, the approach which seemed to be the best is the following.
>>> 
>>> Say we have a buffer of two 32-bit floating point values:
>>> 
>>> buf = [af, bf]
>>> 
>>> We interpret each FP value as a 32-bit uint and look at each individual byte. We have 8 bytes in total for this small input.
>>> 
>>> buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]
>>> 
>>> Then we apply stream splitting and the new buffer becomes:
>>> 
>>> newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]
>>> 
>>> We compress newbuf.
>>> 
>>> Due to similarities in the sign bits, mantissa bits and most-significant exponent bits, we might have a lot more repetitions in the data. For scientific data, the 2nd and 3rd bytes of each 32-bit value are probably largely noise. Thus, in the original representation, we would always have a few bytes of data which could appear somewhere else in the buffer, followed by a couple of bytes of likely noise. In the new representation, we have a long stream of data which could compress well, followed by a sequence of noise towards the end.
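>>> 
>>> For illustration, a minimal C sketch of the split for 32-bit floats (the function name and layout here are only illustrative, not the exact Arrow patch); decoding simply inverts the index mapping:
>>> 
>>>   #include <stdint.h>
>>>   #include <stddef.h>
>>> 
>>>   /* Regroup the bytes of n 32-bit values: all byte-0s first, then all
>>>      byte-1s, and so on, so buf = [af, bf] becomes
>>>      newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]. */
>>>   static void stream_split_f32(const float *src, uint8_t *dst, size_t n)
>>>   {
>>>     const uint8_t *bytes = (const uint8_t *)src;
>>>     for (size_t i = 0; i < n; i++)
>>>       for (size_t b = 0; b < 4; b++)
>>>         dst[b * n + i] = bytes[4 * i + b];   /* byte b of value i */
>>>   }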
>>> 
>>> This transformation improved the compression ratio, as can be seen in the report.
>>> 
>>> It also improved speed for zstd. This could be because zstd decides how to compress the data: RLE, a new Huffman tree, the Huffman tree of the previous frame, or the raw representation. Each choice can achieve a different compression ratio and compression/decompression speed. It turned out that when the transformation is applied, zstd attempts to compress fewer frames and copies the others verbatim, which could lead to fewer attempts to build a Huffman tree. It's hard to pinpoint the exact reason.
>>> 
>>> I did not try other lossless text compressors but I expect similar results.
>>> 
>>> For code, I can polish my patches, create a Jira task and submit the patches for review.
>>> 
>>> 
>>> Regards,
>>> 
>>> Martin
>>> 
>>> 
>>> ________________________________
>>> From: Fan Liya <liya.fan03 at gmail.com>
>>> Sent: Thursday, July 11, 2019 11:32:53 AM
>>> To: dev at arrow.apache.org
>>> Cc: Raoofy, Amir; Karlstetter, Roman
>>> Subject: Re: Adding a new encoding for FP data
>>> 
>>> Hi Radev,
>>> 
>>> Thanks for the information. It seems interesting.
>>> IMO, Arrow has much to do for data compression. However, it seems there are
>>> some differences between in-memory data compression and external-storage data
>>> compression.
>>> 
>>> Could you please provide some reference for stream splitting?
>>> 
>>> Best,
>>> Liya Fan
>>> 
>>> On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin <martin.radev at tum.de> wrote:
>>> 
>>>> Hello people,
>>>> 
>>>> 
>>>> there has been discussion on the Apache Parquet mailing list about adding a
>>>> new encoder for FP data.
>>>> The reason for this is that the compressors supported by Apache Parquet
>>>> (zstd, gzip, etc.) do not compress raw FP data well.
>>>> 
>>>> 
>>>> In my investigation it turns out that a very simple technique,
>>>> named stream splitting, can improve the compression ratio and even the speed
>>>> for some of the compressors.
>>>> 
>>>> You can read about the results here:
>>>> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
>>>> 
>>>> 
>>>> I went through the developer guide for Apache Arrow and wrote a patch to
>>>> add the new encoding and test coverage for it.
>>>> 
>>>> I will polish my patch and work in parallel to extend the Apache Parquet
>>>> format for the new encoding.
>>>> 
>>>> 
>>>> If you have any concerns, please let me know.
>>>> 
>>>> 
>>>> Regards,
>>>> 
>>>> Martin
>>>> 
>>>> 
>>> 
>>> 


