[petsc-dev] [Radev, Martin] Re: Adding a new encoding for FP data

Zhang, Junchao jczhang at mcs.anl.gov
Thu Jul 11 15:15:03 CDT 2019


A side question: Do lossy compressors have value for PETSc?

--Junchao Zhang

On Thu, Jul 11, 2019 at 9:06 AM Jed Brown via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
Zstd is a remarkably good compressor.  I've experimented with it for
compressing column indices for sparse matrices on structured grids and
(after a simple transform: subtracting the row number) gotten
decompression speed in the neighborhood of 10 GB/s (i.e., faster per
core than DRAM).  I've been meaning to follow up.  The transformation
described below (splitting the bytes) yields decompression speeds
around 1 GB/s (see the link below), which isn't competitive for things
like MatMult, but could be useful for things like trajectory
checkpointing.

https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
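
For illustration, a minimal sketch of the transform Jed describes, assuming CSR-style storage (a row-pointer array plus column indices); the function names and integer types here are illustrative, not PETSc API. On a structured grid the column indices in row i cluster around i, so subtracting the row number leaves small, highly repetitive residuals that compress far better than the raw indices.

#include <stdint.h>

/* Subtract the row number from every column index in row i.  On a
 * structured grid the residuals fall in a small range (e.g. around
 * -nx..nx for a 2D 5-point stencil), which compresses well. */
static void ColIndexDelta(int64_t nrows, const int64_t *rowptr, int32_t *colidx)
{
  for (int64_t i = 0; i < nrows; i++)
    for (int64_t k = rowptr[i]; k < rowptr[i+1]; k++)
      colidx[k] -= (int32_t)i;
}

/* Inverse transform, applied after decompression. */
static void ColIndexUndelta(int64_t nrows, const int64_t *rowptr, int32_t *colidx)
{
  for (int64_t i = 0; i < nrows; i++)
    for (int64_t k = rowptr[i]; k < rowptr[i+1]; k++)
      colidx[k] += (int32_t)i;
}

The residual buffer is then what gets handed to the compressor.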




---------- Forwarded message ----------
From: "Radev, Martin" <martin.radev at tum.de<mailto:martin.radev at tum.de>>
To: "dev at arrow.apache.org<mailto:dev at arrow.apache.org>" <dev at arrow.apache.org<mailto:dev at arrow.apache.org>>
Cc: "Raoofy, Amir" <amir.raoofy at tum.de<mailto:amir.raoofy at tum.de>>, "Karlstetter, Roman" <roman.karlstetter at tum.de<mailto:roman.karlstetter at tum.de>>
Bcc:
Date: Thu, 11 Jul 2019 09:55:03 +0000
Subject: Re: Adding a new encoding for FP data
Hello Liya Fan,


This explains the technique, but for a more complex case:

https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/

For FP data, the approach which seemed best is the following.

Say we have a buffer of two 32-bit floating point values:

buf = [af, bf]

We interpret each FP value as a 32-bit uint and look at each individual byte. We have 8 bytes in total for this small input.

buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]

Then we apply stream splitting and the new buffer becomes:

newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]

We compress newbuf.

Due to similarities in the sign bits, mantissa bits and MSB exponent bits, we might have a lot more repetition in the data. For scientific data, the 2nd and 3rd bytes of each 32-bit value are probably largely noise. Thus in the original representation we would always have a few bytes of data which could appear somewhere else in the buffer, followed by a couple of bytes of likely noise. In the new representation we get a long stream of data which compresses well, followed by a sequence of noise towards the end.
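
As a concrete sketch (assuming 4-byte floats; this is not taken from the actual patch), the split and its inverse can be written as:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Scatter byte j of every float into stream j, so bytes of equal
 * significance become contiguous: [af0, bf0, af1, bf1, ...] as in the
 * example above.  out must hold 4*n bytes. */
static void StreamSplit(const float *in, uint8_t *out, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    uint8_t b[4];
    memcpy(b, &in[i], 4);        /* reinterpret the float as 4 raw bytes */
    for (size_t j = 0; j < 4; j++)
      out[j*n + i] = b[j];
  }
}

/* Inverse: gather byte j of value i back out of stream j. */
static void StreamUnsplit(const uint8_t *in, float *out, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    uint8_t b[4];
    for (size_t j = 0; j < 4; j++)
      b[j] = in[j*n + i];
    memcpy(&out[i], b, 4);
  }
}

Note that the byte order within each stream depends on the machine's endianness; the labels af0..af3 above are simply the bytes in memory order.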

This transformation improved the compression ratio, as can be seen in the report.

It also improved speed for ZSTD. This could be because zstd makes a decision about how to compress the data - RLE, a new Huffman tree, the Huffman tree of the previous frame, or a raw representation. Each can potentially achieve a different compression ratio and compression/decompression speed. It turned out that when the transformation is applied, zstd would attempt to compress fewer frames and copy the others verbatim. This could lead to fewer attempts to build a Huffman tree. It's hard to pinpoint the exact reason.
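
To make the comparison concrete, here is a sketch of how the split buffer would be handed to zstd's one-shot API (StreamSplit is the sketch above; compression level 3 is just an example choice, and malloc error handling is elided):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

/* Byte-split n floats, compress with zstd, and print the achieved
 * ratio.  Returns the compressed size, or 0 on error; *out owns the
 * compressed buffer. */
static size_t CompressSplit(const float *vals, size_t n, void **out)
{
  uint8_t *split = malloc(4*n);
  StreamSplit(vals, split, n);                  /* transform first */
  size_t bound = ZSTD_compressBound(4*n);       /* worst-case output size */
  *out = malloc(bound);
  size_t csize = ZSTD_compress(*out, bound, split, 4*n, 3);
  free(split);
  if (ZSTD_isError(csize)) { free(*out); *out = NULL; return 0; }
  printf("compression ratio: %.2f\n", (double)(4*n)/(double)csize);
  return csize;
}

Dropping the StreamSplit call gives the baseline ratio for comparison.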

I did not try other lossless text compressors but I expect similar results.

For the code, I can polish my patches, create a Jira task, and submit them for review.


Regards,

Martin


________________________________
From: Fan Liya <liya.fan03 at gmail.com>
Sent: Thursday, July 11, 2019 11:32:53 AM
To: dev at arrow.apache.org
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Adding a new encoding for FP data

Hi Radev,

Thanks for the information. It seems interesting.
IMO, Arrow has much to do for data compression. However, there seem to be
some differences between in-memory data compression and external-storage
data compression.

Could you please provide some reference for stream splitting?

Best,
Liya Fan

On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin <martin.radev at tum.de> wrote:

> Hello people,
>
>
> There has been discussion on the Apache Parquet mailing list about adding a
> new encoder for FP data.
> The reason for this is that the compressors supported by Apache Parquet
> (zstd, gzip, etc.) do not compress raw FP data well.
>
>
> In my investigation it turns out that a very simple technique,
> named stream splitting, can improve the compression ratio and even speed
> for some of the compressors.
>
> You can read about the results here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
>
>
> I went through the developer guide for Apache Arrow and wrote a patch to
> add the new encoding and test coverage for it.
>
> I will polish my patch and work in parallel to extend the Apache Parquet
> format for the new encoding.
>
>
> If you have any concerns, please let me know.
>
>
> Regards,
>
> Martin
>
>