Arithmetic coding of a single bit preserves the ordering of encoded bits, provided CDF(1) > CDF(0). If a byte's bits are encoded from the highest bit to the lowest, arithmetic coding (even with a dynamic model) will preserve the ordering of individual bytes.
In the end, arithmetic coding preserves the ordering of encoded strings. Thus, comparison operations can be performed on the compressed representation of strings (and on big-endian representations of integers and even floating-point values), without decompressing the data until the decompressed strings are actually needed.
Another view: memcmp compares strings as if they were mantissas in base 256. "hi!" is 'h'·(1/256) + 'i'·(1/256)² + '!'·(1/256)³, and after that there are zeroes to infinity. Arithmetic coding represents encoded strings as mantissas in base 2. Range coding can use other bases, such as 256.
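A quick sketch of this view (the helper name `as_base256_fraction` is mine, not from the thread): interpreting byte strings as exact base-256 fractions reproduces memcmp order, except that trailing zero bytes become invisible.

```python
from fractions import Fraction

def as_base256_fraction(s: bytes) -> Fraction:
    """Read a byte string as the base-256 fraction 0.b1 b2 b3 ..."""
    value = Fraction(0)
    for i, b in enumerate(s, start=1):
        value += Fraction(b, 256 ** i)
    return value

# Lexicographic (memcmp) order matches the numeric order of the fractions.
assert (b"hi!" < b"hm") == (as_base256_fraction(b"hi!") < as_base256_fraction(b"hm"))
# Caveat: trailing zero bytes vanish in the fraction view.
assert as_base256_fraction(b"h") == as_base256_fraction(b"h\x00")
```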
Intriguing claim! At first I was skeptical, thinking that there would be an issue with leading zeros being discarded. However, with some prompting, I was able to use Claude and ChatGPT to construct a proof of your claim.
Sketch of the argument:
First, an arithmetic coder maps strings to non-overlapping subintervals of [0, 1) that respect lexicographic order.
Second, the process of emitting the final encoding preserves this. If enc(s) ∈ I(s), enc(t) ∈ I(t), and I(s) lies entirely below I(t), then enc(s) <= enc(t).
Finally, binary fractions compared left-to-right bitwise yield the same order as their numerical values — this is just memcmp.
Thus, we have a proof of your claim that arithmetic coding preserves lexicographic order! Nice result!
My mistake was in thinking that leading zeros are discarded -- it is trailing zeros that are discarded!
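The argument can also be checked with a toy static-model coder (the three-symbol model and the names here are illustrative, not from the thread): assign each symbol a subinterval of [0, 1) in symbol order, narrow the interval per symbol, and take the low endpoint of the final interval as the code.

```python
from fractions import Fraction

# Toy static model: each symbol owns a subinterval of [0, 1),
# assigned in symbol order so the CDF respects lexicographic order.
PROBS = {"a": (Fraction(0), Fraction(1, 2)),
         "b": (Fraction(1, 2), Fraction(3, 4)),
         "c": (Fraction(3, 4), Fraction(1))}

def encode(s: str) -> Fraction:
    """Narrow [0, 1) by each symbol's subinterval; return the low end."""
    low, high = Fraction(0), Fraction(1)
    for ch in s:
        c_lo, c_hi = PROBS[ch]
        width = high - low
        low, high = low + width * c_lo, low + width * c_hi
    return low

words = sorted(["ab", "abc", "ac", "b", "ba", "ca"])
codes = [encode(w) for w in words]
assert codes == sorted(codes)  # encoded order matches lexicographic order
# A string and its extension can share a code point: the trailing-zeros effect.
assert encode("b") == encode("ba")
```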
Then there is this eternal conversation about whether one should encrypt and then compress, or compress and then encrypt.
Encrypted data will not compress well because encryption needs to remove patterns and patterns are what one exploits for compression.
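This is easy to see empirically. A minimal sketch, using random bytes as a stand-in for ciphertext (output of a sound cipher is computationally indistinguishable from random data):

```python
import os
import zlib

plaintext = b"hello world " * 1000        # highly patterned, ~12 KB
random_data = os.urandom(len(plaintext))  # stand-in for ciphertext

compressed_plain = zlib.compress(plaintext)
compressed_rand = zlib.compress(random_data)

assert len(compressed_plain) < len(plaintext)     # patterns compress well
assert len(compressed_rand) >= len(random_data)   # "ciphertext" only grows
```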
If you compress and then encrypt, yes you can leak information through the file sizes, but there isn't really a way out of this. Encryption and compression are fundamentally at odds with each other.
The biggest risk is if you are compressing secrets alongside attacker-controlled data. Then a host of attacks on the secret becomes possible (CRIME and BREACH against compressed TLS traffic are the classic examples).
Compress then encrypt is not an option because your encryption is broken if it can be compressed at all. Mathematically it's a near certainty that the compression would increase the file size when given an encrypted input.
You mistyped "compress then encrypt".
Your argument explains correctly that "encrypt then compress" is not an option, because in this order compress will do nothing, except wasting time and energy.
On the other hand, "compress then encrypt" is more secure than simple encryption, because even a weak encryption method may be difficult to break when applied only to compressed data: this use is similar to encrypting random numbers, i.e. the statistical properties of the plaintext have already been hidden.
The only disadvantage of "compress then encrypt" arises in the less frequent cases when you are more concerned with defeating traffic analysis of the amount of data sent than with saving resources, when you will pad your useful data with irrelevant junk anyway, to hide the real length of the data.
If you send data that is highly compressible, even if you want to hide the length it may be advantageous to first compress the data, and then pad it with junk before encryption, to hide the length.
Thus, you might e.g. compress the data by a factor of 8, then double the length by padding, and still send less data than without compression, while also masking the true length.
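A minimal sketch of that compress-pad-then-encrypt pipeline (the bucket size, length-prefix framing, and helper names are my assumptions; the encryption step itself is left out and would be applied to the padded blob):

```python
import secrets
import zlib

BUCKET = 4096  # pad compressed payloads up to a multiple of this (assumption)

def compress_and_pad(data: bytes) -> bytes:
    """Compress, length-prefix, and pad with random junk to a bucket boundary."""
    compressed = zlib.compress(data, level=9)
    framed = len(compressed).to_bytes(4, "big") + compressed
    pad_len = (-len(framed)) % BUCKET
    return framed + secrets.token_bytes(pad_len)  # encrypt this blob

def unpad_and_decompress(blob: bytes) -> bytes:
    """Strip padding via the length prefix and decompress."""
    n = int.from_bytes(blob[:4], "big")
    return zlib.decompress(blob[4:4 + n])

msg = b"a very repetitive message " * 500   # ~13 KB, compresses well
padded = compress_and_pad(msg)
assert len(padded) % BUCKET == 0            # true length is masked
assert len(padded) < len(msg)               # still smaller than the original
assert unpad_and_decompress(padded) == msg
```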
If you are saying that good encryption ought to make the result incompressible, then I agree with you.
Encrypt then compress is pointless.
Seems like it should be possible to compress the hell out of any column with an index. Problem is you can always drop an index on a running system.
Really interesting.
I was trying to implement a compression algorithm selection heuristic in some file format code I am developing. I found this to be too hard for me to reason about so basically gave up on it.
Feels like this blog post is getting there, but there could be a more detailed set of equations that actually calculates this based on some other parameters.
Having the code completely flexible and doing a full load production test with desired parameters to find the best tuning is an option but is also very difficult.
I also read this previously, which I find similar.
the compression algorithm you select for your data is quite dependent on the dataset you have. the equations in this blog post don't help you choose which compression to use, but rather "how much" and when to compress. I would be curious to formalize the math for different compression algorithms though... might be a good follow up post!
I was calculating timings and compression ratio for each array with each algorithm. Then I would save the “best” one to use for next chunks of data.
But it is hard to decide how to judge the cpu vs disk/network tradeoff like you explain in the article.
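The measure-and-pick heuristic described above can be sketched like this (the codec set, the `pick_codec` name, and the cost model `cpu_time + size / link_speed` are my assumptions, not the article's):

```python
import bz2
import lzma
import time
import zlib

def pick_codec(sample: bytes, link_bytes_per_sec: float) -> str:
    """Pick the codec minimizing estimated total time for a sample chunk:
    compression CPU time plus transfer time of the compressed output."""
    codecs = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}
    best_name, best_cost = "", float("inf")
    for name, fn in codecs.items():
        t0 = time.perf_counter()
        out = fn(sample)
        cpu = time.perf_counter() - t0
        cost = cpu + len(out) / link_bytes_per_sec
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name

data = b"some moderately repetitive payload " * 2000
fast_link = pick_codec(data, 1e9)  # 1 GB/s: cheap compression tends to win
slow_link = pick_codec(data, 1e5)  # 100 kB/s: ratio dominates the cost
```

The winner depends on the measured hardware, which is exactly the tuning difficulty described above.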
I was a bit curious whether I could make an API so that at the top level the user enters some parameters and the system adjusts this calculation accordingly.
But I had some issues with this, because the hardware budget is shared by all parts of the system, not only by the compression code.
As an example, the network is extremely fast inside a data center but can be slow and expensive when connecting to a user. The application can know which case it is in, but it is hard to connect that part of the code to the compression-selection logic cleanly.
Also, in the network case: it might make sense to keep data large but CPU time low until I hit the bandwidth limit, but once I hit the limit, nothing else matters.
Would be cool to have a mathematical framework to put some numbers in and be able to reason about the whole picture
Your SSL cert is invalid.
Unless the user just updated it, the current cert has been valid for the last 12 days. Maybe you're getting MITM'ed or don't have the root in your trust store?
Could be. I was in an educational environment at time of access.
Ruh roh.