Integer Compression

It never occurred to me that there would be a way to compress an integer...

I recently learned about three methods of encoding integer values:

  1. Unary
  2. γ (Elias gamma coding)
  3. δ (Elias delta coding)

This sort of blew my mind because I never thought about trying to take a basic, primitive type (such as an integer) and store it using less disk space.

In a simple example, let’s assume a basic signed integer in C which takes 16 bits. If we were to save the number 3, we would be storing the following bit sequence:

00000000 00000011

BUT, we can apply one of the methods above and actually use less than 16 bits. Let’s take a look at each method.

(It’s important to note that in all scenarios in this post we will be using log base 2)

Unary encoding

Unary encoding is by far the simplest. For an integer \(x\) where \(x \geq 1\), the unary encoding is \(x-1\) 1s, followed by a 0.

For example, the number 3 will be two 1s, followed by a 0:

110

And the number 9 will be eight 1s, followed by a 0:

111111110

For small numbers this is great - we’ve saved 13 bits and 7 bits respectively in these examples. However, you’ll likely notice that as the number grows larger, we end up taking more space than we would have by just storing the integer itself.

γ-encoding

For γ-encoding we need to start doing some math.

We start by finding the unary code for \(1+\lfloor log x \rfloor\). So, for the number 3 we end up with:

\(1+\lfloor log 3 \rfloor = 2\), which in unary code becomes:

10

Next, we find the uniform (binary) code for \(x-2^{\lfloor log x \rfloor}\), but represented in \(\lfloor log x \rfloor\) bits:

\(3-2^{\lfloor log 3 \rfloor} = 3-2^1 = 1\), which when represented in \(\lfloor log 3 \rfloor = 1\) bit is:

1

Piecing the two parts together (unary part first, then the uniform), we get:

101

Definitely a savings compared to storing the raw value of 3 in 16 bits, but there is no gain over the unary encoding. What about the number 9?

\(1+\lfloor log 9 \rfloor = 4\), which in unary code becomes:

1110

\(9-2^{\lfloor log 9 \rfloor} = 9-2^3 = 1\), represented in \(\lfloor log 9 \rfloor = 3\) bits is:

001

Piecing the two parts together (unary part first, then the uniform), we get:

1110001

A 2-bit savings over the unary encoding! As the numbers grow, so do the savings.

For example, 15 in unary-encoding:

111111111111110

…and in γ-encoding:

1110111

δ-encoding

The δ encoding is more complicated. It is nearly identical to γ-encoding, however instead of starting with the unary encoding of \(1+\lfloor log x \rfloor\), we start with the γ-encoding of \(1+\lfloor log x \rfloor\). This can get tricky to keep track of.

Let’s use the number 3 again for our first example. We start with the γ-encoding of \(1+\lfloor log x \rfloor\):

1+⌊log 3⌋ = 1+1 = 2, so we find the γ-encoding of the number 2. Following the steps in the section above, you will see that the γ-encoding is:

100

The rest is identical to the γ-encoding. We find the uniform (binary) code for \(x-2^{\lfloor log x \rfloor}\), but represented in \(\lfloor log x \rfloor\) bits:

\(3-2^{\lfloor log 3 \rfloor} = 1\), represented in \(\lfloor log 3\rfloor = 1\) bit is simply:

1

Piecing the two parts together (γ-encoding first, then the uniform):

1001

Yikes! This took 4 bits instead of the 3 bits used by the γ-encoding! What about a larger number like 125?

We start with the γ-encoding of \(1 + \lfloor log x \rfloor\):

\(1 + \lfloor log 125 \rfloor = 7\), represented in γ-encoding is:

11011

Next we find the uniform (binary) code for \(x - 2^{\lfloor log x \rfloor}\), but represented in \(\lfloor log x \rfloor\) bits:

\(125-2^{\lfloor log 125 \rfloor} = 125 - 2^6 = 61\), represented in binary in \(\lfloor log 125 \rfloor = 6\) bits is:

111101

Piecing the two parts together (γ-encoding first, then the uniform):

11011111101

On the other hand, the γ-encoding for 125 is:

1111110111101

A 2-bit savings!

Summary

Long story short, I though these were some pretty cool methods for encoding integers. I won’t write about it here, but you can just as simply decode these bit strings. Actually, I won’t lie, it’s not quite as simple.

Here is a link to a GitHub repo where I’ve put together a couple Python scripts to perform both the compression and decompression techniques:

Some rights reserved

Up Next

Installing ESXi with a USB NIC

Creating a custom ISO for ESXi 7 with a USB NIC

Review of the LinkedIn URL Detector

Hydrogen

A plugin for the Eclipse IDE to configure and launch one or more H2 database servers

Related