Applying Zipf's Law To Adaptive Text Compression

Whether you are performing abstract processes on text, or transmitting it over a wire, at one point while dealing with large volumes of data you will run into the need for a robust compression algorithm.
I have been working on a natural language processing problem recently, as part of a larger artificial intelligence project, and I am storing raw blocks of text, sourced from the web, into a postgres database, before converting this text into various levels of detail.

I will go over these various levels in the coming posts on this project, but for this one we will focus exclusively on the compression.
What we are looking to achieve here is to compress a block of text to the smallest size possible, while still being able to perfectly reconstruct the full text whenever we need to.

At first I thought to just use any of the existing algorithms out there for compressing text, but the more the problem rolled around in my head, the more I was attracted to trying to solve the problem myself, and roll my own compression.

Then I remembered the video below...

If you have watched the video above, you might be able to tell where I am going with this. The idea here, in simple terms, is to order words by their frequency of use, and then substitute them with a smaller representation, in our case numbers and letters.
I will break it down below, and while we will run into some problems along the way, they are relatively easily solved.

In fact the first problem we run into is to get a list of words ordered by frequency of use. While there are many word lists out there that aim to make this kind of data available, none of them are perfectly complete, and there is just no such thing as a complete English word list in the first place, let alone ordered by frequency of use.
Besides that, I would like the algorithm to stay flexible enough to be able to compress more than just words, but possibly in the future entire sentences, common expressions,  and who knows what else.

To achieve all this, I am creating my own word list on the fly, as the blocks of text are coming in I separate out all the words, count the duplicates, and then store them in another database table using the following structure:

locale: Language key for this word
word: The actual word itself
frequency: Times this word has been seen
compression: Substitution key

When more text comes in,  we first look up each word in this database table to see if we have already seen it, and have an entry for it, and if so we add to the frequency count, and if not, we create a new entry for this word.
This way we end up with a word list, per language, that grows naturally, and is not limited to a static list sourced from possibly stale data.

Next we perform the actual compression.
I came up with the idea of  using a simple substitution table, where each word will have a smaller representation as it's compressed form. Consider the following example:

this is a sentence about a cat
Compressed form: 0,1,2,3,4,2,5


As you can see what is happening here is that each unique word will be represented by one single digit, which will therefor compress the byte-count down.
Obviously the first problem is that we run out of single digit number really fast, so to increase efficiency I opted to also use the letters of the alphabet, which now gives us 36 substitutions, right out of the box.
Have a look at a second example, any values here are still completely arbitrary:

this was a sentence after another sentence that came previously
Compressed form: 0,6,2,3,7,8,3,9,a,b

Here we start adding letters of the alphabet, once we run out of single digits we can use. For clarity I did build upon the previous example, using the same word to digit encoding as used there.
I hope it is now clearly visible how much this technique can compress a sentence down, just looking at the length of the original text, and the length of the encoded string we end up with.
In the case of the word sentence, we are going from an 8 byte string to a 1 byte.

Now we need to expand on the encoding algorithm, because there are far more words in the English language (ignoring the fact for now that this is really applicable to any language) than 36.
What needs to happen is once we run out of usable characters in our lookup table, we expand the encoding with another byte.
In this case consider the initial lookup table which was an array of single digits and letters. I will give an example of its generation in Ruby below:

@lookup_table = 0.upto(9).to_a + 'a'.upto('z').to_a

Once we run out of usable substitutions we simply start doubling up the bytes, so after z comes 0a, then 0b, and so forth.
This already gives us 36^2 = 1296 positions to work with, and of course once those are full we just add another byte, thus getting 36^3 = 46,656.
We continue on with this until we run out of encodable words and our work is done, we now have a compression algorithm that makes words as small as they can be with the initial parameters of this solution.

Or do we?

See, as we learned from the video, and Zipf's Law, some words are used way more often than others, and we have no way of telling in what order they will come into our word list table.
Our compression is not uniform, some words will be compressed to 1 byte, and some will be compressed with N-bytes.
It would make a lot of sense if the most used words are compressed with the lowest byte amount possible, as we will be using them a lot more.

This is where the update algorithm comes in, which iterates over the entire word list, and reorders it by the frequency field.
It will then re-compress all the words, based on this new order, and finally go through the Blocks table (that stores the raw and compressed text fields) to also update the compressed representation, based on this new encoding.
This is obviously a bit expensive to do, since it will have to update a lot of data, and meanwhile there can be no new data coming in, so the entire system is basically offline during this process, but it is currently a small price to pay for gaining a much stronger, and adaptive compression.
There are some small gains made by only updating the values that are actually changed, and specifically the rows in the table that contain these values.

Another great inherited benefit is that even the expense of restructuring the encoding is a finite expense, in as much that the more data is collected, and the more often the expensive restructuring is done, the less it will have to be done in the future.
Once the data organizes along Zipf's law, it is basically done, and we can reap the full benefits of this compression.

So what improvements can we make from here?
First of all we can go to a different counting system. As we have the 36 single digits and letters right now, we can easily expand this by another 26 already, including capital letters.
We can start adding even more to our set by adding various non-word characters, but a really interesting prospect comes from someone I can only identify as ferno, who came up with base65536, which allow us to extend out character set with all letters out of every possible language.

This was a little introduction into this problem, and I am finalizing my implementation at the moment, so I will bring you another update when I am ready.
In the mean time, let me know your thoughts, I am very interested to hear what I am doing right, and what I am doing wrong here.