Adventures in compressing integer sets
For my webfont subsetting app Glypht, I provide a list of predefined character subsets that the user can choose to include in the output fonts (they're the same ones that Google Fonts uses, taken from googlefonts/nam-files). These are sets like "Latin", "Latin Extended", "Cyrillic", "Simplified Chinese", etcetera, and they're just lists of Unicode codepoints.
Most of the time, these sets of codepoints are made up of a few contiguous ranges--for instance, the "Cyrillic" set contains a few control characters, followed by every codepoint from U+0400 ("CYRILLIC CAPITAL LETTER IE WITH GRAVE") to U+045F ("CYRILLIC SMALL LETTER DZHE"), and finally a few additional individual characters. Those are pretty easy to store.
However, for various reasons, the Chinese and Japanese subsets are not contiguous. They contain a mix of a few thousand smallish ranges and individual characters, which takes up a lot more space. I need to store four of these, for the "chinese-traditional", "chinese-simplified", "chinese-hongkong", and "japanese" subsets.
To make matters even worse, Google Fonts noticed that CJK fonts are quite large, and decided to subdivide each script even further into 120 different "slices". The idea is to cluster characters that are often used together into the same slices, and then create one font file per slice, reducing the total amount of webfont data that users have to download. These "slices" are generated by an automatic optimization process, are extremely non-contiguous, and therefore are even harder to compress!
I started out by just storing all the codepoints as JavaScript arrays. When trying to size-optimize the app, I noticed that they were a significant part of the bundle size and switched to a pretty simple method that used LEB128 and run-length encoding. That worked until I decided to include the CJK "slices", and things got much larger.
This made me curious about the best way to encode all that data, which sent me down a bit of a rabbit hole. It might not have been the best return on my time investment (my app also ships WebAssembly, which is much larger, relatively speaking), but I learned a lot along the way and got some interesting results.
As a disclaimer, I'm not an expert in the field of compression, and this post chronicles my first real forays into the field. If anybody is a compression expert, I'd love to hear from you! I've had a hard time finding good compression algorithms for highly-clustered data. As such, I might be missing some obvious algorithm that'd work great on my dataset.
The problem
In abstract terms, our goal is to compress a set of integers as much as possible. For compression purposes, it's useful to conceptualize it as an ascending list (which lets you do things like encode gaps or run lengths); or as a bitset, where 1 means the integer at that location is present and 0 means it's absent. I'll do both when explaining various compression methods.
Our input dataset (sets of codepoints by language, and "slices" of those sets) has certain characteristics that differ from a purely random dataset, and any good compression scheme should be able to handle those properties:
- Contiguous ranges. The non-CJK subsets consist largely of long runs of sequential integers.
- Large gaps. Most (all?) subsets include things like the space characters (U+0020, U+00A0) before moving on to the actual script-specific codepoints.
- Sometimes there are sparse, more random sections. The previous two bullet points might have made you think about using run-length encoding, but this is why it's not a slam dunk--for the CJK subsets, perhaps 1 in every 3 or 4 codepoints will be present.
Many compression schemes that work well on random data will perform worse, often significantly so, on data that's run-heavy.
The techniques
I implemented and tested a variety of approaches, some ad-hoc and some described in the literature I could find. Most integer-set compression algorithm research seems to be focused on compressing inverted indexes, which are a lot sparser and less clustered than the Unicode data I'm compressing. The paper Techniques for Inverted Index Compression gives a good summary of these algorithms. Many of them are also intended for online usage as data structures without any decompression step, whereas I care purely about compressed size.
LEB128 with ranges
This is what I started with, and it has a lot of overhead but compresses fairly well if put through something like gzip:
- Group the input into ranges.
- Encode each range as follows:
- Let n be the gap between the start of the range and the end of the previous range.
- Let m be `n << 1` if the range contains just one integer, and `(n << 1) | 1` if it contains multiple. That is, use the least-significant bit to encode whether the range contains just a single integer or multiple.
- Encode m using LEB128.
- If the range contains multiple integers, encode the range's length using LEB128.
It provides a good baseline, especially if compressed.
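Here's roughly what that encoder looks like, sketched in Rust (the function names are just for illustration, and my real encoder differs in small details, like how the very first gap is seeded):

```rust
/// Unsigned LEB128: 7 bits per byte, high bit set on all but the last byte.
fn leb128(mut value: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Encode sorted, non-overlapping, inclusive [start, end] ranges.
fn encode_ranges(ranges: &[(u32, u32)]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev_end = 0u32;
    for &(start, end) in ranges {
        let gap = start - prev_end;
        if start == end {
            // Lone value: least-significant bit is 0.
            leb128(gap << 1, &mut out);
        } else {
            // Multi-value range: least-significant bit is 1, then the length.
            leb128((gap << 1) | 1, &mut out);
            leb128(end - start, &mut out);
        }
        prev_end = end;
    }
    out
}
```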
Three data structures in a trenchcoat ("triply partitioned")
This was the second encoding I tried, and it turned out to be frustratingly hard to beat. It converts the input set of integers into ranges, and then clusters them into "sparse-ish" and "dense-ish" sections based on the length of each range, and the gap between it and the previous range. Sections with small gaps are stored as bitsets, and sections with large gaps are stored as LEB128-encoded ranges or lone values.
The complete encoding works like this: first we encode the number of lone values, then the number of ranges, and finally the number of bitsets. We then gap-encode all the lone values in order, followed by the ranges. Finally, we encode all the bitsets, again using gap encoding to encode where each bitset starts.
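Spelled out as a (purely illustrative) struct, the decoded layout looks something like this--the clustering heuristic and the way each bitset's length is stored are details I'm glossing over:

```rust
/// The three sections of the "triply partitioned" format. The serialized form
/// is just the three LEB128 counts, followed by each section in order.
struct TriplyPartitioned {
    /// Gap-encoded lone values, in ascending order.
    lone_values: Vec<u32>,
    /// Ranges: gap-encoded start plus LEB128-encoded length.
    ranges: Vec<(u32, u32)>,
    /// Dense sections: gap-encoded start position plus raw bitset bytes.
    bitsets: Vec<(u32, Vec<u8>)>,
}
```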
Honestly, this encoding might have been the reason I embarked on this endeavor in the first place. I implemented it directly in JavaScript at first, not expecting it to be great, but didn't feel like using it even when it did turn out to be quite small. It requires lots of temporary storage to encode, the decoded output isn't in any sort of order, and it requires implementing two and a half different encoding schemes. I then went off in search of some more "elegant" encoding.
Regular bitset + deflate
This is quite simple, if you assume that somebody else has implemented the Deflate algorithm for you. It's a "reference algorithm" in that I didn't expect to use it directly, but rather to compare other algorithms against it. It's actually close to optimal for data where each integer has an independent chance of being present or absent, and does very well on many of the synthetic datasets I tested on, but performs poorly on all my real-world datasets. It therefore serves as a good reminder that synthetic benchmarks can be deceptive.
Prefix codes
These aren't a good compression technique in and of themselves, but are useful building blocks in a compression scheme. A variable-length prefix code allows you to unambiguously encode numbers with differing bit lengths, with the idea being that more frequent numbers can be assigned smaller codewords. I only implemented some universal codes, which assume that smaller numbers are more frequent:
- Elias gamma coding: this one is the simplest, and it turns out it performs the best on most datasets. To encode a number n, just look at the exponent of its highest power of two (that is, `floor(log2(n))`), and write that many zero bits, followed by n in binary. (There's a small sketch of it just below.)
- Elias delta coding: this uses gamma coding as a building block, and represents larger integers using fewer bits at the expense of smaller integers. This turns out to not be very useful, since most of our integers are small.
- Golomb-Rice coding: divide n into its low bits and high bits. Exactly where you divide it is a tunable parameter. Encode the value of the high bits as a unary number, followed by the low bits. This is suboptimal for our data because it assumes numbers are geometrically distributed, with lower ones being exponentially more common. For our Unicode data, the distribution is more bimodal: we might have lots of small runs, perhaps with small gaps in between, punctuated by extremely long gaps or runs. In my testing, it performed very badly no matter how I tuned it, so I opted not to pursue it further.
I tried implementing a couple more as well, but none of them could beat Elias gamma coding.
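Since gamma coding shows up everywhere below, here's a minimal encoder and decoder, operating on a vector of individual bits for simplicity (a real implementation would pack them into bytes):

```rust
/// Elias gamma: floor(log2(n)) zero bits, then n itself, most significant bit
/// first. Only handles n >= 1, so zero-based values get encoded as `value + 1`.
fn gamma_encode(n: u32, out: &mut Vec<bool>) {
    assert!(n >= 1);
    let width = 32 - n.leading_zeros(); // floor(log2(n)) + 1
    out.extend(std::iter::repeat(false).take((width - 1) as usize));
    for i in (0..width).rev() {
        out.push(((n >> i) & 1) == 1);
    }
}

fn gamma_decode(bits: &[bool], pos: &mut usize) -> u32 {
    // The number of leading zeros tells us how many value bits follow.
    let mut zeros = 0;
    while !bits[*pos] {
        zeros += 1;
        *pos += 1;
    }
    let mut n = 0u32;
    for _ in 0..=zeros {
        n = (n << 1) | (bits[*pos] as u32);
        *pos += 1;
    }
    n
}
```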
Gap coding
Instead of storing each integer directly, we store the gaps between successive integers. This is actually quite good on sparser data, but falls over on long runs of successive integers. This is a pretty elementary technique, but we can pair it with a universal code to take advantage of their "lower numbers are more common" property. For my benchmarks, I'll be using the Elias gamma code since it gets the best result.
This is often referred to as "delta encoding", but I refer to it as gap coding to avoid confusion with Elias delta coding as mentioned above.
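As a sketch, gap coding is just this on top of the gamma coder above (how the first value and the sequence length are signalled is a detail I'm hand-waving here):

```rust
/// Gamma-code the distance from each value to its predecessor.
fn gap_encode(values: &[u32], out: &mut Vec<bool>) {
    let mut prev = None;
    for &v in values {
        let gap = match prev {
            Some(p) => v - p, // successive values differ by at least 1
            None => v + 1,    // the first value could be 0, so bias it
        };
        gamma_encode(gap, out);
        prev = Some(v);
    }
}
```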
Run-length encoding
Instead of storing the gaps between successive integers, we treat the input as a series of alternating runs of absences and presences, and encode the run lengths. This has the opposite tradeoff to gap coding: on dense data, we only need to encode a single run length, but on sparser data where most runs are of length 1, we waste space encoding every length. As with gap coding, this is typically paired with a prefix code.
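The run-length version is just as short. In this sketch, every run length is biased by one so the (possibly empty) leading run of absences still encodes; a real encoder would only need that for the first run:

```rust
/// Gamma-code the lengths of alternating runs, starting with a run of absences.
fn rle_encode(bits: &[bool], out: &mut Vec<bool>) {
    let mut current = false;
    let mut run = 0u32;
    for &bit in bits {
        if bit == current {
            run += 1;
        } else {
            gamma_encode(run + 1, out);
            current = bit;
            run = 1;
        }
    }
    gamma_encode(run + 1, out); // final run
}
```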
"Flip coding"
I haven't seen this technique described anywhere before, and I'm not sure if I'm missing something, because it seems to provide very good compression performance while being relatively simple to implement. It behaves like gap coding, but when two successive integers are present (that is, after a gap of 0), we switch to encoding the gaps between absences. After a gap of 0 between absences (two successive absent integers), we switch back, and so on. This provides a natural hybrid between run-length encoding and gap coding.
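Here's a simplified sketch of the idea--not my actual implementation; it skips the prefix-coding step and assumes the decoder is told the bit-vector length and the number of gaps up front:

```rust
/// Emit gaps, flipping between "looking for present bits" and "looking for
/// absent bits" whenever a gap of 0 occurs. Each gap (plus one) would then be
/// written with the gamma coder above.
fn flip_encode(bits: &[bool]) -> Vec<usize> {
    let mut gaps = Vec::new();
    let mut target = true; // start by looking for present integers
    let mut pos = 0;
    while let Some(i) = (pos..bits.len()).find(|&i| bits[i] == target) {
        let gap = i - pos;
        gaps.push(gap);
        pos = i + 1;
        if gap == 0 {
            // Two successive `target` bits: switch to tracking the other symbol.
            target = !target;
        }
    }
    gaps
}

fn flip_decode(gaps: &[usize], len: usize) -> Vec<bool> {
    let mut bits = Vec::with_capacity(len);
    let mut target = true;
    for &gap in gaps {
        // `gap` bits of the opposite symbol, then a single `target` bit.
        bits.extend(std::iter::repeat(!target).take(gap));
        bits.push(target);
        if gap == 0 {
            target = !target;
        }
    }
    // The encoder stopped because no `target` bits remained.
    bits.resize(len, !target);
    bits
}
```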
It has the nice advantage that it can also be implemented in terms of run lengths, although not straightforwardly--I almost gave up on it because my original implementation was incorrect and its output was larger than the naive implementation's. There's some finesse involved in mapping the states.
If you're experienced in the field of data compression, please let me know: is this a known approach? Intuitively, it makes sense why it would work so well; we get to avoid the redundancy of run-length encoding in sparse data without blowing up in very dense data. Still, I'm not sure if its improved performance is inherent to it being a good fit for the data, or an artifact of the specific way I'm using it (like, could it be compensating for my prefix code not matching the actual data distribution, and shifting the distribution around to better fit it?)
Elias-Fano encoding
(Not to be confused with Shannon-Fano-Elias coding, which is what comes up if you search for this on Google. Those guys are the Euler and Gauss of compression.)
This idea was described in Quasi-Succinct Indices, and kept coming up when I looked for ways to encode an ascending set of integers. However, it turned out to not quite be what I was looking for. It's intended to be used as an in-place data structure without any decompression step, and sacrifices some space to do so.
It's a bit like Golomb-Rice coding--you divide each integer into high and low bits. The low bits are packed together in a bit vector, and the gaps between the high bits (not the high bits directly this time) are unary-encoded and stored in another bit vector.
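A rough sketch of the encoding side, returning the two bit vectors separately (a real implementation would also pick the low-bit width, typically around log2(universe / count), and pack everything tightly):

```rust
/// Elias-Fano over a sorted sequence: fixed-width low bits, plus unary-coded
/// gaps between successive high parts.
fn elias_fano(values: &[u32], low_bits: u32) -> (Vec<bool>, Vec<bool>) {
    let (mut lows, mut highs) = (Vec::new(), Vec::new());
    let mut prev_high = 0;
    for &v in values {
        for i in (0..low_bits).rev() {
            lows.push(((v >> i) & 1) == 1);
        }
        let high = v >> low_bits;
        for _ in 0..(high - prev_high) {
            highs.push(false); // one zero per skipped high value
        }
        highs.push(true); // terminator for this element
        prev_high = high;
    }
    (lows, highs)
}
```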
This isn't a very good encoding for our purposes because it can't take advantage of the data distribution at all; any runs or gaps are ignored. There's a hard floor of 2 bits per element. It's pretty good for very-sparse-yet-random data, but falls over when representing my Unicode-related data.
Binary interpolative coding
Hallelujah, finally a compression method with some theoretical underpinnings that works well on clustered data! This method operates recursively. We start by encoding the sequence length and its maximum value using some other method (like a different type of prefix code).
We then enter the recursive step of the algorithm: given an integer sequence and a lower and upper bound, we encode the middle value of the sequence, then recurse into the sequence's left and right halves. We can encode the middle value using fewer bits because we know it's within a certain range. For our initial step, the lower and upper bound are 0 and the maximum value minus 1 respectively. We can restrict the range even further because we know the length of the sequence and the fact that we're encoding the middle value, so there must be some (len - 1) / 2 integers below it and (len - 1) / 2 integers above it. After we encode the middle value of the sequence, we can then encode the left and right halves with tighter bounds, and repeat the process until the entire sequence is encoded.
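Here's a sketch of the recursive step. For simplicity it writes each bounded value with a plain ceil(log2(range))-bit code rather than the leftmost/centered minimal codes discussed below, and it assumes the values are sorted, strictly increasing, and within the initial bounds:

```rust
/// Write `value`, known to lie in [lo, hi], using ceil(log2(hi - lo + 1)) bits.
fn write_bounded(value: u32, lo: u32, hi: u32, out: &mut Vec<bool>) {
    let range = hi - lo + 1;
    if range <= 1 {
        return; // only one possible value: no bits needed
    }
    let width = 32 - (range - 1).leading_zeros();
    for i in (0..width).rev() {
        out.push((((value - lo) >> i) & 1) == 1);
    }
}

/// Encode the middle value with tightened bounds, then recurse into each half.
/// The caller first encodes the sequence length and an upper bound some other
/// way, then calls `interpolative(&values, 0, upper_bound, &mut out)`.
fn interpolative(values: &[u32], lo: u32, hi: u32, out: &mut Vec<bool>) {
    if values.is_empty() {
        return;
    }
    let m = values.len() / 2;
    let v = values[m];
    // There are m values below v and (len - 1 - m) above it, so v can't be
    // arbitrarily close to either bound.
    write_bounded(v, lo + m as u32, hi - (values.len() - 1 - m) as u32, out);
    if m > 0 {
        interpolative(&values[..m], lo, v - 1, out);
    }
    if m + 1 < values.len() {
        interpolative(&values[m + 1..], v + 1, hi, out);
    }
}
```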
My implementation is based on this easy-to-read C++ implementation and the accompanying technical report. I also implemented a sequential version described in Binary Interpolative Coding Revisited, which does a great job explaining the algorithm in-depth. The sequential version seems to have slightly different space characteristics--the paper implies it should be strictly better, and it is for real-world data, but it sometimes performs worse on synthetic data.
There's also some variation based on the specific binary encoding you use. When encoding a number with a known-restricted range, some numbers might require one more bit to encode than others if that range isn't a power-of-two size. You can choose which numbers get the more compact encoding--the two methods described in the C++ implementation are to give the compact encoding to the leftmost items in the range, or to the centered items.
The combination of sequential vs. recursive and leftmost vs. centered encoding does result in slightly different compression characteristics. For the pretty graphs, I chose sequential encoding with leftmost encoding since that performed the best on real-world data.
The datasets
My original dataset, and the one I'll end up compressing for real, is the nam-files list. However, I decided to also test the algorithms on various other real-world and synthetic data; the former to see where each algorithm does shine, and the latter to get some pretty graphs. My datasets are:
- The main `nam-files` list, not including "slices". As previously mentioned, most of these consist of a few contiguous ranges, and only the CJK scripts' codepoint sets are more complex.
- The `nam-files` "slices". These have different compression characteristics from the main `nam-files`; they still have some gaps and jumps but are much sparser.
- Select properties from the Unicode Character Database. I specifically took the "binary"- and "enum"-type properties and turned them into codepoint sets. I thought these would be pretty analogous to the `nam-files` data in terms of composition, but the results were different still.
- Language membership data for the Google Fonts library. This is another thing I needed to compress in Glypht, so I decided to see how the fancy compression methods did on it. Basically, every font (there are around 2000) supports some set of languages (there are also around 2000 in total), which are mapped to numeric IDs. The interesting part here is that I control the numeric mapping, so I've sorted the languages by font support--the most frequently supported languages are all at the front, the rarely-supported or entirely unsupported ones are at the back, and the entropy should all be concentrated towards the middle of the range.
- A couple text files (an alphabetical wordlist, and the Project Gutenberg version of Dracula), preprocessed using the Burrows-Wheeler transform and MtF transform^H^H^H^Hmove-to-front transform. Their descriptions are a bit out-of-scope for this post, but the idea is that they reversibly transform text into integer sequences that should have compressible run-length characteristics. I stole this idea from Binary Interpolative Coding Revisited.
And the synthetic ones, which are pretty ad-hoc:
- Completely uniformly random integer sets; each integer has a uniform probability of being present or absent. We can plot a nice graph by varying the probability.
- Autocorrelated integer sets; each integer randomly chooses between a uniform probability of being present, or copying the previous integer's presence or absence. We now have two parameters we can tune: the presence probability (or the "density" of the set), and the probability to copy the previous integer. As we increase the latter, the data becomes more clustered. (There's a sketch of this generator after this list.)
- Geometrically-distributed run lengths; we generate runs of integers where the run length follows a geometric distribution. We can again vary both the density and average run length.
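For the curious, here's roughly what the autocorrelated generator looks like (sketched against the rand crate's 0.8-style API; my actual generator differs in the details):

```rust
use rand::Rng;

/// Each position either copies the previous position's presence/absence (with
/// probability `copy_prob`) or gets a fresh draw with probability `density`.
fn autocorrelated(len: usize, density: f64, copy_prob: f64) -> Vec<bool> {
    let mut rng = rand::thread_rng();
    let mut bits: Vec<bool> = Vec::with_capacity(len);
    for _ in 0..len {
        let bit = match bits.last() {
            Some(&prev) if rng.gen_bool(copy_prob) => prev,
            _ => rng.gen_bool(density),
        };
        bits.push(bit);
    }
    bits
}
```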
The results
Finally, we get to the good part. Here's how the algorithms did on various datasets.
nam-files main list
My original LEB128 + ranges approach is surprisingly not terrible, being "only" about twice as large as the best performers. The run-length-aware encodings do a lot better here; note how plain gap coding takes up 20KB vs. the run-length encoding's 14KB. Even though Elias gamma coding encodes a gap of 1 using a single bit, there are just so many sequential integers.
"Flip coding" just ekes out a win here; it's a bit worse at encoding CJK scripts than gap coding but better than run-length encoding, and it doesn't have gap coding's worst-case behavior.
The "triple partitioned" strategy of bitsets + single values + ranges does really well, especially considering it uses LEB128 for everything, and even outperforms binary interpolative coding!
Just using zlib on a bitset loses out by a lot; it's entirely possible to beat general-purpose compression algorithms, especially if you know a lot about the structure of your data. This will be a common theme--zlib on a bitset does well on synthetic data, but not so much on any real-world data.
Elias-Fano isn't on the chart because it uses like 70KB and dwarfs all the other algorithms.
nam-files "slices"
Now this is interesting! Binary interpolative coding wins here by a small but significant margin. The other approaches all seem to hit a wall at 51-53KB at best, and binary interpolative coding is the only one to shave that down to 47KB. My "flip coding" approach is still the second best, though.
This is also one of the few cases where Elias-Fano encoding isn't terrible, and there's less of a margin between LEB128 + ranges and the best approaches (it's ~1.5x worse, instead of ~2x like the nam-files main list). All the methods seem to perform pretty similarly here.
Unicode Character Database properties
Gap coding and Elias-Fano encoding are both omitted. There are a lot of very large ranges in this dataset, and both of those approaches fall over hard. Gap coding takes 360KB and Elias-Fano encoding takes a whopping 1.2MB!
With those two out of the picture, things become a lot closer. Gamma-coded run-length encoding is the leader, with "flip coding" not far behind, and the three-data-structures-in-a-trench-coat approach being annoyingly good as usual.
There are a couple interesting things here: LEB128 + ranges is competitive, and binary interpolative coding actually performs quite poorly compared to the leaders! I think it tends to trail slightly once the clusters get really large.
Font language membership
Once again, gap coding and Elias-Fano encoding are omitted because they don't like this dataset at all. Remember, I can control the numeric mapping here, so there's probably going to be a long run of present integers (ones) at the beginning, a long run of absent integers (zeroes) at the end, and some randomness mainly in the middle. Approaches that perform poorly on clustered data won't do well here.
While the datasets are quite different, the results are surprisingly similar to the UCD ones. Gap coding and Elias-Fano encoding are terrible, the "run-lengthy" approaches do well, and binary interpolative coding falls behind.
Preprocessed text
Now this is more like what you'd expect to see from a compression benchmark: the actual compression algorithms doing well, and the ad-hoc methods (the LEB128 + ranges and "triple partitioned" approach) falling behind.
For comparison's sake, the atebits wordlist is 2.8MB uncompressed, and compresses to ~800KB in a .zip, ~980KB in a .bz2, and ~600KB in a .xz (LZMA) archive. Dracula is 890KB, and compresses to ~300KB in a .zip, ~245KB in a .bz2, and ~270KB in a .xz archive.
Maybe these datasets are a bit too interesting. I certainly wasn't expecting bzip2 to lose to Deflate when compressing the wordlist, or beat LZMA when compressing Dracula. It does pretty closely match our results though, which makes sense because it uses the same preprocessing transforms. There are a few caveats when trying to compare our results to bzip2: I'm processing each file in one go instead of splitting it into blocks, I don't encode the end index of the Burrows-Wheeler transform which is necessary for decoding, and I haven't actually tested decoding.
Some more interesting things here:
- Even LEB128, while being the worst encoding here by far, is smaller than the original data!
- By including the zlib-deflated bitset in the comparison here, we're constructing a general-purpose data compressor using only the Burrows-Wheeler transform, the move-to-front transform, and... a general-purpose data compressor.
  - It substantially outperforms or loses to the regular Deflate algorithm, depending on the dataset.
- The only part here that really differs from bzip2 is the way we encode the preprocessed numbers. bzip2 uses Huffman coding, and we're essentially testing alternative encoding methods.
- Despite not being designed for this type of data at all, "flip coding" outperforms gap coding and run-length encoding.
- Binary interpolative coding, the best performer, gets within several percent of bzip2's actual compression ratio.
That's all for real-world data. Now onto the synthetic data with accompanying pretty line graphs!
Uniform random
Since we're testing on synthetic data, we can now calculate the Shannon entropy of the data and use it as a lower bound!
The graph is a bit crowded, which isn't helped by Elias-Fano and LEB128 hogging half of the Y-axis. Elias-Fano is actually quite good at very low densities; it just falls over quickly as the density increases. Here's the same graph without those two, so you can see some of the subtleties (OK, maybe it's still crowded):
The two contenders closest to the lower bound, in this case the binary entropy function, are binary interpolative coding and a plain zlib-compressed bitset. The former does better when the density is farther from 50/50, and the latter does better as the density approaches 50/50 and we're trying to compress purely random data.
Flip coding, the green line, starts out tracking gap coding, the blue line. As the density increases, it does slightly worse and veers upwards to match run-length encoding; it really does seem to be a hybrid of the two. As we cross the halfway mark, gap coding falls behind since it has to encode all those ones. It still does decrease towards the end (I think it's because the encoding for 1 is smaller than the encoding for 2 or 3, so as we get more gaps of length 1, it does slightly less badly).
The triple-partitioned method starts to trace out the very steep arc of what would be LEB128-encoded ranges, but it's cut off as the bitset takes over. This cutoff can actually be tuned by changing the gap length threshold for using a bitset.
Increasing autocorrelation (10% density)
Here we can clearly see the problem with Elias-Fano encoding: even as the autocorrelation increases, and the entropy goes down with it, the encoded size remains constant. It doesn't adapt to the entropy of the input at all.
Tracing the pink line for binary interpolative coding, we can see more evidence for our earlier hypothesis that it gets worse as the clusters get larger. At the left end of the graph where the data is uniformly random, it's the best performer, but it's overtaken by gap coding and then the rest of the pack shortly after.
Gap coding itself starts off pretty strong, but starts losing out at the very end once the clusters become very large. All the run-length-based approaches really start to cluster together near the bottom, even the triple-partitioned and LEB128 approaches that start off badly. Notice once again how flip coding starts off matching gap coding, and slowly moves towards the performance of run-length encoding.
Increasing autocorrelation (50% density)
Since the left edge of the graph is pure uniform randomness, LEB128 + ranges is doing terribly. I would've expected it to be 4x the optimal size, though, rather than 3x--it takes at least 8 bits to encode a single integer, and the dataset is 50% present integers, so that should work out to 4 output bits per input bit. I think the runs that are present due to random chance are helping here.
Let's remove the poorly-performing encodings and focus on the good ones:
As we saw on the uniformly-random graph, the bitset-based approaches are the only ones that don't waste space trying to encode purely-random data on the left side of the graph. The zlib-encoded bitset is actually the best overall performer here, only being slightly overtaken near the end of the graph.
Gap coding is starting to fall behind now. Flip coding is dragged towards it a bit, but manages to recover. Binary interpolative coding continues its pattern of doing poorly on highly-clustered data while not falling over.
Increasing autocorrelation (90% density)
For the encodings that treat ones and zeroes (presences and absences) symmetrically, this is just the 10% density graph again. LEB128 + ranges and the "triple partitioned" approach aren't quite symmetrical--they can encode individual presences but not individual absences--but they're close.
Gap coding completely fails to take advantage of sequential runs of ones, which means flip coding actually does quite well all around. It's not really the best performer at any point, but it's pretty good as the autocorrelation increases.
Let's now look at the other clustered synthetic dataset.
Geometrically-distributed run lengths (10% density)
Note that the average run length parameter increases exponentially here; I mainly did that to make the graph more ~~interesting to look at~~ informative.
As we look to the right of the graph, we can clearly see the run-length-based approaches (and zlib, I guess) really group together near the bottom. Gap coding and binary interpolative coding both start out strong but quickly lag behind.
Geometrically-distributed run lengths (50% density)
As before, LEB128 + ranges and Elias-Fano both balloon in size at 50% density, so I've removed them from the graph.
Very interestingly, the "triple partitioned" approach leads by a visible margin on the right of the graph, outperforming gamma-coded run lengths! Maybe that's when the runs become long enough that LEB128 can encode them more efficiently than gamma coding?
Geometrically-distributed run lengths (90% density)
And finally, this is only here for completeness' sake. You can tell that gap coding "hits a wall" where it's just not possible to encode using less than 1 bit per present integer.
Making sense of it all
OK, so those graphs were pretty fun to look at. What have we learned?
Both of the techniques I learned about from inverted index compression, binary interpolative coding and Elias-Fano encoding, perform best on sparser data that's less clustered. Elias-Fano doesn't take advantage of clustering at all, and failed badly on some of the datasets. Binary interpolative coding was specifically mentioned in the literature as being good on clustered data, and it never fell over, but it still lost to run-length encoding in real-world and synthetic tests. This makes sense, since I'd imagine inverted indices look very different from Unicode codepoint ranges and are probably a lot sparser.
If you only looked at the synthetic benchmarks, you might be tempted to say "just use zlib, compression is a solved problem". The real-world benchmarks tell a very different story, though! Using zlib to compress a bitset gave mediocre results on the type of data I actually need to compress.[1] Sure, you can try to preprocess your data with things like run-length encoding and variable-length integers, but at that point you've started rolling your own compression and might as well commit to it.
Heavy theoretical knowledge isn't required to get good results, but it is required to understand why you're getting good results. My "flip coding" approach turned out to be the best and simplest overall, and it's what I ended up implementing. It makes intuitive sense why it works--that's why I thought to try it in the first place--but I still don't fully understand why it outperforms gap coding and run-length encoding on things like the BWT+MTF text data.
I wasn't focused specifically on code size or performance, but I did have a bias towards "elegant" algorithms. I wanted to avoid ones that switch between multiple different techniques and require implementing all of them, require significant temporary storage, or require postprocessing of the data. Flip coding is nice because the algorithm can operate on run lengths and ranges, and it outputs a naturally-sorted set of coalesced ranges. I got lucky that Elias gamma coding was the smallest, because it's also the simplest to implement. The algorithm can be implemented in JavaScript relatively succinctly--the decompressor is ~400 bytes minified.
I'd also like to know if there are succinct data structures for integer sets that are very "range-y". For my application, I need to pass every single range over an API boundary anyway, so I can't make use of any in-place data structures, but it seems like a good area of exploration. Existing data structures for this purpose make an effort to avoid overhead, but they're not succinct. The new Incremental Font Transfer standard uses a tree of bit sets, and HarfBuzz and Google Fonts' read-fonts crate use a paged bit set. ICU appears to use something similar, though I haven't looked much into it.
Things I haven't tried
I came across a few techniques that may or may not be worth trying:
- Partitioned Elias-Fano indexes: This paper notes that Elias-Fano encoding doesn't exploit any underlying order in the sequence it encodes (such as long runs of successive present or absent integers), and proposes splitting up the data into smaller chunks and using different encodings for sparse and dense chunks. The code used in that paper looks a bit gnarly, and is split into a couple dozen different files plus three different Git submodules. I was too intimidated to try implementing this.
- Optimized partitioned binary interpolative coding: This claims to improve binary interpolative coding's performance for gap-heavy datasets, which is exactly what we're after. Unfortunately, I'd have to pay for the paper.
- Adaptive run-length/Golomb-Rice encoding: This algorithm appears to be covered by a patent that will expire next year. It's focused on encoding streams of integers that follow a generalized Gaussian distribution (most of the integers are zeroes, a few are nonzero), and I'm not sure if it's really adaptable for encoding a set of integers.
Conclusion
I came into this expecting it to be a solved problem, but if it is, then I certainly haven't found the solution. I expected to end up using a technique I discovered from previous literature, but "flip coding" turned out to perform the best overall, even on datasets that are nothing like what I intend to compress. Surely it has to have been described somewhere before. Still, I'm glad that such a simple encoding gives such good results.
As I mentioned above, I turned it into a JavaScript library, and I'll be using it to encode Unicode data in future versions of Glypht. I haven't benchmarked it against other compression techniques, but my focus is mainly on code size anyway.
I've also published the Rust code I used for this analysis, if you want to reproduce these results or look at other compression methods.
[1]: I also tried compressing arrays of integers, and it was easily the worst method.