Ts_zip: Text Compression Using Large Language Models (bellard.org)
262 points by Deeg9rie9usi on Aug 16, 2023 | 141 comments


The reason this works is that LLMs are trained to minimize the empirical risk over a large set of data (https://en.wikipedia.org/wiki/Empirical_risk_minimization), which is where the log-likelihood-based measure comes from, often reported simply as the perplexity.

This is precisely because LLMs learn a representation of the nth-order Markov chain, which by its nature is extraordinarily sparse. This allows for excellent compression, and I would assume (and have for a while, I suppose) that the nearly-linear nature of LLMs allows for excellent interpolation between these very sparse states.

This lets us 'break' the hard combinatorial problem into one that can be solved softly, albeit with the fuzzy error that comes with any estimator.

Since computers cannot tractably compress Markov chains past a certain point, this gives LLMs some opportunity to outpace traditional methods for compression, at least in theory.

Also, this method technically cheats since you need the language model to decompress, which counts towards your dictionary size. In a truly unbounded case, I would be interested to see which method wins. I'm assuming it's the LLM.

Apologies for any factual inaccuracies or such, I am still learning many of these things. Many thanks and much love! <3 :))))


I would note you can externalize your dictionary in a lot of schemes. For tight protocol-level encoding in my custom protocols, I often build the dictionary as part of compilation and deploy it compiled into both sides. It saves a lot of overhead on small-message protocols. In this case the dictionary is probably absurdly large, but I think we are at a point where new compression techniques are going to have weird and unexpected trade-offs.


Yes, in this case I'm counting the mathematical definition for a decoder that's required when compressing large corpii of text.


I don't know whether this level of accuracy is important to you or not (if it isn't, please excuse the correction), but the plural of "corpus" is "corpora".


Oh my gosh, thank you so much. <3 :)


If you want to learn something cool and related, check out autoencoders (or VAEs). They effectively compress information by forming representations of some data.
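
A minimal PyTorch sketch of the idea (layer sizes and data are made up); the narrow bottleneck layer is the compressed representation:

    import torch
    import torch.nn as nn

    class TinyAutoencoder(nn.Module):
        def __init__(self, in_dim=784, code_dim=32):
            super().__init__()
            # the encoder squeezes the input down to a small code...
            self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                         nn.Linear(128, code_dim))
            # ...and the decoder tries to reconstruct the input from it
            self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                         nn.Linear(128, in_dim))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = TinyAutoencoder()
    x = torch.rand(16, 784)                     # a fake batch
    loss = nn.functional.mse_loss(model(x), x)  # reconstruction error to minimize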


Indeed. They are extremely cool! <3 :) I think in a way, every neural network works on compression! Though AEs & VAEs are a bit more blatant in how they do it. It's just always turned around a little, here and there. I have a pet theory that no neural network can work without compression (even generative ones that have constantly increasing layer depths! :D :)))) )


The link between compression and intelligence is a popular theory. Indeed the linked work here came out of the Hutter Prize:

> The goal of the Hutter Prize is to encourage research in artificial intelligence (AI). The organizers believe that text compression and AI are equivalent problems. Hutter proved that the optimal behavior of a goal-seeking agent in an unknown but computable environment is to guess at each step that the environment is probably controlled by one of the shortest programs consistent with all interaction so far.

https://en.wikipedia.org/wiki/Hutter_Prize


NNCP (Bellard's prelude to ts_zip, using similar techniques) is not qualified for the Hutter Prize, btw, because of the hardware and speed limitations specified by the Hutter Prize.

"Must run in ≲50 hours using a single CPU core and <10GB RAM and <100GB HDD on our test machine." Which is an Intel Core i7-620M


I agree that it is most certainly necessary to have compression to have intelligence, after all, it is the bridge from empirical examples to a learned policy. Oftentimes the subtle switch comes when people say that compression IS intelligence.

Similar to how metabolism in many ways is required for life, yet metabolism itself isn't life.

One of the challenges is that the MDL (Minimum Description Length -- https://en.wikipedia.org/wiki/Minimum_description_length) is intractable to prove directly; we can only prove that we are a bit closer to it than we were before. This of course becomes even more difficult in the temporal regime, as proving anything about an iterative mapping (i.e. a decision-making algorithm or what have you, in this case) applied over each time slice of a temporal system is nigh-impossible. We can tell _something_ about it because of the attractors generated by such a system, but even then, I think it's basically impossible to do in a closed-form manner.

That being said, I do believe compression is required for intelligence, and that deliciously drags in all of the info theory stuff, which is very fun indeed. Just seems like it gets really messy with that time component, but that's just my 2 cents at least. :'(

I didn't know that was related to the Hutter Prize. Very cool. I'd read a little bit about that prize before, and I'll take another look at it now. Probably won't start any work on anything because I really, really, really do not want to Collatz Conjecture myself again.


In a very loose sense, even plain linear regression is a form of lossy compression in the form of an orthogonal projection.


Oh, I appreciate this point, thanks for making it! I like it a lot. <3 :))))

I think, perhaps, if inputs > outputs and there is some dimensionality reduction (though I think orthogonality would be a trait of an ideal system [i.e. an emergent property approached in the limit], not one that is explicitly enforced each step).


I guess they've worked out to always get exactly the same text back? Hopefully this won't be a repeat of the photocopiers that used a similar compression technique and occasionally substituted letters in the copy.

https://www.bbc.com/news/technology-23588202


The author, Fabrice Bellard, is well-known for his work on QEMU (among other things), but I'd treat the rest of the posts as "idea sharing" at an early stage. For example, while BPG (Better Portable Graphics) didn't take off, the core idea (image formats based on video codecs) can be found in WebP (VP8, VP9) and AVIF (AV1) and HEIF (HEVC/H.265).

For this, I'd look at comparable work in the audio codec space, including Lyra (https://arxiv.org/abs/2102.09660) and SoundStream (https://arxiv.org/abs/2107.03312).

For text compression, I think it would be novel if you could get a lossy but semantically equivalent decompression that was resilient to the exact hardware used for inference. I don't think that's what happened here, though, given the requirement for "exact same GPU and model".


Fabrice is also well known for creating FFmpeg and discovering an algorithm for computing digits of pi: https://en.m.wikipedia.org/wiki/Fabrice_Bellard

He has also held the ‘world record’ for data compression since 2019 with nncp http://mattmahoney.net/dc/text.html#1085


He is simply one of the most important programmers alive today. He should win a Turing award in his lifetime.


I am afraid not. Turing Awards are a theorists' game, not an engineers'. He could get into the IEEE Hall of Fame, though; I'd recommend that.


>Turing awards are all theorists game, not engineers

Robert Metcalfe just won the award for Ethernet. Go scan through the list of Turing winners again. It is not just awarded to pure theorists. There’s a huge slant to academics who also found (or caused) massive commercial success, especially in recent years.

Bellard could get hit by a bus tomorrow and have

>the most important video tool ever

>the most important emulation tool ever

>the best text compression tool ever

…on his resume. And with those tools he's not just wrapping algorithms cooked up by others, but also innovating on fundamental theory. VirtualBox and the Android emulator? That's QEMU. Every video platform in existence? Thin wrappers around FFmpeg. Seems deserving to me, and certainly as impactful as or more impactful than many names on that winners list.


Don’t forget about QuickJS!


"Semantically equivalent decompression" is a very interesting idea, but I take issue with the idea that different words can share semantic equivalence. But I've also been reading a lot of poetry today, so subtle word choice feels important, even though I can recognize it may not be that important...a LLM could easily produce "better" or "worse" text than a human (while being lossy in either case), but then I have questions about the reason for using words in the first place.


Wow. Lossy text compression is a... horrifying concept.


But why?

A summary is a form of lossy text compression, and it's extremely useful.

That's not what this is but if it can actually produce "semantically equivalent" decompressed text it's fascinating.


Yes, although, it brings some Philip K Dick-ian possibilities to computers, 'ghosts' slipping into books and that sort of thing. Imagine, sending a highly compressed copy of Wikipedia to a mars colony. The signal got distorted during transit and upon decompressing it some weird entity slipped in.



Ww. Lssy txt cmprsn is hrffyng cncpt.

Or is it actly ok.


Lossy text comparison is a horrifying concept?


Jeez. Text squeezing in the manner of loss is a scary idea.


Y ur account b lanc is $36 2.17


lossy text horror


JBIG2 is what the NSO group exploited as part of their zero-click iMessage exploit. Google zero has a good write up if anyone hasn’t heard about it yet—it’s super cool:

https://googleprojectzero.blogspot.com/2021/12/a-deep-dive-i...


>Practical circuits

>JBIG2 doesn't have scripting capabilities, but when combined with a vulnerability, it does have the ability to emulate circuits of arbitrary logic gates operating on arbitrary memory. So why not just use that to build your own computer architecture and script that!? That's exactly what this exploit does. Using over 70,000 segment commands defining logical bit operations, they define a small computer architecture with features such as registers and a full 64-bit adder and comparator which they use to search memory and perform arithmetic operations. It's not as fast as Javascript, but it's fundamentally computationally equivalent.

>The bootstrapping operations for the sandbox escape exploit are written to run on this logic circuit and the whole thing runs in this weird, emulated environment created out of a single decompression pass through a JBIG2 stream. It's pretty incredible, and at the same time, pretty terrifying.

Wow... that's really out of the box. They built their own VM from a JBIG compression stream vuln and wrote an exploit to run on it... Not bad.


This is so very much worth a read. It is almost literally mind blowing once you realize how deep the exploit goes.


If you can predict text very well, it can still be a good lossless compressor - you just have to encode the errors you make. Fewer errors, less data to store


More specifically, you can use entropy coding to spend fewer bits the more likely you rated the token and more bits the less likely you rated the token. So "errors" aren't a black and white thing. And since LLMs output log probabilities, they're basically perfect for this.
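
To make that concrete, a rough sketch of the accounting (ignoring the actual arithmetic coder): a token the model assigns probability p costs about -log2(p) bits under entropy coding, so better predictions mean a smaller file. The probs list below is a hypothetical sequence of the model's probabilities for the true next tokens:

    import math

    def compressed_bits(probs):
        # ideal code length: -log2(p) bits per token
        return sum(-math.log2(p) for p in probs)

    # a confident model vs. an unsure one, over 100 tokens
    print(compressed_bits([0.9] * 100))   # ~15 bits
    print(compressed_bits([0.1] * 100))   # ~332 bits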


Even if you haven't, passing along a checksum of the correct original would still allow it to be used safely. Decompression would either be correct or fail to decompress (outside of considerations of intentional hash collision attacks).
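
For instance, a sketch of the idea, where decompress is a placeholder for whatever decoder is actually used:

    import hashlib

    def pack(original: bytes, compressed: bytes):
        # ship the digest of the original alongside the compressed data
        return hashlib.sha256(original).digest(), compressed

    def unpack(digest: bytes, compressed: bytes, decompress):
        text = decompress(compressed)
        if hashlib.sha256(text).digest() != digest:
            raise ValueError("decompression did not reproduce the original")
        return text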


Yeah, using a probabilistic predictor to do lossless compression has been known for a few decades.



From the link: "The same exact GPU model and program versions must be used for compression and decompression". So it seems to me that you need a standardized and deterministic LLM execution model before this becomes useful.


See also: https://bellard.org/nncp/

NNCP: Lossless Data Compression with Neural Networks

https://news.ycombinator.com/item?id=27244004 (397 points | May 22, 2021 | 160 comments)


Here is a very basic attempt at implementation of the method idea in Python, successfully compressing and de-compressing a file without any neural weights available:

https://github.com/Magnushhoie/weightless_NN_decompression


Impressive! But an order of magnitude slower than non-LLM compression techniques.

I tried to compare with the results from the top 9 of this enwik8 compression test by Matt Mahoney: https://www.mattmahoney.net/dc/text.html

The durilca compressor by Dmitry Shkarin (what happened to him?) has been the fastest compressor in the top since its debut in 2006.

The original size, enwik8, is 10^8 bytes. The compressed size is the compressed enwik8 plus the size of a zip archive containing the decompressor.

The bpb value of rwkv_430M on enwik8 is 0.948, and the 7B models will be lower, perhaps around 0.800. So if my calculations are correct, the LLMs can perform 50% better than the best-performing conventional compressors, excluding the zipped LLM decompressor code (I am unsure about its size).

The bpb for each existing program can be calculated as: bpb = ((compressed size + decompressor size) / original size (10^8)) × 8

nncp v3.1: Given compressed size is not provided, we cannot calculate bpb for nncp v3.1 using enwik8.

cmix v19: bpb = ((14,837,987 + 223,485) / 10^8) × 8 ≈ 1.205

tensorflow-compress v4: bpb = ((15,905,037 + 55,283) / 10^8) × 8 ≈ 1.272

cmix-hp 10 Jun 2021: Given compressed size is not provided, we cannot calculate bpb for cmix-hp using enwik8.

fast-cmix: Given compressed size is not provided, we cannot calculate bpb for fast-cmix using enwik8.

starlit 31 May 2021: bpb = (15,215,107 / 10^8) × 8 ≈ 1.217

phda9 1.8: bpb = ((15,010,414 + 42,944) / 10^8) × 8 ≈ 1.205

paq8px_v206fix1: bpb = ((15,849,084 + 402,949) / 10^8) × 8 ≈ 1.265

durilca'kingsize: bpb = ((16,209,167 + 407,477) / 10^8) × 8 ≈ 1.317
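
For anyone who wants to check these numbers, a tiny sketch of the same calculation:

    # bits per byte for an enwik8 entry: (compressed size + zipped
    # decompressor size) over the 10^8-byte original, times 8
    def bpb(compressed_bytes, decompressor_bytes=0, original_bytes=10**8):
        return (compressed_bytes + decompressor_bytes) / original_bytes * 8

    print(round(bpb(14_837_987, 223_485), 3))   # cmix v19 -> ~1.205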


> The bpb value of rwkv_430M on enwik8 is 0.948 bpb

This is an unfair test, because the rwkv_430M model was almost certainly trained on wikipedia. Testing using training data is a big no-no in the ML world - you get unrealistically good results.

To test this properly, it needs to be run on a bit of newly written text which is nowhere on the internet. Obviously writing 100 MB of human written text which is not already on the internet is rather a big task...


It's fair because the model itself is included in the compressed size. You're not trying to extrapolate predictions. It makes perfect sense for compression "at rest" applications to intentionally overfit on the data you're trying to compress if that gives a smaller total size (model + compressed text).

If you were going to standardize on a foundational model that gets delivered to billions of devices to be used in network transfers (where you just transfer the compressed text and assume the other side already has the model), then it makes sense to optimize for generalizability. In that case it would make sense to exclude the model itself from the compressed size. Basically like how brotli gets to "cheat" by including a 120KB pre-defined dictionary.

Edit: Actually that can't be right. The numbers on https://www.mattmahoney.net/dc/text.html use that methodology, but the numbers in OP here can't be, since a zip compressed version of rwkv_430M is ~650 MB, so obviously it's not being included or Alice in Wonderland would have an abysmal ratio.


Newly declassified material maybe? Or just newly digitized?


Was it run on any text that was not feasibly training data for the LLMs? It wouldn't be a fair comparison otherwise


Every lossless “compression” algorithm makes some strings shorter and other strings longer, such that the average space savings against all possible strings is neutral at best. This is a fundamental consequence of the way information works — there’s no way to losslessly map (say) 8 bits of input into 7 bits of output.

The goal of compression, then, is to make common strings shorter while making uncommon strings longer. You can think of, say, UTF-8 as a simple compression algorithm: it would take 20 bits per character to encode all Unicode code points with a fixed-width encoding, but UTF-8 takes advantage of the fact that the most commonly used characters have a lot of leading zeroes (at least in English). So the characters of the English alphabet require only 8 bits to encode, but uncommon characters require up to 32 bits.
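
A quick illustration of that variable-length behaviour (byte counts per code point under UTF-8):

    # common (low) code points are short, rare (high) code points are long
    for ch in ["e", "é", "€", "𝄞"]:
        print(ch, len(ch.encode("utf-8")), "bytes")
    # e 1 bytes, é 2 bytes, € 3 bytes, 𝄞 4 bytes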

Thus, I would expect an LLM-based compression algorithm to do well on strings that were common in its training data, and make strings that were uncommon or absent slightly longer. If it did not do that, it would not be a lossless compression algorithm.


I don't disagree with anything you said.

I'm saying more that if the compression algorithm is benchmarking against "Alice in Wonderland" and has consumed the entirety of "Alice in Wonderland" in training the LLM (along with popular paragraphs and sentences quoted elsewhere), then it might do particularly well at reciting lines from that book and thus be able to compress it extremely well. I'd be more interested in seeing the compression algorithm's performance on new or unreleased works that would have no way of being training data.

As an extreme hypothetical, I could make a compression algorithm that is a table mapping an ID to an entire book and fill it with all the popular works. "Alice in Wonderland" would be replaced with a single short identifier string and achieve a ~0.001% compression ratio. An unseen work would be replaced with an <unknown> ID followed by the entire work and be slightly bigger. Then, I benchmark only the popular works and show insanely impressive results!

I have no doubt the LLM compressor would do really well on unseen works based on what you said above, but it's not a fair look at its performance to run it on works it may have been explicitly trained on.


Well, unless the weights are part of the compressed file.


Gotcha, sorry I completely misunderstood what you were asking! That’s a really insightful question that didn’t cross my mind at all.


> there’s no way to losslessly map (say) 8 bits of input into 7 bits of output.

This is due to the pigeonhole principle


>Every lossless “compression” algorithm makes some strings shorter and other strings longer, such that the average space savings against all possible strings is neutral at best. This is a fundamental consequence of the way information works — there’s no way to losslessly map (say) 8 bits of input into 7 bits of output.

I think I'm missing a point here. Is it a provable fact that all lossless compression algorithms are neutral at best? Or just something that occurs in the real world?

I can't think of a good example, so I'll give a trivial example to explain my thinking.

It seems to me that you could have an algorithm which did not change any text, except the sentence "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo", which it encodes as something much shorter like "<Buffalo*8>". The algorithm would be (very marginally) better than neutral. If this is true, then surely other improvements would be possible to make a lossless algorithm with real reduction.

(Edited a typo)


The impossibility of a general lossless compression algorithm is a consequence of the pigeon-hole principle.

https://en.m.wikipedia.org/wiki/Pigeonhole_principle


Then how do you encode an input of "<Buffalo*8>"? If your answer is "I'm using characters outside the alphabet" then, there's your answer, you're wasting bits on them.


Exactly. This is indeed always a pitfall when benchmarking LLM-based techniques. The enwik8 dataset they use, for example, is for sure in the training data.

To know how the method performs on novel data, the authors have to come up with entirely new datasets, since anything already existing must be assumed to be contaminated.


But doesn't the size in benchmarks include the size of the binary decoder? So the embedded trained data is accounted for (preventing a plain copy of Wikipedia from being included in the decoder).


I don't think the compressed size statistics on this webpage include the size of the LLM needed to decode. Some of these inputs are only a few hundred kB -- LLMs absolutely dwarf that.


A similar project by Bellard is https://bellard.org/nncp/

It's a transformer that trains itself on the user's input and nothing else. The problem is that for the first N megabytes of input the compression ratio is going to be really bad because the model is only just starting to figure things out. Once you get a couple GB of training data things get much better. That's why in practice grabbing a pre-trained model is more practical.


Idea: lossy text compression, where the LLM replaces unlikely words with likelier synonyms.


That's also a very good approach when you want to preserve anonymity. Writing idiosyncrasies can be used to distinctly fingerprint us (with enough effort), so LLMs could in theory enhance anonymity, as long as we can run these things locally on cheap hardware.


Stylometry is one thing, but your thoughts/ideas will also give you away. For true anonymity, you'd have to let the thing do your thinking for you too!



That very last chapter got me curious, so I looked at the footnote:

> A brief note for the curious. "Lena" or "Lenna" is a digitized Buck centerfold. Lena Soderberg (nee Sjooblom) was be popped keep in her folk Sweden, well married and three boys and a an and the air grog lock. In 1988, she was seen by a Swedish computer akin tome, and she was fairly charmed by what had done to her art. That was the top she knew of the way of that oil in the computer job. The item in the January 1992 end of Optical Engineering (v. 31 no. 1) data how Buck has lastly caught at to the life that their claim on Lenna Sjooblom's slide is man bigly defied. It arms as if you wish get to nab grant from Stud to blaze it in the next.

Now, English is not my first language, but I can hardly grasp what this footnote is about. Is it due to my language barrier, or did the authors just apply their lossy compression method to this footnote?


This is actually a very interesting idea; I hope someone implements it. It would be fun to read the same text with stronger and stronger lossy compression.


up-goer five: explained using only the ten hundred words people use the most often


Oh, this would be super neat for (our) language learning chatbot.. restricting the vocab the model uses.. if someone knows how this can be done. :)


I guess you just zero out the values that correspond to unwanted tokens before the softmax.
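
Or, more precisely, set the disallowed logits to -inf so they get zero probability after the softmax. A minimal numpy sketch with a made-up four-token vocabulary:

    import numpy as np

    def masked_softmax(logits, allowed_ids):
        # disallowed tokens get -inf, i.e. probability 0 after the softmax
        masked = np.full_like(logits, -np.inf)
        masked[allowed_ids] = logits[allowed_ids]
        exp = np.exp(masked - masked[allowed_ids].max())
        return exp / exp.sum()

    logits = np.array([2.0, 0.5, -1.0, 3.0])           # hypothetical 4-token vocab
    print(masked_softmax(logits, allowed_ids=[0, 1]))  # only tokens 0 and 1 survive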


Brought to us by the podcaster of PHD Comics!


What? Isn't it Randall Munroe (XKCD)?


Well, it was before that nnet compression.


Oops- I left out the link! https://xkcd.com/1133/

"you will not go to space today"


Similar to human memory for text.


There are plenty of usecases for very slow and very good text compression.

Imagine you run a big email server for millions of users.

Any email that hasn't been accessed for a month is probably a good candidate for compressing with this method.

In the unlikely event the user wishes to read an old email again, I'm sure they won't mind waiting 0.1 seconds for 10 kilobytes of email (i.e. a screenful) to decompress.


As soon as someone uses a POP or IMAP client to fetch all of their mail your server is going to catch on fire. Even if you had POP/IMAP disabled, the computation costs to decode data are enormous with this method, which is precisely why people don't use the most maximally compressing codecs for anything other than true archival applications (e.g. database backups which, unlike old email, really are expected to be accessed almost never).


Sounds like a good reason to just ship the compressed data to the customer and say "decompress it yourself - this bellard guy makes a decompressor!"


That sounds like recipe for really poor relations with the sizeable fraction of customers who are more concerned with getting their data back than delighted to be introduced to a cutting-edge text compression method...


A full IMAP sync of a 20 gigabyte inbox from gmail takes ~12 hours.

That's 400 kilobytes per second. And much of that is image attachments and stuff - perhaps only 10 kilobytes/second of text.

Maybe Google already does this?


> Any email that hasn't been accessed for a month is probably a good candidate for compressing with this method.

You don't think "the same exact GPU model and program versions must be used for compression and decompression" rules it out as more than an interesting experiment, especially given that storage is cheap and continues to get cheaper?


I think they mean the exact same model data (for the GPU). I don't think it requires the exact same model of GPU.


I’m also reading it as "the same exact GPU [LLM] and program versions must be used for compression and decompression".


Old email needs to be indexed for search, and those indexes might be as large as the original text.


Right now it's more of an idea to keep in the back of your head than anything else.

LLMs are still being created and refined, and token dictionaries are also changing. For this to work, you'd need something future-proof, to avoid ending up with an ultra-efficiently compressed file that can never be opened because part of the decompressor is an extremely big and outdated data file that may or may not disappear from the most accessible sites for being heavy and obsolete.

That being said, if we ever make an "absolute" language model with a "definitive" dictionary, or find a cheap and efficient way of grounding future models to make them "backwards compatible", this could be used to preserve data for longevity. HDDs and SSDs don't have the same durability as books, floppy discs can demagnetize, etc.

I think Microsoft was working on a "true long term" storage medium using a crystal-engraving technique to ensure the data could live for thousands of years and survive the elements. For that extreme case it might be useful.


This is hilarious.

In compression contest circles, there's long been this concept of "AI compression" (long before the current LLM wave), considered the upper limit of compression algorithms. This is based on the idea of the Dictionary... IIRC, some compression systems include the Dictionary as part of the decompressor, to save space in the payload, but then that Dictionary is fixed and may be suboptimal for the data. Others include it in the payload, so that it's optimal for the data, which most do.

But you could achieve incredible compression of, say, The Complete Works Of Shakespeare if you knew your peer knew about The Complete Works Of Shakespeare (had a rich enough dictionary) and just send them that designator of the payload! Or, you could send them a particular high-resolution picture of a wheat field with cypresses under a blue sky with an appropriately specific "description".

What Fabrice did is go ahead and put this theoretical idea into practice. And that's hilarious.


Notably, I think this would fail on the Wikipedia compression contest because that counts the size of the decompression program (model in this case), so that it measures Kolmogorov complexity. And RWKV is surely too big. A neat thing to think about though, and maybe a winning idea if a more space-efficient model can be found.


Aren't the LLMs supposed to have "seen" those text corpora before? I would expect that a very small prompt could generate the whole requested text.


"The same exact GPU model ... must be used for compression and decompression."

This seems like a massive issue for actual use. Are there really not some workarounds to get deterministic output?


Unfortunately, GPU (and even CPU SIMD) floating-point math is riddled with cross-platform determinism issues; hardware manufacturers intentionally trade that away in order to get faster math operations in general, because although the behavior of floating point is defined in IEEE 754, you get rounding errors for each operation.

Compiler optimizations (and remember, GPU drivers each use their own compiler behind the scenes to translate to their actual hardware architecture) can alter the rounding error of each operation, and parallel execution - which differs from hardware to hardware - also affects it.

Some APIs (CUDA?) let you disable all optimizations, and there are ways to get cross-platform determinism, but in general it's much, much slower if you want bit-for-bit equality across different hardware.

SPIR-V/Vulkan, for example[0], only define an error range based on ULP for some operations - not bit-for-bit equality.

[0] https://registry.khronos.org/vulkan/specs/1.2-extensions/htm...
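
A tiny illustration of why bit-for-bit reproducibility is hard: floating-point addition isn't associative, so any change in evaluation order (different kernels, thread counts, GPUs) can change the resulting bits:

    # the same three numbers, summed in two different orders
    x = (0.1 + 0.2) + 0.3
    y = 0.1 + (0.2 + 0.3)
    print(x, y, x == y)   # 0.6000000000000001 0.6 False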


Reproducible results across GPUs are theoretically possible, but it would take some effort to implement and be slower. At least the primitive operations (addition, multiplication, etc.) are there: https://docs.nvidia.com/cuda/floating-point/index.html


One way to get deterministic output is to use integer/fixed point math. Quantised models already do that for matrix multiplication, but things like softmax may still be implemented using some floating point math. It's possible to replace that, just takes a bit of extra work and is probably slower than using the GPU's native float ops.


How does this work? I know that LLMs predict the log-likelihood of the next token in a sequence.. so I guess you initialize the LLM from scratch, say that each token of the input text is the Nth most likely possibility, and record what N is for that token? And then you do some combination of run-length encoding and Huffman trees or something to record the Ns?

Since this is not open-source, but seems simple to do, perhaps I'll make my own.
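
Roughly, the encoder/decoder pair might look like this sketch, where predict_ranking is a stand-in for a deterministic model that returns token ids ordered from most to least likely given the context:

    def encode(tokens, predict_ranking):
        ranks = []
        for i, tok in enumerate(tokens):
            ranking = predict_ranking(tokens[:i])  # most-likely-first token ids
            ranks.append(ranking.index(tok))       # 0 if the model guessed right
        return ranks                               # mostly small ranks compress well

    def decode(ranks, predict_ranking):
        tokens = []
        for r in ranks:
            ranking = predict_ranking(tokens)
            tokens.append(ranking[r])              # invert the rank lookup
        return tokens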


A slightly more elegant method would be to throw an arithmetic coder on the output vector (normalized and represented as an interval) and record the interval represented by the real token to be compressed. This would allow for better compression ratios than your approach, I think (due to arbitrary probability support).

My potential issue with NNs for compression is that sometimes they predict near zero for some probabilities that actually occur in real data - in which case the number of bytes needed to encode would blow up. But that can be somewhat guarded against (perhaps by limiting the lower bound of output probabilities).
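
One simple guard is to mix the model's distribution with a small uniform floor, so no symbol can cost an unbounded number of bits; a sketch, with eps chosen arbitrarily:

    import numpy as np

    def floor_probs(p, eps=1e-4):
        p = np.asarray(p, dtype=np.float64)
        return (1 - eps) * p + eps / len(p)   # still sums to 1

    p = np.array([0.999999, 1e-6, 0.0, 0.0])
    print(-np.log2(floor_probs(p)[2]))        # ~15 bits instead of infinity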


Zstandard (zstd), with a properly trained dictionary, could do well against larger (than alice29) texts. I doubt you'd see ~0.4 bpb, but 1-1.5 seems achievable. And it would be ~100% accurate with no GPU or other overhead necessary.

A quick experiment using a small selection of truncated texts yielded 1.8 bpb against alice29. I imagine other texts could do better.
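
For reference, a rough sketch of that kind of experiment using the python-zstandard bindings; the file names are placeholders, and dictionary training needs a reasonably large and varied set of sample texts to work well:

    import zstandard as zstd

    # train a ~110 KB dictionary on representative sample texts
    samples = [open(p, "rb").read() for p in ["text1.txt", "text2.txt", "text3.txt"]]
    dictionary = zstd.train_dictionary(112_640, samples)

    data = open("alice29.txt", "rb").read()
    compressed = zstd.ZstdCompressor(level=19, dict_data=dictionary).compress(data)
    print(len(compressed) * 8 / len(data), "bits per byte")

    # decompression needs the exact same dictionary on the other side
    assert zstd.ZstdDecompressor(dict_data=dictionary).decompress(compressed) == data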


> The same exact GPU model and program versions must be used for compression and decompression.

This makes the whole approach really difficult to use in practice -- basically, the same goes for any method that relies on 100% reproducible floating-point numbers.

What are ways around that? Why do the quantized variants actually have the same problem?


Is it reasonable to imagine a future where most devices come with a, say, 500MB file used for decompressing?

Imagine the potential bandwidth savings. I bet this has applications as a modern "Dial-up accelerator" for people with slow connections and fast hardware.


It seems likely to me that we will eventually start using AI models for compression and decompression of text, images, 3D models, programs, video, and streaming. AI has the potential for much higher compression. It's actually quite feasible for most PCs and laptops to keep multi-GB models in memory. And the typical PC game is dozens of GBs.


You run into a problem where the model needs to be exactly the same on both ends. If your model is a couple of MB then updating it on occasion is no problem, but if it's several GB then you DO have a problem.

Then if you want to decompress something that was archived 10 years ago you need to make sure you can find and install the old compression model.


And could you use that file to compress itself?


It's a bunch of weights, it's not really compressible. Obviously you can't decode the model without the model so it's useless as well.


I'm curious how cmix 19 compares to rwkv_169M (the LLM with the fastest compression speed and also low GPU memory use), since cmix 19 still has a better compression ratio at the moment. Anyone know?


Intriguing promise, but I can't find technical details or source, only binary executables. Anyone else have luck?

The "(and hopefully decompress)" would usually make me think it was like when people would "compress" sequences into emoji using ChatGPT... but it's Bellard.


Is there significantly more to this than just tokenizing a document's words and then ZIP-ing the stream of tokens?


I wrote a dictionary compression type thing a while ago, should I post it somewhere?


I'd be curious to check it out!

I've been wondering this last year about the use case of compressing highly repetitive logs in a streaming fashion, and whether fine-tuning a model or combination of LLM / datastore might make a sort of adaptive online compression perform well (e.g. allowing central versioned coordination of compression steps/layers as they evolve)


https://github.com/RbtsEvrwhr-Riley/CellSol here ya go it's on my SO's github


Interesting application for the RWKV family of models.


When I first read about RWKV on Zhihu, the author said Bellard was working with him on RWKV. I am so glad ts_zip is out and brings a new application to RWKV.


Bellard still does great work... but no longer releases anything open source.

In turn, that makes it far less interesting to me.

Especially since the majority of what he releases is more of a cool tech demo than a product in its own right. A closed-source tech demo really isn't of much use to anyone who doesn't have the source code to extend it.


Why did he stop?


I can't speak for him...

But I stopped open sourcing some of my work when I got aggressive users demanding I support them for free. I realised I was pouring hours into triaging bugs, explaining why PRs weren't good enough, fixing other people's corner cases, and settling disputes, without really getting much back. Sure, my projects were getting a following and there were lots of happy users, but the actual benefit to me of that happening was negative.

One middle ground I used for a while was releasing my projects as a tar file only, not a git repo. Someone else can then make a git repo and maintain the project. Looks like Bellard did the same for some projects.


I recently came across Curio (https://github.com/dabeaz/curio) & really appreciated their FAQ section on contributing. I plan to use it in any future open source projects:

Q: Can I contribute?

A: Curio is not a community-based project seeking developers or maintainers. However, having it work reliably is important. If you've found a bug or have an idea for making it better, please file an issue.


Code is a tool. Often you build a tool to complete a task.

Open Sourcing the tools lets others do their tasks. But you run the risk of becoming an unpaid toolmaker and unpaid tool maintainer, and getting none of your own tasks done.

In a way, GitHub's biggest open-source committers are the 'gig economy' workers of the tech world. Paid in just the occasional few bucks of donation for work which would otherwise be worth hundreds of thousands of dollars if done on a salary.


A man goes to prison, and the first night, while he's lying in bed, he hears someone yell out, "44!", followed by laughter from the other prisoners.

Puzzled, he lay back down, but then he heard someone else yell out, "72!", followed by even more laughter.

"What's going on?" he asked his cellmate.

"Well, we've all heard every joke so many times, we've given them each a number to make it easier."

"Oh," he says, "can I try?"

"Sure, go ahead."

So, he yells out "102!" and the place goes nuts. People are whooping and laughing in a hysteria. He looks at his cellmate rolling on the ground laughing.

"Wow, good joke, huh?"

"Yeah! We ain't never heard that one before!"


I've not heard that version before, I like it. The way I usually hear it end:

So, he yells out "102!" and... Crickets.

"What'd I do wrong?"

"Ehh, you must not have told it right."

(...Or, in this case, "you're probably not using the right model GPU")


Reminder that compression = intelligence:

https://mattmahoney.net/dc/rationale.html

...which leads to the theory of mind known as "compressionism":

https://ceur-ws.org/Vol-1419/paper0045.pdf


No. Compression == information, not intelligence. This is a common mistake that people make, and is a popular expression making the rounds right now, but it is incorrect.

Intelligence _may arise_ from information. Information _necessarily arises_ from compression.

It is very similar, to me, to the concept of humors (https://en.wikipedia.org/wiki/Humorism). Empirically, it was observed that these things had some correlation, and they occurred together, but they are not directly causative. Similarly with the theory of spontaneous generation (https://en.wikipedia.org/wiki/Spontaneous_generation), which was another theory spawned (heh, as it were), from casual causal correlation (again, as it were).

This can be shown rather easily when you show that the time-dynamics of an exemplar system with high information diverge from the time dynamics of a system with high intelligence, i.e. there is a structural component to the information embedded in the system such that the temporal aspects of applied information are somehow necessary for intelligence.

I hope to release some work at least tangentially related to this within the next few years (though of course it is a bit more high-flying and depends on some other in-flight work). If we are to attempt to move towards AGI, I really think we need to stick with the mathematical basics. Neural networks, although they've been made out to be really complicated, are actually quite simple in terms of the concepts powering them, I believe. That's something I'm currently working on putting together and trying to distill to its essence. That will also be at least 1-2 years out, and likely before any temporally-related work, all going well. In my experience, this all is just a personal belief from a great deal of time and thought with the minutia of neural networks. That said, I could perhaps be biased with some notion of simplicity, since I've worked with them long enough for some concepts to feel comfortably second-nature.


I think what the GP meant is not that compressed data has intelligence, but rather that a compressor is necessarily "intelligent" in correlation to the degree that it compresses well on arbitrary novel datasets it wasn't tuned for. At least insofar as we define "intelligence" with measures like IQ, that relate to "number of examples of a pattern required to notice the pattern."

To compress data, you have to derive some interesting insight about that data that allows the data to be projected into a lower-dimensional, lower-entropy embedding. (In pop-neurological terms, you have to "make a neuron" that represents a concept, so that you can re-phrase your dataset into a star-schema expressed in its connections to that concept.)

The more different kinds of things a single compressor can do that for — without the compressor itself being made larger — the more "intelligent" the compressor is, no?


Yes, that is the point I'm addressing, I'm using Shannon's definition of information and making an argument about the _temporal_ aspects of it. If you're looking for a hint, it's in the Lyapunov exponents of the system, which do not come necessarily from compression alone, though compression I feel is a critical part of the process.

Hence, the set vs superset argument.


OTOH true intelligence can still poorly compress things.


Any ERM-based method will have this flaw (if it is a flaw -- I'd possibly consider it a tradeoff) -- it is the nature of compression itself. I would make the argument, I think, that I believe that this disrupts the 'information' layer of the hierarchy, not the 'intelligence', though that is just my personal 2 cents of course.


No, compressibility == complexity.

Kolmogorov, actually.


To be a little blunt, this comment does not make that much sense.

Kolmogorov complexity is based directly off of Shannon's information theory. It's the definition of the shortest possible length of a program that can output a particular string.

It is an extension of information theory. If you just take either of the words 'Kolmogorov' or 'complexity' out of the phrase 'Kolmogorov complexity', it loses all meaning. We measure the programs for a given string against that string's Kolmogorov complexity by their entropy. Hence, why at the core, the very pith of the matter at hand is information, which is a measurable quantity. Additionally, K complexity is a lower bound, not a quantity.

This is a similar kind of swap to saying "No, those baked goods are not food, because goods == items bought and sold for currency", if this recontextualizes it a bit.


I would say compression and intelligence are closely related, not equal. For example, zstd is a pretty good compressor, but does any one think it should be called a pretty good “AI”?


Read the paper...

Zstd, while it might be better than gzip or bzip, is still a very poor compressor compared to an ideal compressor (which hasn't yet been discovered).

That is why zstd acts like a rather bad AI. Note that if you wanted to use zstd as an AI, you would patch the checksum checks out of the source code, and you would then feed it a file to decompress (The cat sat on the mat), followed by a few bytes of random noise.

A great compressor would output: The cat sat on the mat. It was comfortable, so he then lay down to sleep.

A medium compressor would output: The cat sat on the mat. bat cat cat mat sat bat.

A terrible compressor would output: The cat sat on the mat. D7s"/r %we

See how each one uses knowledge at a different level to generate a completion. Notice also how those few bytes generate different amounts of output depending on the compressor's level of world understanding, and therefore its compression ratio.


> A terrible compressor would output: The cat sat on the mat. Dsr %we3 9T23 }£{D:rg!@ !jv£dP$

LLMs are sort of unable to do this because they use a fixed tokenizer instead of raw bytes. That means they won't output binary garbage even early on + saves a lot of memory, but it may hurt learning things like capitalization, rhyming, etc we think are obvious.


Even if you trained an LLM with a simplified tokenizer that simply had 256 tokens for each of 256 possible ascii characters, you would see the same result.


Concretely gzip will output something like "theudcanvas. ;cm,zumhmcyoetter toauuo long a one aay,;wvbu.mvns. x the dtls and enso.;k.like bla.njv"

https://github.com/Futrell/ziplm


Well, Zstandard can be used as a very fast classification engine, which is a task traditionally done by AI : https://twitter.com/abhi9u/status/1683141215871705088 https://github.com/cyrilou242/ftcc


Wasn't there a paper recently about a fairly simple wrapper around a common compression algorithm beating LLMs on classification tasks?


I thought that was about LLMs being trained on compressed data. But I might be thinking about a different paper.


Gzip + kNN for text classification:

https://aclanthology.org/2023.findings-acl.426.pdf
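
The core of that approach is a normalized compression distance plus a nearest-neighbour vote; a minimal sketch (not the authors' code):

    import gzip

    def ncd(a: str, b: str) -> float:
        # normalized compression distance between two texts
        ca = len(gzip.compress(a.encode()))
        cb = len(gzip.compress(b.encode()))
        cab = len(gzip.compress((a + " " + b).encode()))
        return (cab - min(ca, cb)) / max(ca, cb)

    def classify(text, labeled_examples, k=3):
        # labeled_examples: list of (text, label) pairs; majority vote of k nearest
        neighbors = sorted(labeled_examples, key=lambda ex: ncd(text, ex[0]))[:k]
        labels = [label for _, label in neighbors]
        return max(set(labels), key=labels.count)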


Not an LLM, just BERT, and it also didn't actually outperform it.

source: https://kenschutte.com/gzip-knn-paper/


Someone in sales with a product that uses zstd.


>> Reminder that compression = intelligence:

If I conclude from this that you mean that gzip is intelligent, will I be accused of bad faith?


Compression is equivalent to learning but it's not clear to me that that gets you all the way to intelligent action. In reinforcement learning terms, it can get you asymptotically to a perfect model of the environment, but it's not clear to me that it would tell you how to find the highest-value action in that environment.


There must be a relationship here with Shannon's estimate that English has about 1 bit of entropy per character and is highly redundant and 'easy' to predict.

A highly advanced AI could compress the text and predict the next sequences easily.

This seems like a direct connection like electricity and magnetism.

And maybe that's why English needs to be about 1 bit because we're not very intelligent.


I would say compression has some congruency with intelligence.


My hope is that one day I'll be reincarnated and the new me will produce 1/1000 of the output of Fabrice Bellard. Whenever I see new posts from him I just think "How can he do so much?"


Because we believe in "I did this big thing X, that computer scientists love, which is public" > "I worked on a team and contributed to big thing Y, which is profitable but meh, and is private"


Not at all. I'm just in awe about the sheer amount of incredible code that Bellard is able to produce.

As a direct refutation to what you're saying, I feel similarly about Jeff Dean at Google. His productivity is pretty legendary, and all of the stuff I know that he's done is private/proprietary (and very profitable) to Google.


The SI unit for productivity is measured in millibellards


Man fears AGI, AGI fears Fabrice Bellard.


Woman inherits the Earth.



