Ts_zip: Text Compression Using Large Language Models (bellard.org)
262 points by Deeg9rie9usi on Aug 16, 2023 | 141 comments


The reason this works is that LLMs are trained to minimize the empirical risk over a large set of data (https://en.wikipedia.org/wiki/Empirical_risk_minimization), which is where the log-likelihood-based measure comes from, often reported simply as the perplexity.

This is precisely because LLMs learn a representation of the nth-order Markov chain, which by its nature is extraordinarily sparse. This allows for excellent compression, and I would assume (and have for a while, I suppose) that the nearly-linear nature of LLMs allows for excellent interpolation between these very sparse states.

This lets us 'break' the hard combinatorial problem into one that can be solved softly, albeit with the fuzzy error that comes with any estimator.

Since computers cannot tractably compress Markov chains past a certain point, this gives LLMs some opportunity to outpace traditional methods for compression, at least in theory.

Also, this method technically cheats since you need the language model to decompress, which counts towards your dictionary size. In a truly unbounded case, I would be interested to see which method wins. I'm assuming it's the LLM.

Apologies for any factual inaccuracies or such, I am still learning many of these things. Many thanks and much love! <3 :))))


I would note you can externalize your dictionary in a lot of schemes. For tight protocol-level encoding in my custom protocols, I often build the dictionary as part of compilation and deploy it compiled into both sides. It saves a lot of overhead on small-message protocols. In this case the dictionary is probably absurdly large, but I think we are at a point where new compression techniques are going to have weird and unexpected trade-offs.


Yes, in this case I'm counting the mathematical definition for a decoder that's required when compressing large corpii of text.


I don't know whether this level of accuracy is important to you or not (if it isn't, please excuse the correction), but the plural of "corpus" is "corpora".


Oh my gosh, thank you so much. <3 :)


If you want to learn something cool and related, check out autoencoders (or VAEs). They effectively compress information by forming representations of some data.
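
A minimal PyTorch sketch of the idea (layer sizes and data are made up); the narrow bottleneck layer is the compressed representation:

    import torch
    import torch.nn as nn

    class TinyAutoencoder(nn.Module):
        def __init__(self, in_dim=784, code_dim=32):
            super().__init__()
            # the encoder squeezes the input down to a small code...
            self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                         nn.Linear(128, code_dim))
            # ...and the decoder tries to reconstruct the input from it
            self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                         nn.Linear(128, in_dim))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = TinyAutoencoder()
    x = torch.rand(16, 784)                     # a fake batch
    loss = nn.functional.mse_loss(model(x), x)  # reconstruction error to minimize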


Indeed. They are extremely cool! <3 :) I think in a way, every neural network works on compression! Though AEs & VAEs are a bit more blatant in how they do it. It's just always turned around a little, here and there. I have a pet theory that no neural network can work without compression (even generative ones that have constantly increasing layer depths! :D :)))) )


The link between compression and intelligence is a popular theory. Indeed the linked work here came out of the Hutter Prize:

> The goal of the Hutter Prize is to encourage research in artificial intelligence (AI). The organizers believe that text compression and AI are equivalent problems. Hutter proved that the optimal behavior of a goal-seeking agent in an unknown but computable environment is to guess at each step that the environment is probably controlled by one of the shortest programs consistent with all interaction so far.

https://en.wikipedia.org/wiki/Hutter_Prize


NNCP (Bellard's prelude to ts_zip, using similar techniques) is not qualified for the Hutter Prize, btw, because of the hardware and speed limitations specified by the Hutter Prize.

"Must run in ≲50 hours using a single CPU core and <10GB RAM and <100GB HDD on our test machine." Which is an Intel Core i7-620M


I agree that it is most certainly necessary to have compression to have intelligence, after all, it is the bridge from empirical examples to a learned policy. Oftentimes the subtle switch comes when people say that compression IS intelligence.

Similar to how metabolism in many ways is required for life, yet metabolism itself isn't life.

One of the challenges is that the MDL (Minimum Description Length -- https://en.wikipedia.org/wiki/Minimum_description_length) is intractable to prove directly; we can only prove that we are a bit closer to it than we were before. This of course becomes even more difficult in the temporal regime, as proving anything about an iterative mapping (i.e. a decision-making algorithm or what have you, in this case) applied over each time slice of a temporal system is nigh-impossible. We can tell _something_ about it because of the attractors generated by such a system, but even then, I think it's basically impossible to do in a closed-form manner.

That being said, I do believe compression is required for intelligence, and that deliciously drags in all of the info theory stuff, which is very fun indeed. Just seems like it gets really messy with that time component, but that's just my 2 cents at least. :'(

I didn't know that was related to the Hutter Prize. Very cool. I'd read a little bit about that prize before, and I'll take another look at it now. Probably won't start any work on anything because I really, really, really do not want to Collatz Conjecture myself again.


In a very loose sense, even plain linear regression is a form of lossy compression in the form of an orthogonal projection.


Oh, I appreciate this point, thanks for making it! I like it a lot. <3 :))))

I think, perhaps, if inputs > outputs and there is some dimensionality reduction (though I think orthogonality would be a trait of an ideal system [i.e. an emergent property approached in the limit], not one that is explicitly enforced each step).


I guess they've worked out to always get exactly the same text back? Hopefully this won't be a repeat of the photocopiers that used a similar compression technique and occasionally substituted letters in the copy.

https://www.bbc.com/news/technology-23588202


The author, Fabrice Bellard, is well-known for his work on QEMU (among other things), but I'd treat the rest of the posts as "idea sharing" at an early stage. For example, while BPG (Better Portable Graphics) didn't take off, the core idea (image formats based on video codecs) can be found in WebP (VP8, VP9) and AVIF (AV1) and HEIF (HEVC/H.265).

For this, I'd look at comparable work in the audio codec space, including Lyra (https://arxiv.org/abs/2102.09660) and SoundStream (https://arxiv.org/abs/2107.03312).

For text compression, I think it would be novel if you could get a lossy but semantically equivalent decompression that was resilient to the exact hardware used for inference. I don't think that's what happened here, though, given the requirement for "exact same GPU and model".


Fabrice is also well known for creating FFmpeg and discovering an algorithm for computing digits of pi: https://en.m.wikipedia.org/wiki/Fabrice_Bellard

He has also held the ‘world record’ for data compression since 2019 with nncp http://mattmahoney.net/dc/text.html#1085


He is simply one of the most important programmers alive today. He should win a Turing award in his lifetime.


I am afraid not. Turing Awards are a theorists' game, not an engineers'. He could get into the IEEE Hall of Fame, though; I'd recommend that.


>Turing awards are all theorists game, not engineers

Robert Metcalfe just won the award for Ethernet. Go scan through the list of Turing winners again. It is not just awarded to pure theorists. There’s a huge slant to academics who also found (or caused) massive commercial success, especially in recent years.

Bellard could get hit by a bus tomorrow and have

>the most important video tool ever

>the most important emulation tool ever

>the best text compression tool ever

…on his resume. And with those tools he's not just wrapping algorithms cooked up by others, but also innovating on fundamental theory. VirtualBox and the Android emulator? That's QEMU. Every video platform in existence? Thin wrappers around FFmpeg. Seems deserving to me, and certainly as impactful as or more impactful than many names on that winners list.


Don’t forget about QuickJS!


"Semantically equivalent decompression" is a very interesting idea, but I take issue with the idea that different words can share semantic equivalence. But I've also been reading a lot of poetry today, so subtle word choice feels important, even though I can recognize it may not be that important...a LLM could easily produce "better" or "worse" text than a human (while being lossy in either case), but then I have questions about the reason for using words in the first place.


Wow. Lossy text compression is a... horrifying concept.


But why?

A summary is a form of lossy text compression, and it's extremely useful.

That's not what this is but if it can actually produce "semantically equivalent" decompressed text it's fascinating.


Yes, although, it brings some Philip K Dick-ian possibilities to computers, 'ghosts' slipping into books and that sort of thing. Imagine, sending a highly compressed copy of Wikipedia to a mars colony. The signal got distorted during transit and upon decompressing it some weird entity slipped in.



Ww. Lssy txt cmprsn is hrffyng cncpt.

Or is it actly ok.


Lossy text comparison is a horrifying concept?


Jeez. Text squeezing in the manner of loss is a scary idea.


Y ur account b lanc is $36 2.17


lossy text horror


JBIG2 is what the NSO group exploited as part of their zero-click iMessage exploit. Google zero has a good write up if anyone hasn’t heard about it yet—it’s super cool:

https://googleprojectzero.blogspot.com/2021/12/a-deep-dive-i...


>Practical circuits

>JBIG2 doesn't have scripting capabilities, but when combined with a vulnerability, it does have the ability to emulate circuits of arbitrary logic gates operating on arbitrary memory. So why not just use that to build your own computer architecture and script that!? That's exactly what this exploit does. Using over 70,000 segment commands defining logical bit operations, they define a small computer architecture with features such as registers and a full 64-bit adder and comparator which they use to search memory and perform arithmetic operations. It's not as fast as Javascript, but it's fundamentally computationally equivalent.

>The bootstrapping operations for the sandbox escape exploit are written to run on this logic circuit and the whole thing runs in this weird, emulated environment created out of a single decompression pass through a JBIG2 stream. It's pretty incredible, and at the same time, pretty terrifying.

Wow... that's really out of the box. They built their own VM from a JBIG compression stream vuln and wrote an exploit to run on it... Not bad.


This is so very much worth a read. It is almost literally mind blowing once you realize how deep the exploit goes.


If you can predict text very well, it can still be a good lossless compressor - you just have to encode the errors you make. Fewer errors, less data to store


More specifically, you can use entropy coding to spend fewer bits the more likely you rated the token and more bits the less likely you rated the token. So "errors" aren't a black and white thing. And since LLMs output log probabilities, they're basically perfect for this.
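
To make that concrete, a rough sketch of the accounting (ignoring the actual arithmetic coder): a token the model assigns probability p costs about -log2(p) bits under entropy coding, so better predictions mean a smaller file. The probs list below is a hypothetical sequence of the model's probabilities for the true next tokens:

    import math

    def compressed_bits(probs):
        # ideal code length: -log2(p) bits per token
        return sum(-math.log2(p) for p in probs)

    # a confident model vs. an unsure one, over 100 tokens
    print(compressed_bits([0.9] * 100))   # ~15 bits
    print(compressed_bits([0.1] * 100))   # ~332 bits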


Even if you haven't, passing along a checksum of the correct original would still allow it to be used safely. Decompression would either be correct or fail to decompress (outside of considerations of intentional hash collision attacks).
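
For instance, a sketch of the idea, where decompress is a placeholder for whatever decoder is actually used:

    import hashlib

    def pack(original: bytes, compressed: bytes):
        # ship the digest of the original alongside the compressed data
        return hashlib.sha256(original).digest(), compressed

    def unpack(digest: bytes, compressed: bytes, decompress):
        text = decompress(compressed)
        if hashlib.sha256(text).digest() != digest:
            raise ValueError("decompression did not reproduce the original")
        return text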


Yeah, using a probabilistic predictor to do lossless compression has been known for a few decades.



From the link: "The same exact GPU model and program versions must be used for compression and decompression". So it seems to me that you need a standardized and deterministic LLM execution model before this becomes useful.


See also: https://bellard.org/nncp/

NNCP: Lossless Data Compression with Neural Networks

https://news.ycombinator.com/item?id=27244004 (397 points | May 22, 2021 | 160 comments)


Here is a very basic attempt at implementation of the method idea in Python, successfully compressing and de-compressing a file without any neural weights available:

https://github.com/Magnushhoie/weightless_NN_decompression


Impressive! But an order of magnitude slower than non-LLM compression techniques.

I tried to compare with the results from the top 9 of this enwik8 compression test by Matt Mahoney: https://www.mattmahoney.net/dc/text.html

The durilca compressor by Dmitry Shkarin (what happened to him?) has been the fastest compressor in the top since its debut in 2006.

The original size, enwik8, is 10^8 bytes. The compressed size is the compressed enwik8 plus the size of a zip archive containing the decompressor.

The bpb value of rwkv_430M on enwik8 is 0.948, and the 7B models will be lower, perhaps around 0.800. So if my calculations are correct, the LLMs can perform 50% better than the best-performing conventional compressors, excluding the zipped LLM decompressor code (I am unsure about its size).

The bpb for each existing program can be calculated as: bpb = ((compressed size + decompressor size) / original size (10^8)) × 8

nncp v3.1: Given compressed size is not provided, we cannot calculate bpb for nncp v3.1 using enwik8.

cmix v19: bpb = ((14,837,987 + 223,485) / 10^8) × 8 ≈ 1.205

tensorflow-compress v4: bpb = ((15,905,037 + 55,283) / 10^8) × 8 ≈ 1.272

cmix-hp 10 Jun 2021: Given compressed size is not provided, we cannot calculate bpb for cmix-hp using enwik8.

fast-cmix: Given compressed size is not provided, we cannot calculate bpb for fast-cmix using enwik8.

starlit 31 May 2021: bpb = (15,215,107 / 10^8) × 8 ≈ 1.217

phda9 1.8: bpb = ((15,010,414 + 42,944) / 10^8) × 8 ≈ 1.205

paq8px_v206fix1: bpb = ((15,849,084 + 402,949) / 10^8) × 8 ≈ 1.265

durilca'kingsize: bpb = ((16,209,167 + 407,477) / 10^8) × 8 ≈ 1.317
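
For anyone who wants to check these numbers, a tiny sketch of the same calculation:

    # bits per byte for an enwik8 entry: (compressed size + zipped
    # decompressor size) over the 10^8-byte original, times 8
    def bpb(compressed_bytes, decompressor_bytes=0, original_bytes=10**8):
        return (compressed_bytes + decompressor_bytes) / original_bytes * 8

    print(round(bpb(14_837_987, 223_485), 3))   # cmix v19 -> ~1.205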


> The bpb value of rwkv_430M on enwik8 is 0.948 bpb

This is an unfair test, because the rwkv_430M model was almost certainly trained on wikipedia. Testing using training data is a big no-no in the ML world - you get unrealistically good results.

To test this properly, it needs to be run on a bit of newly written text which is nowhere on the internet. Obviously writing 100 MB of human written text which is not already on the internet is rather a big task...


It's fair because the model itself is included in the compressed size. You're not trying to extrapolate predictions. It makes perfect sense for compression "at rest" applications to intentionally overfit on the data you're trying to compress if that gives a smaller total size (model + compressed text).

If you were going to standardize on a foundational model that gets delivered to billions of devices to be used in network transfers (where you just transfer the compressed text and assume the other side already has the model), then it makes sense to optimize for generalizability. In that case it would make sense to exclude the model itself from the compressed size. Basically like how brotli gets to "cheat" by including a 120KB pre-defined dictionary.

Edit: Actually that can't be right. The numbers on https://www.mattmahoney.net/dc/text.html use that methodology, but the numbers in OP here can't be, since a zip compressed version of rwkv_430M is ~650 MB, so obviously it's not being included or Alice in Wonderland would have an abysmal ratio.


Newly declassified material maybe? Or just newly digitized?


Was it run on any text that was not feasibly training data for the LLMs? It wouldn't be a fair comparison otherwise


Every lossless “compression” algorithm makes some strings shorter and other strings longer, such that the average space savings against all possible strings is neutral at best. This is a fundamental consequence of the way information works — there’s no way to losslessly map (say) 8 bits of input into 7 bits of output.

The goal of compression, then, is to make common strings shorter while making uncommon strings longer. You can think of, say, UTF-8 as a simple compression algorithm: it would take 20 bits per character to encode all Unicode code points with a fixed-width encoding, but UTF-8 takes advantage of the fact that the most commonly used characters have a lot of leading zeroes (at least in English). So the characters of the English alphabet require only 8 bits to encode, but uncommon characters require up to 32 bits.
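
A quick illustration of that variable-length behaviour (byte counts per code point under UTF-8):

    # common (low) code points are short, rare (high) code points are long
    for ch in ["e", "é", "€", "𝄞"]:
        print(ch, len(ch.encode("utf-8")), "bytes")
    # e 1 bytes, é 2 bytes, € 3 bytes, 𝄞 4 bytes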

Thus, I would expect an LLM-based compression algorithm to do well on strings that were common in its training data, and make strings that were uncommon or absent slightly longer. If it did not do that, it would not be a lossless compression algorithm.


I don't disagree with anything you said.

I'm saying more that if the compression algorithm is benchmarking against "Alice in Wonderland" and has consumed the entirety of "Alice in Wonderland" in training the LLM (along with popular paragraphs and sentences quoted elsewhere), then it might do particularly well at reciting lines from that book and thus be able to compress it extremely well. I'd be more interested in seeing the compression algorithm's performance on new or unreleased works that would have no way of being training data.

As an extreme hypothetical, I could make a compression algorithm that is a table mapping an ID to an entire book and fill it with all the popular works. "Alice in Wonderland" would be replaced with a single short identifier string and achieve a ~0.001% compression ratio. An unseen work would be replaced with an <unknown> ID followed by the entire work and be slightly bigger. Then, I benchmark only the popular works and show insanely impressive results!

I have no doubt the LLM compressor would do really well on unseen works based on what you said above, but it's not a fair look at its performance to run it on works it may have been explicitly trained on.


Well, unless the weights are part of the compressed file.


Gotcha, sorry I completely misunderstood what you were asking! That’s a really insightful question that didn’t cross my mind at all.


> there’s no way to losslessly map (say) 8 bits of input into 7 bits of output.

This is due to the pigeonhole principle


>Every lossless “compression” algorithm makes some strings shorter and other strings longer, such that the average space savings against all possible strings is neutral at best. This is a fundamental consequence of the way information works — there’s no way to losslessly map (say) 8 bits of input into 7 bits of output.

I think I'm missing a point here. Is it a provable fact that all lossless compression algorithms are neutral at best? Or just something that occurs in the real world?

I can't think of a good example, so I'll give a trivial example to explain my thinking.

It seems to me that you could have an algorithm which did not change any text, except the sentence "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo", which it encodes as something much shorter like "<Buffalo*8>". The algorithm would be (very marginally) better than neutral. If this is true, then surely other improvements would be possible to make a lossless algorithm with real reduction.

(Edited a typo)


The impossibility of a general lossless compression algorithm is a consequence of the pigeon-hole principle.

https://en.m.wikipedia.org/wiki/Pigeonhole_principle


Then how do you encode an input of "<Buffalo*8>"? If your answer is "I'm using characters outside the alphabet" then, there's your answer, you're wasting bits on them.


Exactly. This is indeed always a pitfall when benchmarking LLM-based techniques. The enwik8 dataset they use, for example, is for sure in the training data.

To know how the method performs on novel data, the authors have to come up with entirely new datasets, since anything already existing must be assumed to be contaminated.


But doesn't the size in benchmarks include the size of the binary decoder? So the embedded trained data is accounted for (preventing a plain copy of Wikipedia from being included in the decoder).


I don't think the compressed size statistics on this webpage include the size of the LLM needed to decode. Some of these inputs are only a few hundred kB -- LLMs absolutely dwarf that.


A similar project by Bellard is https://bellard.org/nncp/

It's a transformer that trains itself on the user's input and nothing else. The problem is that for the first N megabytes of input the compression ratio is going to be really bad because the model is only just starting to figure things out. Once you get a couple GB of training data things get much better. That's why in practice grabbing a pre-trained model is more practical.


Idea: lossy text compression, where the LLM replaces unlikely words with likelier synonyms.


That's also a very good approach when you want to preserve anonymity. Writing idiosyncrasies can be used to distinctly fingerprint us (with enough effort), so LLMs could in theory enhance anonymity, as long as we can run these things locally on cheap hardware.


Stylometry is one thing, but your thoughts/ideas will also give you away. For true anonymity, you'd have to let the thing do your thinking for you too!



That very last chapter got me curious, so I looked at the footnote:

> A brief note for the curious. "Lena" or "Lenna" is a digitized Buck centerfold. Lena Soderberg (nee Sjooblom) was be popped keep in her folk Sweden, well married and three boys and a an and the air grog lock. In 1988, she was seen by a Swedish computer akin tome, and she was fairly charmed by what had done to her art. That was the top she knew of the way of that oil in the computer job. The item in the January 1992 end of Optical Engineering (v. 31 no. 1) data how Buck has lastly caught at to the life that their claim on Lenna Sjooblom's slide is man bigly defied. It arms as if you wish get to nab grant from Stud to blaze it in the next.

Now, English is not my first language, but I can hardly grasp what this footnote is about. Is it due to my language barrier, or did the authors just apply their lossy compression method to this footnote?


This is actually a very interesting idea; I hope someone implements it. It would be fun to read the same text with stronger and stronger lossy compression.


up-goer five: explained using only the ten hundred words people use the most often


Oh, this would be super neat for (our) language learning chatbot.. restricting the vocab the model uses.. if someone knows how this can be done. :)


I guess you just zero out the values that correspond to unwanted tokens before the softmax.
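
Or, more precisely, set the disallowed logits to -inf so they get zero probability after the softmax. A minimal numpy sketch with a made-up four-token vocabulary:

    import numpy as np

    def masked_softmax(logits, allowed_ids):
        # disallowed tokens get -inf, i.e. probability 0 after the softmax
        masked = np.full_like(logits, -np.inf)
        masked[allowed_ids] = logits[allowed_ids]
        exp = np.exp(masked - masked[allowed_ids].max())
        return exp / exp.sum()

    logits = np.array([2.0, 0.5, -1.0, 3.0])           # hypothetical 4-token vocab
    print(masked_softmax(logits, allowed_ids=[0, 1]))  # only tokens 0 and 1 survive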


Brought to us by the podcaster of PHD Comics!


What? Isn't it Randall Munroe (XKCD)?


Well, it was before that nnet compression.


Oops- I left out the link! https://xkcd.com/1133/

"you will not go to space today"


Similar to human memory for text.


There are plenty of usecases for very slow and very good text compression.

Imagine you run a big email server for millions of users.

Any email that hasn't been accessed for a month is probably a good candidate for compressing with this method.

In the unlikely event the user wishes to read an old email again, I'm sure they won't mind waiting 0.1 seconds for 10 kilobytes of email (i.e. a screenful) to decompress.


As soon as someone uses a POP or IMAP client to fetch all of their mail your server is going to catch on fire. Even if you had POP/IMAP disabled, the computation costs to decode data are enormous with this method, which is precisely why people don't use the most maximally compressing codecs for anything other than true archival applications (e.g. database backups which, unlike old email, really are expected to be accessed almost never).


Sounds like a good reason to just ship the compressed data to the customer and say "decompress it yourself - this bellard guy makes a decompressor!"


That sounds like recipe for really poor relations with the sizeable fraction of customers who are more concerned with getting their data back than delighted to be introduced to a cutting-edge text compression method...


A full IMAP sync of a 20 gigabyte inbox from gmail takes ~12 hours.

That's 400 kilobytes per second. And much of that is image attachments and stuff - perhaps only 10 kilobytes/second of text.

Maybe Google already does this?


> Any email that hasn't been accessed for a month is probably a good candidate for compressing with this method.

You don't think "the same exact GPU model and program versions must be used for compression and decompression" rules it out as more than an interesting experiment, especially given that storage is cheap and continues to get cheaper?


I think they mean the exact same model data (for the GPU). I don't think it requires the exact same model of GPU.


I’m also reading it as "the same exact GPU [LLM] and program versions must be used for compression and decompression".


Old email needs to be indexed for search, and those indexes might be as large as the original text.


Right now it's more of an idea to keep in the back of your head than anything else.

LLMs are still being created and refined, and token dictionaries are also changing. For this to work, you'd need something future-proof, to avoid ending up with an ultra-efficiently compressed file that can never be opened because part of the decompressor is an extremely big and outdated data file that may or may not disappear from the most accessible sites for being heavy and obsolete.

That being said, if we ever make an "absolute" language model with a "definitive" dictionary, or find a cheap and efficient way of grounding future models to make them "backwards compatible", this could be used to preserve data for longevity. HDDs and SSDs don't have the same durability as books, floppy discs can demagnetize, etc.

I think Microsoft was working on a "true long term" storage medium using a crystal-engraving technique to ensure the data could live for thousands of years and survive the elements. For that extreme case it might be useful.


This is hilarious.

In compression contest circles, there's long been this concept of "AI compression" (long before the current LLM wave), considered the upper limit of compression algorithms. This is based on the idea of the Dictionary... IIRC, some compression systems include the Dictionary as part of the decompressor, to save space in the payload, but then that Dictionary is fixed and may be suboptimal for the data. Others include it in the payload, so that it's optimal for the data, which most do.

But you could achieve incredible compression of, say, The Complete Works Of Shakespeare if you knew your peer knew about The Complete Works Of Shakespeare (had a rich enough dictionary) and just send them that designator of the payload! Or, you could send them a particular high-resolution picture of a wheat field with cypresses under a blue sky with an appropriately specific "description".

What Fabrice did is go ahead and put this theoretical idea into practice. And that's hilarious.


Notably, I think this would fail on the Wikipedia compression contest because that counts the size of the decompression program (model in this case), so that it measures Kolmogorov complexity. And RWKV is surely too big. A neat thing to think about though, and maybe a winning idea if a more space-efficient model can be found.


Aren't the LLMs supposed to have "seen" those text corpora before? I would expect that a very small prompt could generate the whole requested text.


"The same exact GPU model ... must be used for compression and decompression."

This seems like a massive issue for actual use. Are there really not some workarounds to get deterministic output?


Unfortunately, GPU (and even CPU SIMD) floating-point math is riddled with cross-platform determinism issues; hardware manufacturers intentionally trade that away in order to get faster math operations in general, because although the behavior of floating point is defined in IEEE 754, you get rounding errors for each operation.

Compiler optimizations (and remember, GPU drivers each use their own compiler behind the scenes to translate to their actual hardware architecture) can alter the rounding error of each operation, and parallel execution - which differs from hardware to hardware - also affects it.

Some APIs (CUDA?) let you disable all optimizations, and there are ways to get cross-platform determinism, but in general it's much, much slower if you want bit-for-bit equality across different hardware.

SPIR-V/Vulkan, for example[0], only define an error range based on ULP for some operations - not bit-for-bit equality.

[0] https://registry.khronos.org/vulkan/specs/1.2-extensions/htm...
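
A tiny illustration of why bit-for-bit reproducibility is hard: floating-point addition isn't associative, so any change in evaluation order (different kernels, thread counts, GPUs) can change the resulting bits:

    # the same three numbers, summed in two different orders
    x = (0.1 + 0.2) + 0.3
    y = 0.1 + (0.2 + 0.3)
    print(x, y, x == y)   # 0.6000000000000001 0.6 False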


Reproducible results across GPUs are theoretically possible, but it would take some effort to implement and be slower. At least the primitive operations (addition, multiplication, etc.) are there: https://docs.nvidia.com/cuda/floating-point/index.html


One way to get deterministic output is to use integer/fixed point math. Quantised models already do that for matrix multiplication, but things like softmax may still be implemented using some floating point math. It's possible to replace that, just takes a bit of extra work and is probably slower than using the GPU's native float ops.


How does this work? I know that LLMs predict the log-likelihood of the next token in a sequence.. so I guess you initialize the LLM from scratch, say that each token of the input text is the Nth most likely possibility, and record what N is for that token? And then you do some combination of run-length encoding and Huffman trees or something to record the Ns?

Since this is not open-source, but seems simple to do, perhaps I'll make my own.
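
Roughly, the encoder/decoder pair might look like this sketch, where predict_ranking is a stand-in for a deterministic model that returns token ids ordered from most to least likely given the context:

    def encode(tokens, predict_ranking):
        ranks = []
        for i, tok in enumerate(tokens):
            ranking = predict_ranking(tokens[:i])  # most-likely-first token ids
            ranks.append(ranking.index(tok))       # 0 if the model guessed right
        return ranks                               # mostly small ranks compress well

    def decode(ranks, predict_ranking):
        tokens = []
        for r in ranks:
            ranking = predict_ranking(tokens)
            tokens.append(ranking[r])              # invert the rank lookup
        return tokens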


A slightly more elegant method would be to throw an arithmetic coder on the output vector (normalized and represented as an interval) and record the interval represented by the real token to be compressed. This would allow for better compression ratios than your approach, I think (due to arbitrary probability support).

My potential issue with NNs for compression is that sometimes they predict near zero for some probabilities that actually occur in real data - in which case the number of bytes needed to encode would blow up. But that can be somewhat guarded against (perhaps by limiting the lower bound of output probabilities).
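
One simple guard is to mix the model's distribution with a small uniform floor, so no symbol can cost an unbounded number of bits; a sketch, with eps chosen arbitrarily:

    import numpy as np

    def floor_probs(p, eps=1e-4):
        p = np.asarray(p, dtype=np.float64)
        return (1 - eps) * p + eps / len(p)   # still sums to 1

    p = np.array([0.999999, 1e-6, 0.0, 0.0])
    print(-np.log2(floor_probs(p)[2]))        # ~15 bits instead of infinity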


Zstandard (zstd), with a properly trained dictionary, could do well against larger (than alice29) texts. I doubt you'd see ~0.4 bpb, but 1-1.5 seems achievable. And it would be ~100% accurate with no GPU or other overhead necessary.

A quick experiment using a small selection of truncated texts yielded 1.8 bpb against alice29. I imagine other texts could do better.
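
For reference, a rough sketch of that kind of experiment using the python-zstandard bindings; the file names are placeholders, and dictionary training needs a reasonably large and varied set of sample texts to work well:

    import zstandard as zstd

    # train a ~110 KB dictionary on representative sample texts
    samples = [open(p, "rb").read() for p in ["text1.txt", "text2.txt", "text3.txt"]]
    dictionary = zstd.train_dictionary(112_640, samples)

    data = open("alice29.txt", "rb").read()
    compressed = zstd.ZstdCompressor(level=19, dict_data=dictionary).compress(data)
    print(len(compressed) * 8 / len(data), "bits per byte")

    # decompression needs the exact same dictionary on the other side
    assert zstd.ZstdDecompressor(dict_data=dictionary).decompress(compressed) == data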


> The same exact GPU model and program versions must be used for compression and decompression.

This makes the whole approach really difficult to use in practice -- basically, the same goes for any method that relies on 100% reproducible floating-point numbers.

What are ways around that? Why do the quantized variants actually have the same problem?


Is it reasonable to imagine a future where most devices come with a, say, 500MB file used for decompressing?

Imagine the potential bandwidth savings. I bet this has applications as a modern "Dial-up accelerator" for people with slow connections and fast hardware.


It seems likely to me that we will eventually start using AI models for compression and decompression of text, images, 3D models, programs, video, and streaming. AI has the potential for much higher compression. It's actually quite feasible for most PCs and laptops to keep multi-GB models in memory. And the typical PC game is dozens of GBs.


You run into a problem where the model needs to be exactly the same on both ends. If your model is a couple of MB then updating it on occasion is no problem, but if it's several GB then you DO have a problem.

Then if you want to decompress something that was archived 10 years ago you need to make sure you can find and install the old compression model.


And could you use that file to compress itself?


It's a bunch of weights, it's not really compressible. Obviously you can't decode the model without the model so it's useless as well.


I'm curious how cmix 19 compares to rwkv_169M (the LLM with the fastest compression speed and also low GPU memory use), since cmix 19 still has a better compression ratio at the moment. Anyone know?


Intriguing promise, but I can't find technical details or source, only binary executables. Anyone else have luck?

The "(and hopefully decompress)" would usually make me think it was like when people would "compress" sequences into emoji using ChatGPT... but it's Bellard.


Is there significantly more to this than just tokenizing a document's words and then ZIP-ing the stream of tokens?


I wrote a dictionary compression type thing a while ago, should I post it somewhere?


I'd be curious to check it out!

I've been wondering this last year about the use case of compressing highly repetitive logs in a streaming fashion, and whether fine-tuning a model or combination of LLM / datastore might make a sort of adaptive online compression perform well (e.g. allowing central versioned coordination of compression steps/layers as they evolve)


https://github.com/RbtsEvrwhr-Riley/CellSol here ya go it's on my SO's github


Interesting application for the RWKV family of models.


When I first read about RWKV on Zhihu, the author said Bellard was working with him on RWKV. I am so glad ts_zip is out and brings a new application to RWKV.


Bellard still does great work... but no longer releases anything open source.

In turn, that makes it far less interesting to me.

Especially since the majority of what he releases is more of a cool tech demo than a product in its own right. A closed-source tech demo really isn't of much use to anyone who doesn't have the source code to extend it.


Why did he stop?


I can't speak for him...

But I stopped open sourcing some of my work when I got aggressive users demanding I support them for free. I realised I was pouring hours into triaging bugs, explaining why PRs weren't good enough, fixing other people's corner cases, and settling disputes, without really getting much back. Sure, my projects were getting a following and there were lots of happy users, but the actual benefit to me of that happening was negative.

One middle ground I used for a while was releasing my projects as a tar file only, not a git repo. Someone else can then make a git repo and maintain the project. Looks like Bellard did the same for some projects.


I recently came across Curio (https://github.com/dabeaz/curio) & really appreciated their FAQ section on contributing. I plan to use it in any future open source projects:

Q: Can I contribute?

A: Curio is not a community-based project seeking developers or maintainers. However, having it work reliably is important. If you've found a bug or have an idea for making it better, please file an issue.


Code is a tool. Often you build a tool to complete a task.

Open Sourcing the tools lets others do their tasks. But you run the risk of becoming an unpaid toolmaker and unpaid tool maintainer, and getting none of your own tasks done.

In a way, GitHub's biggest open-source committers are the 'gig economy' workers of the tech world. Paid in just the occasional few bucks of donation for work which would otherwise be worth hundreds of thousands of dollars if done on a salary.


A man goes to prison, and the first night, while he's lying in bed, he hears someone yell out, "44!", followed by laughter from the other prisoners.

Puzzled, he lay back down, but then he heard someone else yell out, "72!", followed by even more laughter.

"What's going on?" he asked his cellmate.

"Well, we've all heard every joke so many times, we've given them each a number to make it easier."

"Oh," he says, "can I try?"

"Sure, go ahead."

So, he yells out "102!" and the place goes nuts. People are whooping and laughing in a hysteria. He looks at his cellmate rolling on the ground laughing.

"Wow, good joke, huh?"

"Yeah! We ain't never heard that one before!"


I've not heard that version before, I like it. The way I usually hear it end:

So, he yells out "102!" and... Crickets.

"What'd I do wrong?"

"Ehh, you must not have told it right."

(...Or, in this case, "you're probably not using the right model GPU")


Reminder that compression = intelligence:

https://mattmahoney.net/dc/rationale.html

...which leads to the theory of mind known as "compressionism":

https://ceur-ws.org/Vol-1419/paper0045.pdf


No. Compression == information, not intelligence. This is a common mistake that people make, and is a popular expression making the rounds right now, but it is incorrect.

Intelligence _may arise_ from information. Information _necessarily arises_ from compression.

It is very similar, to me, to the concept of humors (https://en.wikipedia.org/wiki/Humorism). Empirically, it was observed that these things had some correlation, and they occurred together, but they are not directly causative. Similarly with the theory of spontaneous generation (https://en.wikipedia.org/wiki/Spontaneous_generation), which was another theory spawned (heh, as it were), from casual causal correlation (again, as it were).

This can be shown rather easily when you show that the time-dynamics of an exemplar system with high information diverge from the time dynamics of a system with high intelligence, i.e. there is a structural component to the information embedded in the system such that the temporal aspects of applied information are somehow necessary for intelligence.

I hope to release some work at least tangentially related to this within the next few years (though of course it is a bit more high-flying and depends on some other in-flight work). If we are to attempt to move towards AGI, I really think we need to stick with the mathematical basics. Neural networks, although they've been made out to be really complicated, are actually quite simple in terms of the concepts powering them, I believe. That's something I'm currently working on putting together and trying to distill to its essence. That will also be at least 1-2 years out, and likely before any temporally-related work, all going well. In my experience, this all is just a personal belief from a great deal of time and thought with the minutia of neural networks. That said, I could perhaps be biased with some notion of simplicity, since I've worked with them long enough for some concepts to feel comfortably second-nature.


I think what the GP meant is not that compressed data has intelligence, but rather that a compressor is necessarily "intelligent" in correlation to the degree that it compresses well on arbitrary novel datasets it wasn't tuned for. At least insofar as we define "intelligence" with measures like IQ, that relate to "number of examples of a pattern required to notice the pattern."

To compress data, you have to derive some interesting insight about that data that allows the data to be projected into a lower-dimensional, lower-entropy embedding. (In pop-neurological terms, you have to "make a neuron" that represents a concept, so that you can re-phrase your dataset into a star-schema expressed in its connections to that concept.)

The more different kinds of things a single compressor can do that for — without the compressor itself being made larger — the more "intelligent" the compressor is, no?


Yes, that is the point I'm addressing, I'm using Shannon's definition of information and making an argument about the _temporal_ aspects of it. If you're looking for a hint, it's in the Lyapunov exponents of the system, which do not come necessarily from compression alone, though compression I feel is a critical part of the process.

Hence, the set vs superset argument.


OTOH true intelligence can still poorly compress things.


Any ERM-based method will have this flaw (if it is a flaw -- I'd possibly consider it a tradeoff) -- it is the nature of compression itself. I would make the argument, I think, that I believe that this disrupts the 'information' layer of the hierarchy, not the 'intelligence', though that is just my personal 2 cents of course.


No, compressibility == complexity.

Kolmogorov, actually.


To be a little blunt, this comment does not make that much sense.

Kolmogorov complexity is based directly off of Shannon's information theory. It's the definition of the shortest possible length of a program that can output a particular string.

It is an extension of information theory. If you just take either of the words 'Kolmogorov' or 'complexity' out of the phrase 'Kolmogorov complexity', it loses all meaning. We measure the programs for a given string against that string's Kolmogorov complexity by their entropy. Hence, why at the core, the very pith of the matter at hand is information, which is a measurable quantity. Additionally, K complexity is a lower bound, not a quantity.

This is a similar kind of swap to saying "No, those baked goods are not food, because goods == items bought and sold for currency", if this recontextualizes it a bit.


I would say compression and intelligence are closely related, not equal. For example, zstd is a pretty good compressor, but does any one think it should be called a pretty good “AI”?


Read the paper...

Zstd, while it might be better than gzip or bzip, is still a very poor compressor compared to an ideal compressor (which hasn't yet been discovered).

That is why zstd acts like a rather bad AI. Note that if you wanted to use zstd as an AI, you would patch the checksum checks out of the source code, and you would then feed it a file to decompress (The cat sat on the mat), followed by a few bytes of random noise.

A great compressor would output: The cat sat on the mat. It was comfortable, so he then lay down to sleep.

A medium compressor would output: The cat sat on the mat. bat cat cat mat sat bat.

A terrible compressor would output: The cat sat on the mat. D7s"/r %we

See how each one uses knowledge at a different level to generate a completion. Notice also how those few bytes generate different amounts of output depending on the compressor's level of world understanding, and therefore its compression ratio.


> A terrible compressor would output: The cat sat on the mat. Dsr %we3 9T23 }£{D:rg!@ !jv£dP$

LLMs are sort of unable to do this because they use a fixed tokenizer instead of raw bytes. That means they won't output binary garbage even early on + saves a lot of memory, but it may hurt learning things like capitalization, rhyming, etc we think are obvious.


Even if you trained an LLM with a simplified tokenizer that simply had 256 tokens for each of 256 possible ascii characters, you would see the same result.


Concretely gzip will output something like "theudcanvas. ;cm,zumhmcyoetter toauuo long a one aay,;wvbu.mvns. x the dtls and enso.;k.like bla.njv"

https://github.com/Futrell/ziplm


Well, Zstandard can be used as a very fast classification engine, which is a task traditionally done by AI : https://twitter.com/abhi9u/status/1683141215871705088 https://github.com/cyrilou242/ftcc


Wasn't there a paper recently about a fairly simple wrapper around a common compression algorithm beating LLMs on classification tasks?


I thought that was about LLMs being trained on compressed data. But I might be thinking about a different paper.


Gzip + kNN for text classification:

https://aclanthology.org/2023.findings-acl.426.pdf
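
The core of that approach is a normalized compression distance plus a nearest-neighbour vote; a minimal sketch (not the authors' code):

    import gzip

    def ncd(a: str, b: str) -> float:
        # normalized compression distance between two texts
        ca = len(gzip.compress(a.encode()))
        cb = len(gzip.compress(b.encode()))
        cab = len(gzip.compress((a + " " + b).encode()))
        return (cab - min(ca, cb)) / max(ca, cb)

    def classify(text, labeled_examples, k=3):
        # labeled_examples: list of (text, label) pairs; majority vote of k nearest
        neighbors = sorted(labeled_examples, key=lambda ex: ncd(text, ex[0]))[:k]
        labels = [label for _, label in neighbors]
        return max(set(labels), key=labels.count)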


Not an LLM, just BERT, and it also didn't actually outperform it.

source: https://kenschutte.com/gzip-knn-paper/


Someone in sales with a product that uses zstd.


>> Reminder that compression = intelligence:

If I conclude from this that you mean that gzip is intelligent, will I be accused of bad faith?


Compression is equivalent to learning but it's not clear to me that that gets you all the way to intelligent action. In reinforcement learning terms, it can get you asymptotically to a perfect model of the environment, but it's not clear to me that it would tell you how to find the highest-value action in that environment.


There must be a relationship here with Shannon's estimate that English has about 1 bit of entropy per character and is highly redundant and 'easy' to predict.

A highly advanced AI could compress the text and predict the next sequences easily.

This seems like a direct connection like electricity and magnetism.

And maybe that's why English needs to be about 1 bit because we're not very intelligent.


I would say compression has some congruency with intelligence.


My hope is that one day I'll be reincarnated and the new me will produce 1/1000 of the output of Fabrice Bellard. Whenever I see new posts from him I just think "How can he do so much?"


Because we believe in "I did this big thing X, that computer scientists love, which is public" > "I worked on a team and contributed to big thing Y, which is profitable but meh, and is private"


Not at all. I'm just in awe about the sheer amount of incredible code that Bellard is able to produce.

As a direct refutation to what you're saying, I feel similarly about Jeff Dean at Google. His productivity is pretty legendary, and all of the stuff I know that he's done is private/proprietary (and very profitable) to Google.


The SI unit for productivity is measured in millibellards


Man fears AGI, AGI fears Fabrice Bellard.


Woman inherits the Earth.



