
Was it run on any text that could not feasibly have been training data for the LLMs? It wouldn't be a fair comparison otherwise.


Every lossless “compression” algorithm makes some strings shorter and other strings longer, such that the average space savings against all possible strings is neutral at best. This is a fundamental consequence of the way information works — there’s no way to losslessly map (say) 8 bits of input into 7 bits of output.

The goal of compression, then, is to make common strings shorter while making uncommon strings longer. You can think of, say, UTF-8 as a simple compression algorithm: it would take 21 bits per character to encode all Unicode code points with a fixed-width encoding, but UTF-8 takes advantage of the fact that the most commonly used characters (at least in English) have code points with a lot of leading zeroes. So the characters of the English alphabet require only 8 bits to encode, but uncommon characters require up to 32 bits.
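
A quick Python illustration of that trade-off (my own sketch, not from the article):

    # UTF-8 spends fewer bytes on common (ASCII) characters, more on rare ones.
    for ch in ["e", "é", "€", "𝄞"]:
        print(ch, len(ch.encode("utf-8")), "byte(s)")
    # e -> 1, é -> 2, € -> 3, 𝄞 (U+1D11E) -> 4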

Thus, I would expect an LLM-based compression algorithm to do well on strings that were common in its training data, and make strings that were uncommon or absent slightly longer. If it did not do that, it would not be a lossless compression algorithm.
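
To make that concrete, here is a minimal sketch of the underlying idea, assuming an arithmetic-coding-style setup; model_prob is a hypothetical stand-in for whatever next-token probabilities the LLM supplies:

    import math

    def ideal_code_length_bits(tokens, model_prob):
        # With arithmetic coding, a token predicted with probability p costs
        # about -log2(p) bits: familiar text compresses well, while surprising
        # text can cost more bits than its raw encoding.
        total = 0.0
        for i, tok in enumerate(tokens):
            p = model_prob(tokens[:i], tok)  # model's probability of tok given context
            total += -math.log2(p)
        return total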


I don't disagree with anything you said.

I'm saying more that if the compression algorithm is benchmarking against "Alice in Wonderland" and has consumed the entirety of "Alice in Wonderland" in training the LLM (along with popular paragraphs and sentences quoted elsewhere), then it might do particularly well at reciting lines from that book and thus be able to compress it extremely well. I'd be more interested in seeing the compression algorithm's performance on new or unreleased works that would have no way of being training data.

As an extreme hypothetical, I could make a compression algorithm that is a table mapping an ID to an entire book and fill it with all the popular works. "Alice in Wonderland" would be replaced with a single short identifier string and achieve a ~0.001% compression ratio. An unseen work would be replaced with an <unknown> ID followed by the entire work and be slightly bigger. Then, I benchmark only the popular works and show insanely impressive results!
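
Roughly this, as a hypothetical Python sketch (the book text is a placeholder):

    KNOWN_WORKS = {b"...full text of Alice in Wonderland...": b"\x01"}

    def compress(data: bytes) -> bytes:
        if data in KNOWN_WORKS:
            return b"K" + KNOWN_WORKS[data]  # popular work -> tiny ID
        return b"U" + data                   # unseen work -> one byte *bigger*

    def decompress(blob: bytes) -> bytes:
        if blob[:1] == b"K":
            return next(text for text, wid in KNOWN_WORKS.items() if wid == blob[1:])
        return blob[1:]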

I have no doubt the LLM compressor would do really well on unseen works based on what you said above, but it's not a fair look at its performance to run it on works it may have been explicitly trained on.


Well, unless the weights are part of the compressed file.


Gotcha, sorry I completely misunderstood what you were asking! That’s a really insightful question that didn’t cross my mind at all.


> there’s no way to losslessly map (say) 8 bits of input into 7 bits of output.

This is due to the pigeonhole principle.


>Every lossless “compression” algorithm makes some strings shorter and other strings longer, such that the average space savings against all possible strings is neutral at best. This is a fundamental consequence of the way information works — there’s no way to losslessly map (say) 8 bits of input into 7 bits of output.

I think I'm missing a point here. Is it a provable fact that all lossless compression algorithms are neutral at best? Or just something that occurs in the real world?

I can't think of a good example, so I'll give a trivial example to explain my thinking.

It seems to me that you could have an algorithm which did not change any text, except the sentence "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo", which it encodes as something much shorter like "<Buffalo*8>". The algorithm would be (very marginally) better than neutral. If this is true, then surely other improvements would be possible to make a lossless algorithm with real reduction.

(Edited a typo)


The impossibility of a general lossless compression algorithm is a consequence of the pigeonhole principle.

https://en.m.wikipedia.org/wiki/Pigeonhole_principle
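
As a counting sketch in Python (my illustration, not from the thread):

    # There are 2**n bit strings of length n, but only 2**n - 1 strings of
    # length strictly less than n, so no lossless scheme can shorten them all.
    n = 8
    inputs  = 2 ** n                         # 256 possible 8-bit inputs
    outputs = sum(2 ** k for k in range(n))  # 255 strings of length 0..7
    assert inputs > outputs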


Then how do you encode an input that is literally "<Buffalo*8>"? If your answer is "I'm using characters outside the alphabet", then there's your answer: you're wasting bits on them.
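
One way to see where the bits go: make the scheme actually lossless by adding an escape character (a sketch of my own, using "~" as the marker rather than "<Buffalo*8>"), and note that any input containing the marker now gets longer:

    SENTENCE = ("Buffalo buffalo Buffalo buffalo "
                "buffalo buffalo Buffalo buffalo")

    def encode(text: str) -> str:
        # Escape the escape character first, then substitute the sentence.
        return text.replace("~", "~~").replace(SENTENCE, "~B")

    def decode(text: str) -> str:
        out, i = [], 0
        while i < len(text):
            if text[i] == "~":
                out.append(SENTENCE if text[i + 1] == "B" else text[i + 1])
                i += 2
            else:
                out.append(text[i])
                i += 1
        return "".join(out)

    assert decode(encode("~ " + SENTENCE)) == "~ " + SENTENCE
    # Every literal "~" in the input costs one extra character.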


Exactly. This is always a pitfall when benchmarking LLM-based techniques. The enwik8 dataset they use, for example, is for sure in the training data.

To know how the method performs on novel data, the authors would have to come up with entirely new datasets, since anything that already exists has to be presumed contaminated.


But doesn't the size in benchmarks include the size of the binary decoder? That way the embedded trained data is accounted for (preventing a plain copy of Wikipedia from simply being included in the decoder).


I don't think the compressed-size statistics on this webpage include the size of the LLM needed to decode. Some of these inputs are only a few hundred kB; the LLM weights absolutely dwarf that.


A similar project by Bellard is https://bellard.org/nncp/

It's a transformer that trains itself on the user's input and nothing else. The problem is that for the first N megabytes of input the compression ratio is really bad, because the model is only just starting to figure things out. Once you get a couple of GB of input, things get much better. That's why grabbing a pre-trained model tends to be more practical.
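
As a toy analogy (nothing like NNCP's actual transformer, just the "learns as it goes" idea): an adaptive byte-frequency model is expensive on early data and gets cheaper as its statistics fill in.

    from collections import Counter
    import math

    def adaptive_cost_bits(data: bytes) -> float:
        # Laplace-smoothed byte frequencies learned on the fly: a stand-in
        # for a model that trains itself on the very stream it compresses.
        counts, total, bits = Counter(), 0, 0.0
        for b in data:
            p = (counts[b] + 1) / (total + 256)  # prediction before seeing b
            bits += -math.log2(p)                # early bytes cost ~8 bits,
            counts[b] += 1                       # repeated bytes get cheaper
            total += 1
        return bits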



