
The article emphasizes the wrong thing, in my view. The interesting part is that compression -- without learning a model -- can be used for classification. This raises the question of what other information-theoretic measures could be used for this, perhaps cheaper, lossy ones.

To Compress or Not to Compress - Self-Supervised Learning and Information Theory: A Review https://arxiv.org/abs/2304.09355
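
For concreteness, here's a rough sketch of the classification-by-compression idea as I understand the paper (gzip-compressed lengths turned into a normalized compression distance, then kNN). The function and variable names are mine, not the paper's:

    import gzip
    from collections import Counter

    def clen(s):
        return len(gzip.compress(s.encode("utf-8")))

    def ncd(a, b):
        # Normalized compression distance from gzip-compressed lengths.
        ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
        return (cab - min(ca, cb)) / max(ca, cb)

    def classify(query, train, k=3):
        # train: list of (text, label) pairs; majority vote over the
        # k nearest neighbors under NCD.
        neighbors = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

No model is learned anywhere; the compressor's statistics stand in for the features.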



I remember seeing an example of using zip to classify languages. You take a set of equally sized documents whose languages you know, then concatenate each one with the unknown text and zip the result. The pairing that produces the smallest compressed output likely indicates the target language.

I can't find the original blog, but there's a note about it here - https://stackoverflow.com/questions/39142778/how-to-determin...
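
Something like this, if I remember the trick right (the corpus layout and names here are placeholders, not from that blog):

    import gzip

    def detect_language(unknown, corpora):
        # corpora: dict mapping language name -> sample text of roughly
        # equal size. Score each language by how many extra bytes gzip
        # needs to encode the unknown text after its sample.
        scores = {}
        for lang, sample in corpora.items():
            base = len(gzip.compress(sample.encode("utf-8")))
            both = len(gzip.compress((sample + "\n" + unknown).encode("utf-8")))
            scores[lang] = both - base
        return min(scores, key=scores.get)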


Ideally, you'd take all the documents in each language, and compress them in turn with the unclassified text, to see which compresses it better. But this won't work very well with gzip, since it compresses based on a 32KB sliding window. You might as well truncate the training data for each class to the last 32KB (more or less). So to get any performance at all out of a gzip-based classifier, you need to combine a ton of individually quite bad predictors with some sort of ensemble method. (The linked code demonstrates a way of aggregating them which does not work at all).
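
You can check the 32KB point directly. In a sketch like the one below (names are mine), scoring the query against the full class corpus and against only its last 32KB should give nearly the same numbers:

    import gzip

    WINDOW = 32 * 1024  # gzip/DEFLATE back-reference window

    def extra_bytes(reference, query):
        ref = reference.encode("utf-8")
        q = query.encode("utf-8")
        return len(gzip.compress(ref + q)) - len(gzip.compress(ref))

    # extra_bytes(class_corpus, query) vs.
    # extra_bytes(class_corpus[-WINDOW:], query)
    # should come out close, since gzip can only back-reference
    # the last 32 KB of context anyway.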


How much better would that get if you appended all but one of the equal-size documents? (Or other combinations, like the top 2 results after using a single one.)


Better, if the compressor can use all that extra context. Gzip, and most traditional general-purpose compressors, can't.

It's hard to use distant context effectively. Even general-purpose compressors that theoretically can will often deliberately reset part of their context, because assuming a big file follows the same distribution throughout as at its beginning often hurts compression more than just starting over periodically.


Now that you mention it, I vaguely recall writing a language classifier based on character histograms as a youth. Good times.
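
For anyone curious, the character-histogram version is only a few lines. A rough sketch, with the per-language profiles assumed precomputed from reference corpora:

    from collections import Counter
    import math

    def char_profile(text):
        counts = Counter(text.lower())
        total = sum(counts.values())
        return {ch: n / total for ch, n in counts.items()}

    def cosine(a, b):
        dot = sum(v * b.get(ch, 0.0) for ch, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb)

    def guess_language(text, profiles):
        # profiles: dict mapping language -> char_profile of a reference corpus
        query = char_profile(text)
        return max(profiles, key=lambda lang: cosine(query, profiles[lang]))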


I mean, they’re using kNN underneath. You can apparently get 97% accuracy with plain kNN at n=4 by comparing raw pixel distances.

So another way to frame this might be that gzip costs a lot of accuracy but may lead to better performance.

https://newpblog.netlify.app/2018-01-24-knn-analysis-on-mnis...
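
The pixel-distance baseline is easy to reproduce with scikit-learn if anyone wants to sanity-check that figure. This is my own sketch, not the linked post's code (reading n=4 as k=4 neighbors), and it takes a while to run:

    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Flattened 28x28 MNIST images, plain Euclidean distance, k=4.
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=10000, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=4)
    knn.fit(X_train, y_train)
    print("test accuracy:", knn.score(X_test, y_test))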


An increasingly common refrain in machine learning is “intelligence is compression.” Folks who believe that might bristle at the distinction between learning and compression.


I doubt it, because learned features already constitute lossy compression. The question is what kind of compression: lossy vs. lossless, learned vs. unlearned.



