I mean, how does it know that, though? How would you know whether the set of possible texts is garbage without running them? Honestly, it feels like you're saying LLMs solved the halting problem for programs, which seems dishonest, granted you could probably guess with high efficiency.
Not a clue. But apparently it does. Try a few nonsense texts yourself, see if it rejects them.
I'm saying that if you're already spidering the whole web and training an LLM on that corpus, then asking an existing LLM "does this page make sense?" for each page is a comparatively small additional load.
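Roughly something like this, as a back-of-the-envelope sketch; the model name, prompt, truncation limit, and the OpenAI-style client are all placeholders I'm assuming, not anyone's actual pipeline:

```python
# Sketch of an "does this page make sense?" filter in a crawl pipeline.
# Assumes an OpenAI-compatible chat endpoint; model and prompt are made up.
from openai import OpenAI

client = OpenAI()

def page_makes_sense(text: str) -> bool:
    """Ask an existing LLM whether a crawled page reads as coherent prose."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any cheap instruct model
        messages=[
            {"role": "system",
             "content": "Answer YES or NO: does the following page read as "
                        "coherent human-written text, rather than gibberish "
                        "or keyword spam?"},
            {"role": "user", "content": text[:4000]},  # truncate to keep it cheap
        ],
        max_tokens=1,
    )
    return resp.choices[0].message.content.strip().upper().startswith("Y")

# corpus = [page for page in crawled_pages if page_makes_sense(page)]
```

Per-page that's one tiny classification call, which is small next to the cost of crawling and training on the same page.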
> guess with high efficiency
Yes, I think that's basically what's happening. Markov nonsense is cheap to produce, but also easy to classify. A more subtle strategy might be more successful (for example, someone down-thread mentions using LLM-generated text, and we know that's quite hard to classify).
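To give a sense of "cheap to produce": a word-level Markov babbler is only a few lines. This is just an illustrative toy, not what any spam operation necessarily runs:

```python
import random
from collections import defaultdict

def markov_babble(corpus_words, n_out=50):
    """Bigram Markov chain over words: fluent-ish locally, nonsense globally."""
    chains = defaultdict(list)
    for a, b in zip(corpus_words, corpus_words[1:]):
        chains[a].append(b)  # record which words follow which
    word = random.choice(corpus_words)
    out = [word]
    for _ in range(n_out - 1):
        nxt = chains[word]
        word = random.choice(nxt) if nxt else random.choice(corpus_words)
        out.append(word)
    return " ".join(out)
```

The output has plausible word pairs but no coherence beyond a two-word window, which is exactly the kind of thing a classifier (or an LLM asked "does this make sense?") picks up on easily.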