> LLMs can now detect garbage much more cheaply than humans can.
Off the top of my head, I don't think this is true for training data. I could be wrong, but it seems very fallible to let GPT-5 be the source of ground truth for GPT-6.
I don't think an LLM can even detect garbage during a training run. During training, the system is only tasked with predicting the next token in the training set; it isn't trying to reason about the validity of the training set itself.
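
To make that concrete, here's a toy sketch. It's a count-based bigram model, not how a real LLM is trained, and the corpus and function names are made up for illustration. The point is only that a next-token objective rewards matching whatever the data says, true or not:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which token follows which. This is the whole 'training' step."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return the most frequent continuation seen after `prev`."""
    return counts[prev].most_common(1)[0][0]

# Hypothetical training set where the garbage outnumbers the correct statement.
corpus = ["the sky is green"] * 3 + ["the sky is blue"]
model = train_bigram(corpus)

# Nothing in training asked whether "green" is true, only how often it appeared.
print(predict_next(model, "is"))
```

Nothing in `train_bigram` ever looks at whether a sentence is correct. If you want filtering, it has to happen in a separate step before the data reaches the training loop.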