Synthetic data doesn't have to come from an LLM. And that paper only showed that if you train on a random sample from an LLM, the resulting second LLM is a worse model of the first LLM's training distribution than the first LLM itself. When people construct synthetic data with LLMs, they typically don't just sample at random; they carefully shape the generation process so that it matches the target task better than the original training distribution does.
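As a rough sketch of what "shaping" can look like in practice: targeted prompts plus a task-specific verifier that rejects bad samples, so the kept set isn't a random draw from the model's distribution. Everything here (sample_from_llm, is_valid, the arithmetic toy task) is made up for illustration, not any particular pipeline.

    import random

    # Hypothetical stand-in for an LLM call: it fabricates grade-school addition
    # problems, sometimes with a wrong answer, so the filter has work to do.
    def sample_from_llm(prompt: str) -> dict:
        a, b = random.randint(1, 99), random.randint(1, 99)
        answer = a + b if random.random() < 0.7 else a + b + random.randint(1, 5)
        return {"question": f"What is {a} + {b}?", "answer": answer}

    # Task-specific verifier: keep only samples whose stated answer checks out.
    def is_valid(example: dict) -> bool:
        nums = [int(t.strip("?")) for t in example["question"].split() if t.strip("?").isdigit()]
        return sum(nums) == example["answer"]

    # The "shaping": prompts target the task and the verifier rejects bad samples.
    def build_dataset(prompts: list[str], per_prompt: int = 100) -> list[dict]:
        dataset = []
        for prompt in prompts:
            candidates = (sample_from_llm(prompt) for _ in range(per_prompt))
            dataset.extend(s for s in candidates if is_valid(s))
        return dataset

    data = build_dataset(["Generate an addition problem with its answer."])
    print(f"kept {len(data)} verified samples out of 100 candidates")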
Even when you do get a sync conflict, Syncthing will rename one of the copies and then you can have KeePassXC merge the two files back into one. So that's still pretty much hassle-free.
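If you want to script the cleanup, here's a rough Python sketch, assuming the database lives at ~/Sync/Passwords.kdbx and that keepassxc-cli is installed (flag names may differ slightly between versions); the same merge can be done from the KeePassXC GUI's Database menu.

    import glob
    import subprocess
    from pathlib import Path

    VAULT = Path.home() / "Sync" / "Passwords.kdbx"  # assumed location of the synced database

    def merge_conflicts(vault: Path) -> None:
        # Syncthing names the losing copy like
        # "Passwords.sync-conflict-<date>-<time>-<device>.kdbx".
        pattern = str(vault.with_name(f"{vault.stem}.sync-conflict-*{vault.suffix}"))
        for conflict in glob.glob(pattern):
            # keepassxc-cli merges the second database into the first and prompts
            # for the password; --same-credentials reuses the main database's password.
            subprocess.run(
                ["keepassxc-cli", "merge", "--same-credentials", str(vault), conflict],
                check=True,
            )
            # Drop the conflict copy once its entries have been merged in.
            Path(conflict).unlink()

    merge_conflicts(VAULT)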
It's true that to train more information into the model you need more trainable parameters, but when people ask for small models, they usually mean models that run at acceptable speeds on their hardware. Techniques like mixture-of-experts increase the number of trainable parameters without a matching increase in FLOPs per token, so the resulting models are large in one sense but small in another.
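A toy numpy illustration of that trade-off (arbitrary dimensions, top-2 routing, plain ReLU feed-forwards as the "experts"; not any particular library's implementation): the parameter count grows with the number of experts, but each token only runs through top_k of them.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff, num_experts, top_k = 64, 256, 8, 2

    # Each expert is just a two-matrix feed-forward block here.
    experts = [
        (rng.standard_normal((d_model, d_ff)) * 0.02,
         rng.standard_normal((d_ff, d_model)) * 0.02)
        for _ in range(num_experts)
    ]
    router = rng.standard_normal((d_model, num_experts)) * 0.02

    def moe_forward(x):
        # Route each token to its top_k experts; only those experts run,
        # so per-token compute stays roughly constant as num_experts grows.
        logits = x @ router
        chosen = np.argsort(logits, axis=-1)[:, -top_k:]
        weights = np.take_along_axis(logits, chosen, axis=-1)
        weights = np.exp(weights) / np.exp(weights).sum(-1, keepdims=True)
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            for slot in range(top_k):
                w1, w2 = experts[chosen[t, slot]]
                h = np.maximum(x[t] @ w1, 0.0)  # ReLU feed-forward
                out[t] += weights[t, slot] * (h @ w2)
        return out

    tokens = rng.standard_normal((4, d_model))
    print(moe_forward(tokens).shape)

    total_params = num_experts * 2 * d_model * d_ff
    active_params = top_k * 2 * d_model * d_ff
    print(f"total expert params: {total_params}, active per token: {active_params}")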
And you don't necessarily need to train all information into the model, you can also use tool calls to inject it into the context. A small model that can make lots of tool calls and process the resulting large context could obtain the same answer that a larger model would pull directly out of its weights.
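A toy sketch of that loop, with both the "model" and the tool mocked out (small_model, lookup, and the TOOL_CALL/TOOL_RESULT convention are all made up for illustration): the point is only that tool results land in the context instead of having to live in the weights.

    # Stand-in retrieval tool; a real one would hit a search index, a database, etc.
    def lookup(query: str) -> str:
        facts = {"capital of australia": "Canberra"}
        return facts.get(query.lower(), "no result")

    # Placeholder "model": if it hasn't seen a tool result yet, it asks for one.
    def small_model(context: list[str]) -> str:
        if not any(line.startswith("TOOL_RESULT:") for line in context):
            return "TOOL_CALL: capital of australia"
        result = next(line for line in context if line.startswith("TOOL_RESULT:"))
        return "ANSWER: " + result.removeprefix("TOOL_RESULT: ")

    def answer(question: str) -> str:
        context = [f"USER: {question}"]
        while True:
            reply = small_model(context)
            if reply.startswith("TOOL_CALL: "):
                # Tool output gets appended to the context for the next model call.
                context.append(f"TOOL_RESULT: {lookup(reply.removeprefix('TOOL_CALL: '))}")
            else:
                return reply.removeprefix("ANSWER: ")

    print(answer("What is the capital of Australia?"))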
The US lost the gambling case because its restrictions on foreign websites were stricter than those on domestic ones. The GATS doesn't prohibit countries from regulating trade; it only requires them to do so in a non-discriminatory manner. Spain isn't blocking foreign websites for copyright infringement that would be legal domestically, so it's in compliance with its obligations.
The "attention is all you need" paper did not invent attention mechanisms. It showed that existing models that were already using attention could have their non-attention parts removed and still worked. So those other parts were unnecessary and only attention was needed.