Synthetic data will play a big role, yes. There are other challenges though, like how verbal descriptions of objects would affect their spatial behavior. Building a generalized simulator that combines those modalities is hard.
In this particular case with Factorio, I suspect generating the synthetic data would be easier, since the rules of the environment are relatively simple and well defined, with quantifiable outcomes.
Isn't it literally infinite via even the simplest simulator?
You could generate an unlimited training set just by implementing tic-tac-toe on an unbounded grid, for example, in like 10 lines of code.
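To make that concrete, here is a rough sketch of what I mean. It runs a bit longer than ten lines once you add a win check, and everything in it is my own assumption rather than an established setup: the names `wins` and `random_game`, the k-in-a-row win rule, and the choice to play uniformly random moves next to existing marks.

```python
import random

# Four line directions; each is also scanned in reverse.
DIRECTIONS = [(1, 0), (0, 1), (1, 1), (1, -1)]

def wins(board, x, y, player, k):
    """True if the mark at (x, y) is part of a line of at least k for player."""
    for dx, dy in DIRECTIONS:
        count = 1
        for sign in (1, -1):
            cx, cy = x + sign * dx, y + sign * dy
            while board.get((cx, cy)) == player:
                count += 1
                cx, cy = cx + sign * dx, cy + sign * dy
        if count >= k:
            return True
    return False

def random_game(rng, k=3, max_moves=200):
    """Play one random game on an unbounded grid.

    Returns (moves, winner): moves is a list of (player, x, y) and
    winner is "X", "O", or None if max_moves is reached first.
    """
    board = {}  # sparse board: only occupied cells are stored
    moves = []
    player = "X"
    for _ in range(max_moves):
        if not board:
            x, y = 0, 0
        else:
            # Assumption: only consider empty cells touching an existing mark.
            candidates = sorted({
                (bx + dx, by + dy)
                for (bx, by) in board
                for dx in (-1, 0, 1)
                for dy in (-1, 0, 1)
                if (bx + dx, by + dy) not in board
            })
            x, y = rng.choice(candidates)
        board[(x, y)] = player
        moves.append((player, x, y))
        if wins(board, x, y, player, k):
            return moves, player
        player = "O" if player == "X" else "X"
    return moves, None
```

Calling `random_game(random.Random(seed))` in a loop over seeds gives as many games as you want to store. Whether random play is useful training data is a separate question, but the volume is effectively unlimited.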