This is interesting, one of our findings was that the Claude was capable of essential tasks & simple automation (i.e iron gear wheel factory in lab-play) but didn't even try to do it during the "build the biggest factory" game episodes. So the models can do these essential tasks but when given a general goal, i.e "complete the game", they don't have a good level of long-term planning to even try to attempt them. Often they just did un-coordinated small-scale constructs without attempting to scale up existing factories
That was also one of our goals, to find out how do the models act when given a very vague and general objective
That was also one of our goals, to find out how do the models act when given a very vague and general objective