Pokémon is increasingly used to evaluate modern large language models, but current practices lack standardization and depend heavily on game-specific harnesses. Pokémon Red involves three major tasks: navigation, combat control, and training a competitive Pokémon team. We find that each comes with limitations: navigation tasks are too hard, combat control is too simple, and Pokémon team training is too expensive. We address these issues in Lmgame Bench, a new framework offering standardized evaluations and initial results across diverse games.
When superintelligence arrives, it will be very interesting to see multi-party gameplay among AIs too. What role humans would play in this story is unclear. Maybe humans won't be able to engage in the games directly either, as they are too naive and would be immediately identified and exploited by AI :)