zhisbug's comments | Hacker News


Pokémon is increasingly used to evaluate modern large language models, but current practices lack standardization and depend heavily on game-specific harnesses. Playing Pokémon Red involves three major tasks: navigation, combat control, and training a competitive Pokémon team. We find each comes with limitations: navigation is too hard, combat control is too simple, and team training is too expensive. We address these issues in Lmgame Bench, a new framework offering standardized evaluations and initial results across diverse games.
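To make the idea of a standardized, harness-free evaluation concrete, here is a minimal sketch of a game-eval loop in the spirit described above. The names (GameEnv, query_model, evaluate) are hypothetical illustrations, not Lmgame Bench's actual API.

    # Minimal sketch of a standardized game-eval loop (illustrative only).
    from dataclasses import dataclass


    @dataclass
    class GameEnv:
        """Tiny text-game stand-in: reach the goal position on a 1-D track."""
        goal: int = 3
        position: int = 0
        steps: int = 0
        max_steps: int = 20

        def observe(self) -> str:
            return f"position={self.position}, goal={self.goal}"

        def act(self, action: str) -> None:
            self.steps += 1
            if action == "right":
                self.position += 1
            elif action == "left":
                self.position -= 1

        def done(self) -> bool:
            return self.position == self.goal or self.steps >= self.max_steps

        def score(self) -> float:
            return 1.0 if self.position == self.goal else 0.0


    def query_model(prompt: str) -> str:
        """Placeholder for an LLM call; a real harness would hit a model API."""
        return "right"


    def evaluate(env: GameEnv) -> float:
        """Run one episode: show the model the state, apply its action, repeat."""
        while not env.done():
            action = query_model(f"State: {env.observe()}\nAction (left/right)?")
            env.act(action)
        return env.score()


    if __name__ == "__main__":
        print("episode score:", evaluate(GameEnv()))

The point of the sketch is the interface: if every game exposes the same observe/act/score loop, models can be compared across games without per-game harness code.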


where other models top out after a few moves



We find that spatial perception and spatial reasoning remain very difficult even for the strongest models like o3 or Claude 3.7.




Sliding tile attention accelerates Hunyuan video generation by 3x with no quality drop and no training required.
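For intuition, here is a toy sketch of the tile-local attention idea: each query tile attends only to key tiles within a small window instead of the full sequence, which is where the compute savings come from. The tile size, window width, and masking scheme below are illustrative assumptions, not the actual Hunyuan kernel.

    # Toy tile-local attention: each query tile attends to nearby key tiles only.
    import numpy as np


    def sliding_tile_attention(q, k, v, tile=4, window=1):
        """q, k, v: (seq_len, dim). Each query tile attends to key tiles
        within `window` tiles on either side of it."""
        seq_len, dim = q.shape
        n_tiles = seq_len // tile
        out = np.zeros_like(v)
        for t in range(n_tiles):
            q_slice = slice(t * tile, (t + 1) * tile)
            lo = max(0, t - window) * tile
            hi = min(n_tiles, t + window + 1) * tile
            # Softmax attention restricted to the local key/value window.
            scores = q[q_slice] @ k[lo:hi].T / np.sqrt(dim)
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True)
            out[q_slice] = weights @ v[lo:hi]
        return out


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
        print(sliding_tile_attention(q, k, v).shape)  # (16, 8)

Restricting each tile to a fixed window drops the cost from quadratic in sequence length to roughly linear, which is the rough intuition behind the reported speedup.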


Try our demo and let us know what you think.


This is pretty clever and seems to have high potential, but it still relies on humans. What happens if, some day, no human can outsmart AI?


When superintelligence arrives, it would be very interesting to see multi-party gameplay among AIs too. What role humans would play in this story is unclear. Maybe humans couldn't directly engage in the games either, as they would be too naive and would be immediately identified and exploited by the AI :)


So transparent.

https://news.ycombinator.com/item?id=43017857


We hope to redefine AI evaluation with our gamified evaluation platform, Game Arena!

