I'd be cautious with such general statements given the rapid pace of development in this area.
Benchmark shelf lives aren't that long.
You omitted the fact that tuning bumped it to 26% vs random.
Sure, questionable what effort is involved in that step, but at the same time, that hints to me that tuning will be the new baseline within the next 12-24 months.
Sure, I would expect it to improve. But it's a bit fishy how 'it took an IQ test!' is in all the highlights, but then they mumble quietly about the score that it actually got and hope no-one is listening to that bit.
It's notable that it was able to attempt it at all, I suppose.