I am thoroughly unimpressed by GPT-5. It still can't compose iambic trimeters in ancient Greek with a proper penthemimeral cæsura, and it insists on providing totally incorrect scansion of the flawed lines it does compose. I corrected its metrical sins twice, which sent it into "thinking" mode until it finally returned a "Reasoning failed" error.
There is no intelligence here: it's still just giving plausible output. That's why it can't metrically scan its own lines or put a cæsura in the right place.
It once again completely fails on an extremely simple test: look at a screenshot of sheet music and tell me what the notes are. Producing a MIDI file from it was, unsurprisingly, far beyond its capabilities.
Interpreting sheet music images is very complex, and I’m not surprised general-purpose LLMs totally fail at it. It’s orders of magnitude harder than text OCR because the notation is two-dimensional.
> I am thoroughly unimpressed by GPT-5. It still can't compose iambic trimeters in ancient Greek with a proper penthemimeral cæsura, and it insists on providing totally incorrect scansion of the flawed lines it does compose
It's well-known at this point that LLMs don't handle spelling, syllables, rhythm, meter, or other word-form-based questions well due to tokenization -- sometimes sheer scale (or leaning on code) can get the right answer if they're lucky, but they're literally blind to the individual letters.
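To make the tokenization point concrete, here's a minimal sketch using OpenAI's tiktoken library (the split shown in the comment is illustrative; the actual chunks vary by encoding and model):

```python
# Minimal sketch: a BPE tokenizer splits text into chunks that don't
# line up with letters or syllables, so the model never "sees" spelling.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "penthemimeral"
ids = enc.encode(word)
chunks = [enc.decode([i]) for i in ids]
print(chunks)  # a handful of multi-letter chunks, not 13 letters
```

Whatever the exact split, the model consumes those chunk IDs rather than characters, which is why letter- and syllable-level questions are hard for it.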
(Incidentally, go back in time even five years and this specific expectation of AI capability sounds comically overblown. "Everything's amazing and nobody's happy.")
No, it’s easy if the kid already knows the alphabet. Latin scansion was standard grade school material up until the twentieth century. Greek less so, but the rules for it are very clear-cut and well understood. An LLM will regurgitate the rules to you in any language you want, but it cannot actually apply the rules properly.
Is ancient Greek similar enough to modern Greek that an elementary school kid could learn to compose anything non-boilerplate in an hour? Also, do you know that an LLM can't do it if you feed it the same training material you'd need to train the kid in that hour?
To outperform GPT-5 in this case, all the kid needs to do is correctly apply the syllable-quantity constraint. Even if they can't quickly compose many such poems, they would still be able to tell when something they've written doesn't match the constraints.
AI looks like it understands things because it generates text that sounds plausible. Poetry requires the application of certain rules to that text, and the rules for Latin and Greek poetry are very simple and well understood. Scansion is especially easy once you understand the concept, and you actually can, as someone else suggested, train a child to scan poetry by applying these rules.
An LLM will spit out what looks like poetry, but it will violate certain rules. It will generate some passable hexameters but fail harder on trimeters, presumably because it is trained on more hexametric data (epic poetry: think Homer) than trimetric (iambic and tragedy, where the meter is mixed with others). It is trained on text containing the rules for poetry too, so it can regurgitate them, such as the definition of a penthemimeral cæsura. But LLMs do not understand those rules and thus cannot apply them as a child could. That makes ancient poetry a great way to show how far LLMs are from performing even simple, rules-based analysis, and how badly they hide that lack of understanding by BS-ing.
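For anyone who hasn't done scansion: here's how mechanical the check is. Below is a toy sketch of my own (not a real scanner): it assumes the syllable quantities have already been marked by hand, and simply checks them against the iambic trimeter scheme (anceps, long, short, long, three times over, ignoring resolution and the final anceps) plus a word break after the fifth element for the penthemimeral cæsura.

```python
# Toy scansion checker: iambic trimeter is x - u - | x - u - | x - u -
# (x = anceps, either quantity allowed). Resolution and the final
# anceps are ignored to keep the sketch short.
SCHEME = "x-u-" * 3  # twelve metrical positions

def fits_trimeter(sylls):
    """sylls: list of (quantity, ends_word); quantity is '-' or 'u'."""
    return len(sylls) == len(SCHEME) and all(
        slot == "x" or q == slot for (q, _), slot in zip(sylls, SCHEME)
    )

def has_penthemimeral_caesura(sylls):
    """Penthemimeral cæsura = word break after the fifth half-foot."""
    return sylls[4][1]  # does a word end at position five?

# Example line with quantities marked by hand (invented for illustration)
line = [("-", False), ("-", True), ("u", False), ("-", True),
        ("-", True), ("-", False), ("u", False), ("-", True),
        ("u", False), ("-", False), ("u", False), ("-", True)]
print(fits_trimeter(line), has_penthemimeral_caesura(line))  # True True
```

The hard part for a human is marking the quantities in the first place; applying the pattern afterward is trivial, which is exactly the point above.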
This is not a useful diversion; it's like arguing about whether a submarine swims.
LLMs are simple: it doesn't take much more than high school math to explain their building blocks.
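To give a flavor of that claim, here's a minimal sketch of one attention head in plain numpy (my own illustration, not a full transformer): nothing beyond matrix multiplication and a softmax.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # token-to-token affinities
    return softmax(scores) @ V               # weighted mix of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(attention(x @ Wq, x @ Wk, x @ Wv).shape)  # (4, 8)
```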
What's interesting is that they can flexibly remix tasks they've been trained on, creating new combinations they weren't directly trained on: compare this to earlier, smaller models like T5 that had a few set prefixes per task.
They have underlying flaws. Your example, for instance, is more about the limitations of tokenization than about "understanding". But those flaws don't keep them from being useful.
They do stop them from being intelligent, though. Being able to spit out cool and useful stuff is a great achievement. But actual understanding is required for AGI, and this demonstrably isn't that, right?
I too can't compose iambic trimeters in ancient Greek, but I'm normally regarded as of average+ intelligence. I think it's a bit of an unfair test, as that sort of thing is based on the rhythm of spoken speech, and GPT-5 doesn't really deal with audio in a deep way.
Most classicists today can’t actually speak Latin or Greek, especially observing vowel quantities and rhythm properly, but you’d be hard pressed to find one who can’t scan poetry with pen and paper. It’s a very simple application of rules to written characters on a page, but it is application, and AI still doesn’t apply concepts well.