Those higher-level kinds of mode collapse are hard to quantify in an automated way; fixing them would need interventions upstream, at pre- and post-training.
This approach is targeted at the kinds of mode collapse that we can meaningfully measure and fix after the fact, which is constrained to these verbal tics. That doesn't fix the higher-level mode collapse in semantics & creativity that you're identifying -- but I think fixing the verbal tics is still important and useful.
> but I think fixing the verbal tics is still important and useful.
I don't. I think they're useful for flagging the existence of mode-collapse and also for providing convenient tracers for AI-written prose. Erasing only the verbal tics with the equivalent of 's/ - /; /g' (look ma! no more 4o em dashes!) is about the worst solution you could come up with, and if adopted it would lead to a kind of global gaslighting. It would be the equivalent of a COVID vaccine which only suppresses coughing but doesn't change R, or of fixing a compiler warning by disabling the check.
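To make it concrete, the scrub I'm objecting to is roughly this (a throwaway sketch; the `delve` swap is a made-up example of the same move, not anything any lab actually ships):

```python
import re

def scrub_tics(text: str) -> str:
    """Erase the telltale surface tics without touching the distribution
    that produced them. Purely cosmetic; R is unchanged."""
    text = re.sub(r" - ", "; ", text)          # the 's/ - /; /g' move
    text = re.sub(r"\bdelve\b", "dig", text)   # made-up example of a word swap
    return text
```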
If you wanted to do useful research here, you'd be doing the opposite: figuring out how to make the verbal expressions even more sensitive to the underlying mode-collapse, to help research into fixing it and to raise awareness. (This would be useful even on the released models, to more precisely quantify their overall mode-collapse, which I think is poorly captured by existing creative writing benchmarks, and is one reason I've had a hard time believing things like Eqbench rankings.)
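Even a crude first pass at that measurement would be more useful than the scrub: track how often the known tics show up per thousand words and watch the rate across models and checkpoints. Something like this sketch (the tic list here is made up; the real research is finding expressions that track the underlying collapse much more tightly):

```python
import re

# Placeholder tic patterns -- the interesting work is finding expressions
# that are far more sensitive to the underlying mode-collapse than these.
TIC_PATTERNS = [r" - ", r"\bdelve\b", r"\btapestry\b", r"\btestament to\b"]

def tic_rate(samples: list[str]) -> float:
    """Tic hits per 1,000 words across a batch of model samples."""
    hits = words = 0
    for s in samples:
        words += len(s.split())
        hits += sum(len(re.findall(p, s, flags=re.IGNORECASE)) for p in TIC_PATTERNS)
    return 1000 * hits / max(words, 1)
```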
None of those factors go into the scoring? What's the point of showing them, then? What information are these factors providing if they aren't being used in the rubric? Without knowing the rubric (which isn't provided), these scores are baseless to me, and I don't know how to use them. Black-box numbers are not useful.
These are important things to look at because they're where LLMs typically have trouble. People complain that their stories are too short, that they use stock phrases and purple prose, that they start to repeat themselves at a certain point, and that quality tends to fall off. Doing well in these problem areas doesn't guarantee good writing, but it does allow for it.
Personally, what I find interesting is the insight into the trajectory of model abilities over time. In the time I've been running these benchmarks, the writing has gone from pure slop to broadly competent (at short form, at least) and occasionally compelling.
I don't think it will be much longer until they're generating content you'll actually want to read.
Meanwhile, a lot of people are finding uses for LLMs in partner-writing, lower-stakes prose, or roleplay.
The old version of the creative writing eval had several "in the style of" prompts actually! But I got tired of reading bad Hemingway impersonations so I cut them out of the new version.
Not internal consistency exactly, but there are criteria checking how well the chapter plan was followed (which is all the way up at the top of the context window).
This is done per chapter, and the score trendline is what you see in the "degradation" column.
I would say this could be a reasonable proxy for internal consistency, since it's measuring more or less the same ability, i.e. how well the model keeps track of details as the context window grows.
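If it helps, the trendline itself is nothing exotic; conceptually it's just a slope fit over the per-chapter scores, along these lines (a sketch, not the eval's actual code):

```python
import numpy as np

def degradation_slope(chapter_scores: list[float]) -> float:
    """Straight-line fit over per-chapter scores; a negative slope means
    quality (and plan-following) falls off as the context window fills."""
    x = np.arange(len(chapter_scores))
    slope, _intercept = np.polyfit(x, chapter_scores, 1)
    return float(slope)

# e.g. degradation_slope([7.8, 7.5, 7.1, 6.4]) -> roughly -0.46
```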