Some cursory clicking about didn't reveal to me the actual corpus they used, only that it is several trillion tokens 'divided across the languages'. I'm curious mainly because Irish (among some other similarly endangered languages on the list) typically has any large corpus come from legal/governmental texts that are required to be translated. There must surely be only a relatively tiny amount of colloquial Irish in the corpus. It be interesting to see some evals in each language particularly with native speakers.
I think LLMs may be on the whole very positive for endangered languages such as Irish, but before it becomes positive I think there's an amount of danger to be navigated (see Scots Gaelic wikipedia drama for example)
Can you provide a link about the “Scots Gaelic Wikipedia drama” you reference? I've heard of drama related to the Scots Wikipedia but that has nothing to do with Gaelic.
I think LLMs may be on the whole very positive for endangered languages such as Irish, but before it becomes positive I think there's an amount of danger to be navigated (see Scots Gaelic wikipedia drama for example)
In any case I think this is a great initiative.