Is Part-of-Speech Tagging a Solved Problem for Icelandic?

Örvar Kárason, Hrafn Loftsson

Research output: Chapter in book/report/conference proceeding; Conference contribution; peer-reviewed

Abstract

We train and evaluate four Part-of-Speech tagging models for Icelandic. Three are older models that obtained the highest accuracy for Icelandic when they were introduced. The fourth model is of a type that currently reaches state-of-the-art accuracy. We use the most recent version of the MIM-GOLD training/testing corpus, its newest tagset, and augmentation data to obtain results that are comparable between the various models. We examine the accuracy improvements with each model and analyse the errors produced by our transformer model, which is based on a previously published ConvBERT model. For the set of errors that all the models make, and for which they predict the same tag, we extract a random subset for manual inspection. Extrapolating from this subset, we obtain a lower bound estimate on annotation errors in the corpus as well as on some unsolvable tagging errors. We argue that further tagging accuracy gains for Icelandic can still be obtained by fixing the errors in MIM-GOLD and, furthermore, that it should still be possible to squeeze out some small gains from our transformer model.
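The shared-error analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy tags, and the model names below are hypothetical, and the extrapolation is a simple proportional scaling that assumes the inspected subset is representative of the full shared-error set.

```python
import random

def shared_errors(gold, predictions):
    """Return token indices where every model is wrong and all models agree.

    gold: list of gold tags; predictions: dict mapping a model name to its
    list of predicted tags (same length as gold).
    """
    shared = []
    for i, gold_tag in enumerate(gold):
        tags = {model_tags[i] for model_tags in predictions.values()}
        # One distinct predicted tag, and it differs from the gold tag.
        if len(tags) == 1 and gold_tag not in tags:
            shared.append(i)
    return shared

def sample_for_inspection(indices, k, seed=0):
    """Draw a reproducible random subset of at most k indices for manual review."""
    rng = random.Random(seed)
    return sorted(rng.sample(indices, min(k, len(indices))))

def extrapolate(n_shared, n_sampled, n_flagged):
    """Scale the count flagged in the sample up to the whole shared-error set."""
    return round(n_shared * n_flagged / n_sampled)

# Toy data (hypothetical tags and model names, not taken from MIM-GOLD).
gold = ["n", "v", "a", "n", "v"]
predictions = {
    "model_a": ["n", "n", "a", "v", "v"],
    "model_b": ["n", "n", "a", "v", "v"],
    "model_c": ["n", "n", "v", "v", "v"],
}

shared = shared_errors(gold, predictions)
subset = sample_for_inspection(shared, k=2)
# Suppose manual inspection finds 5 annotation errors in a sample of 20
# drawn from 100 shared errors (made-up numbers for illustration only).
estimate = extrapolate(n_shared=100, n_sampled=20, n_flagged=5)
```

An estimate built this way counts only annotation errors that fall inside the shared-error set, which is one reading of why the abstract calls the figure a lower bound.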
Original language: English
Title of host publication: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Place of publication: Tórshavn, Faroe Islands
Publisher: University of Tartu Library
Pages: 71-79
Number of pages: 9
Publication status: Published - 1 May 2023
