Named entity recognition for Icelandic: Annotated corpus and models

Svanhvít L. Ingólfsdóttir*, Ásmundur A. Guðjónsson, Hrafn Loftsson

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Named entity recognition (NER) can be a challenging task, especially in highly inflected languages where each entity can have many different surface forms. We have created the first NER corpus for Icelandic by annotating 48,371 named entities (NEs) using eight NE types, in a text corpus of 1 million tokens. Furthermore, we have used the corpus to train three machine learning models: first, a CRF model that makes use of shallow word features and a gazetteer function; second, a perceptron model with shallow word features and externally trained word clusters; and third, a BiLSTM model with external word embeddings. Finally, we applied simple voting to combine the model outputs. The voting method obtains an $F_1$ score of 85.79, gaining 1.89 points compared to the best performing individual model. The corpus and the models are publicly available.
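
The voting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the voting scheme, so everything below is an assumption for illustration: per-token majority voting, a tie-break that falls back to one designated model, the function name `vote_labels`, and the BIO-style label strings. It is not the authors' implementation.

```python
from collections import Counter
from typing import List, Sequence


def vote_labels(predictions: Sequence[Sequence[str]], fallback_model: int = 0) -> List[str]:
    """Combine per-token labels from several taggers by simple majority vote.

    Hypothetical helper: `predictions[i][j]` is the label model i assigns to
    token j. When no label has a strict majority, the label from
    `predictions[fallback_model]` is used (an assumed tie-break rule).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    combined = []
    for token_labels in zip(*predictions):
        label, votes = Counter(token_labels).most_common(1)[0]
        if votes * 2 > len(token_labels):
            # A strict majority of the models agree on this label.
            combined.append(label)
        else:
            # No majority: trust the designated fallback model.
            combined.append(token_labels[fallback_model])
    return combined


# Made-up model outputs for a four-token sentence (labels are illustrative only).
crf = ["B-Person", "O", "O", "B-Location"]
perceptron = ["B-Person", "O", "O", "O"]
bilstm = ["O", "O", "O", "B-Location"]

combined = vote_labels([crf, perceptron, bilstm], fallback_model=2)
print(combined)
```

Voting token by token can produce label sequences that are not well-formed under a BIO scheme (for example an inside tag with no preceding begin tag), so a real system would likely need a repair step or would vote over entity spans instead.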

Original language: English
Title of host publication: Statistical Language and Speech Processing - 8th International Conference, SLSP 2020, Proceedings
Editors: Luis Espinosa-Anke, Irena Spasic, Carlos Martín-Vide
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 46-57
Number of pages: 12
ISBN (Print): 9783030594299
DOIs
Publication status: Published - 2020
Event: 8th International Conference on Statistical Language and Speech Processing, SLSP 2020 - Cardiff, United Kingdom
Duration: 14 Oct 2020 to 16 Oct 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12379 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 8th International Conference on Statistical Language and Speech Processing, SLSP 2020
Country/Territory: United Kingdom
City: Cardiff
Period: 14/10/20 to 16/10/20

Bibliographical note

Publisher Copyright:
© Springer Nature Switzerland AG 2020.

Other keywords

  • BiLSTM
  • Clustering
  • Corpus annotation
  • CRF
  • Machine learning
  • Named entity recognition
