Risamálheild: A very large icelandic text corpus

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

24 Citations (Scopus)

Abstract

We present Risamálheild, the Icelandic Gigaword Corpus (IGC), a corpus containing more than one billion running words from mostly contemporary texts. The work was carried out with minimal amount of work and resources, focusing on material that is not protected by copyright and sources which could provide us with large chunks of text for each cleared permission. The two main sources considered were therefore official texts and texts from news media. Only digitally available texts are included in the corpus and formats that can be problematic are not processed. The corpus texts are morphosyntactically tagged and provided with metadata. Processes have been set up for continuous text collection, cleaning and annotation. The corpus is available for search and download with permissive licenses. The dataset is intended to be clearly versioned with the first version released in early 2018. Texts will be collected continually and a new version published every year.

Original languageEnglish
Title of host publicationLREC 2018 - 11th International Conference on Language Resources and Evaluation
EditorsHitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
PublisherEuropean Language Resources Association (ELRA)
Pages4361-4366
Number of pages6
ISBN (Electronic)9791095546009
Publication statusPublished - 2018
Event11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
Duration: 7 May 201812 May 2018

Publication series

NameLREC 2018 - 11th International Conference on Language Resources and Evaluation

Conference

Conference11th International Conference on Language Resources and Evaluation, LREC 2018
Country/TerritoryJapan
CityMiyazaki
Period7/05/1812/05/18

Bibliographical note

Publisher Copyright:
© LREC 2018 - 11th International Conference on Language Resources and Evaluation. All rights reserved.

Other keywords

  • Icelandic
  • Text corpora

Fingerprint

Dive into the research topics of 'Risamálheild: A very large icelandic text corpus'. Together they form a unique fingerprint.

Cite this