The Icelandic Parsed Historical Corpus (IcePaHC)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

20 Citations (Scopus)

Abstract

We describe the background for and building of IcePaHC, a one-million-word parsed historical corpus of Icelandic which has just been finished. This corpus, which is completely free and open, contains fragments of 60 texts ranging from the late 12th century to the present. We describe the text selection and text collecting process and discuss the quality of the texts and their conversion to modern Icelandic spelling. We explain why we chose to use a Penn-style phrase structure annotation scheme and briefly describe the syntactic annotation process. We also describe a spin-off project which is only in its beginning stages: a parsed historical corpus of Faroese. Finally, we advocate the importance of an open source policy as regards language resources.

Original language: English
Title of host publication: Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012
Editors: Mehmet Ugur Dogan, Joseph Mariani, Asuncion Moreno, Sara Goggi, Khalid Choukri, Nicoletta Calzolari, Jan Odijk, Thierry Declerck, Bente Maegaard, Stelios Piperidis, Helene Mazo, Olivier Hamon
Publisher: European Language Resources Association (ELRA)
Pages: 1977-1984
Number of pages: 8
ISBN (Electronic): 9782951740877
Publication status: Published - 2012
Event: 8th International Conference on Language Resources and Evaluation, LREC 2012 - Istanbul, Turkey
Duration: 21 May 2012 to 27 May 2012

Publication series

Name: Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012

Conference

Conference: 8th International Conference on Language Resources and Evaluation, LREC 2012
Country/Territory: Turkey
City: Istanbul
Period: 21/05/12 to 27/05/12

Bibliographical note

Funding Information:
The building of IcePaHC was supported by the Icelandic Research Fund (Rannsóknasjóður), grant no 090662011, Viable Language Technology beyond English – Icelandic as a test case; the U.S. National Science Foundation (NSF) International Research Fellowship Program (IRFP), grant #OISE-0853114, Evolution of Language Systems: a comparative study of grammatical change in Icelandic and English; the University of Iceland Research Fund (Rannsóknasjóður Háskóla Íslands), grant Icelandic Diachronic Treebank (Sögulegur íslenskur trjábanki); and the EU ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme, grant agreement no 270899 (META-NORD). The Faroese Parsed Historical Corpus is funded by the University of Iceland Research Fund (Rannsóknasjóður Háskóla Íslands), grant Faroese treebank (Frumgerð færeysks trjábanka). Thanks are due to several colleagues who generously gave us access to unpublished texts that they are editing. Thanks are also due to authors of copyrighted material who allowed us to use and distribute their texts. Thanks to Hrafn Loftsson who wrote most of the IceNLP software, to Brynhildur Stefánsdóttir and Hulda Óladóttir who assisted in parsing the texts, and to several students who keyed in a number of texts. Thanks to Victoria Rosén, Koenraad de Smedt and Paul Meurer at the University of Bergen for making IcePaHC a part of the INESS repository. Thanks to anonymous reviewers for useful comments. Much of this material has previously been published in Rögnvaldsson et al. (2011), and IcePaHC has been presented at various occasions, such as the RILiVS workshop in Oslo in September 2009 (Rögnvaldsson, Ingason and Sigurðsson, 2011), talks at the University of Pennsylvania in Philadelphia, the University of Massachusetts at Amherst and New York University in May 2010, the annual conferences of the Institute of Humanities at the University of Iceland in Reykjavík in March 2011 and 2012, the MENOTA general assembly in Reykjavík in August 2011, the ACRH workshop in Heidelberg in January 2012, etc. We thank the audiences at these occasions for valuable discussion and comments. Last but not least, we would like to thank our collaborators at the University of Pennsylvania, especially Tony Kroch and Beatrice Santorini, for their invaluable contributions to this work.

Other keywords

  • Annotation
  • Faroese
  • Icelandic
  • Parsed corpus
  • Treebank
