Web Corpora of Volga-Kama Uralic Languages

Timofey Arkhangelskiy


This paper presents corpora of five minority Uralic languages that belong to or are adjacent to the Volga-Kama area, which has been characterized as a Sprachbund (Bereczki 1983, Helimski 2003). A total of 11 corpora contain written and, in one case, spoken texts in Udmurt, Komi, Meadow Mari, Erzya and Moksha. The described resources are “web corpora” both in terms of their accessibility (all of them are accessible through a web-based query interface) and, in most cases, in terms of the medium (almost all texts come from web resources, such as digital newspapers and social media). The paper describes the corpora from the user's perspective. The main focus is on the search capabilities and on certain research questions that can be studied with the help of these corpora. All corpora are available at http://volgakama.web-corpora.net/.
