ArmSpeech: Armenian Spoken Language Corpus

Varuzhan H. Baghdasaryan

Abstract

The Armenian language is an independent branch of the Indo-European language family and the official language of the Republic of Armenia and the Republic of Artsakh. According to various reliable sources, approximately 3 million people in Armenia and 10-12 million people in the Armenian Diaspora speak Armenian as their native language. The largest communities outside Armenia are in the United States of America, Canada, the Russian Federation, the Islamic Republic of Iran, the French Republic, the Syrian Arab Republic, and the Lebanese Republic. This paper presents ArmSpeech, a collection of annotated Armenian speech intended for research and development of natural language processing (NLP) technologies. ArmSpeech is designed for speech-to-text and text-to-speech applications but can also be used in other domains (e.g., language identification). The corpus contains 6206 high-quality audio samples totaling 11 hours, 46 minutes, and 26 seconds (11.77 hours) of annotated native Armenian speech from multiple speakers of various ages, genders, and accents. According to the results of this research, this is the most extensive Armenian speech corpus in the public domain for speech recognition, speech synthesis, and spoken language identification systems.
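The duration figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it converts the stated 11 h 46 min 26 s into decimal hours and derives a mean sample length from the stated sample count. The mean is a derived value, not a figure reported in the abstract, and the function names are hypothetical.

```python
# Figures quoted in the abstract.
NUM_SAMPLES = 6206
HOURS, MINUTES, SECONDS = 11, 46, 26


def total_seconds(hours, minutes, seconds):
    """Convert an h/m/s duration into a total number of seconds."""
    return hours * 3600 + minutes * 60 + seconds


def mean_clip_seconds(total, count):
    """Average duration per sample (derived, not stated in the abstract)."""
    return total / count


total = total_seconds(HOURS, MINUTES, SECONDS)
hours_decimal = total / 3600
mean_clip = mean_clip_seconds(total, NUM_SAMPLES)

print(f"total duration: {total} s = {hours_decimal:.2f} h")
print(f"mean sample length: {mean_clip:.2f} s")
```

The decimal-hours result agrees with the 11.77 hours given in the abstract, and the derived mean suggests samples of a few seconds each, on average.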

Keywords

Armenian speech corpus; speech recognition; speech-to-text; speech synthesis; text-to-speech; spoken language identification

Cite This Article

Baghdasaryan, V. H. (2022). ArmSpeech: Armenian Spoken Language Corpus. International Journal of Scientific Advances (IJSCIA), Volume 3 | Issue 3: May-Jun 2022, Pages 454-459, URL: https://www.ijscia.com/wp-content/uploads/2022/06/Volume3-Issue3-May-Jun-No.283-454-459.pdf

Volume 3 | Issue 3: May-Jun 2022 

 

ISSN: 2708-7972

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 (International) Licence (CC BY-NC 4.0).
