A method for quantifying mispronunciation in L2 speech

Activity: Talk or presentation › Oral presentation

Description

Currently, no mobile applications, websites, or other online tools on the market enable learners of Icelandic as a second or foreign language to practise real speaking skills. Many online tools claim to help learners practise spoken Icelandic, but what they actually offer are prerecorded sentences that learners can listen to and, if they wish, repeat; no one listens to the learners to correct their mispronunciations. This article presents the development of a computer-assisted pronunciation training (CAPT) system for Icelandic. The system not only gives learners selected examples of vowels, consonants, words, phrases, and sentences to listen to, but also lets them repeat these sounds: it listens to the learners' pronunciation and gives immediate feedback on mispronounced parts.

We develop a method for scoring pronunciation accuracy, using dynamic time warping (DTW) to quantify how closely a learner's speech matches both native (L1) and second-language (L2) reference speakers. DTW measures the similarity between speech samples by aligning and comparing corresponding elements; it accurately quantifies L2 accent strength and has been used in similar CAPT systems (Yue et al. 2017, Bartelds et al. 2020). However, the DTW distance of an L2 test speaker to L1 references alone is unreliable for individual utterances, because of acoustic (mis)matches in irrelevant aspects such as the speaker's vocal tract size or the recording environment. We therefore compare each test utterance to both L1 and L2 references and use the DTW difference ratio as the pronunciation score, reflecting the expectation that accurate pronunciations are much closer to the L1 than to the L2 references. Similar two-way comparisons have often proven beneficial for CAPT (Fu et al. 2020, Jia et al. 2014).

The pronunciation score is evaluated on its ability to distinguish native from non-native speakers, using the NBTale corpus of Norwegian. The method is run on both L1 and L2 test speakers, and the overlapping coefficient (OVL) of their score distributions expresses the method's sensitivity. Preliminary findings suggest that when a score threshold is set to accept at least 95% of L1 input, non-native-like features are identified in 97% of L2 speakers' sentences, 63% of their words, and 46% of their phones. We also evaluate the method of Bartelds et al., which measures distance to L1 references only. Here the overlapping coefficient of the L1 and L2 score distributions is 0.33 for sentences, 0.62 for words, and 0.76 for phonetic segments, indicating much weaker discrimination between L1 and L2 speech: a threshold that accepts 95% of L1 input detects errors in only 68% of L2 sentences, 30% of words, and 17% of phones. In conclusion, the DTW difference ratio is sensitive to non-native-like pronunciations by L2 speakers without rejecting valid L1 variation, and it performs better than the Bartelds et al. baseline.
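
To make the scoring idea more concrete, the sketch below shows one plausible implementation of a two-way DTW comparison and a histogram-based estimate of the overlapping coefficient. It is a minimal illustration, not the system described in the talk: the per-frame acoustic feature representation (e.g. MFCC matrices), the exact form of the ratio, and the function names are assumptions made for the example.

```python
import numpy as np

def dtw_distance(a, b):
    """Length-normalised DTW distance between two feature sequences
    a, b of shape (T, D), e.g. per-frame acoustic feature vectors."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])       # frame-level distance
            cost[i, j] = d + min(cost[i - 1, j],          # skip a test frame
                                 cost[i, j - 1],          # skip a reference frame
                                 cost[i - 1, j - 1])      # align the two frames
    return cost[n, m] / (n + m)

def pronunciation_score(test, l1_refs, l2_refs):
    """Two-way comparison (assumed form of the difference ratio): mean DTW
    distance to L2 references divided by mean DTW distance to L1 references.
    Higher values mean the utterance is relatively closer to native speech."""
    d_l1 = np.mean([dtw_distance(test, r) for r in l1_refs])
    d_l2 = np.mean([dtw_distance(test, r) for r in l2_refs])
    return d_l2 / d_l1

def overlapping_coefficient(scores_a, scores_b, bins=50):
    """OVL of two score distributions, estimated from shared-range histograms;
    0 = fully separated, 1 = identical distributions."""
    lo = min(np.min(scores_a), np.min(scores_b))
    hi = max(np.max(scores_a), np.max(scores_b))
    p, edges = np.histogram(scores_a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(scores_b, bins=bins, range=(lo, hi), density=True)
    return float(np.sum(np.minimum(p, q)) * (edges[1] - edges[0]))
```

An acceptance threshold could then be set at, say, the 5th percentile of the L1 score distribution (np.percentile(l1_scores, 5)), so that at least 95% of L1 input is accepted, mirroring the evaluation described above.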
Period: Aug 2022
Event title: EUROCALL 2022: Intelligent CALL, granular systems and learner data
Event type: Conference
Location: Reykjavík, Iceland
Degree of Recognition: International

Keywords

  • Computer Assisted Language Learning
  • Computer assisted pronunciation training
  • Icelandic