You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties

1School of Computing Science, 3Department of Linguistics, Simon Fraser University, Canada
2Université de Franche-Comté, SUPMICROTECH, CNRS, Institut FEMTO-ST, France
Speech Synthesis Workshop 2025

Abstract

We present the first text-to-speech (TTS) system tailored to second language (L2) speakers. We use duration differences between American English tense (longer) and lax (shorter) vowels to create a "clarity mode" for Matcha-TTS. Our perception studies showed that French-L1, English-L2 listeners made at least 9.15% fewer transcription errors when using our clarity mode, and found it more encouraging and respectful than speech that was slowed down overall. Remarkably, listeners were not aware of these effects: despite the decreased word error rate in clarity mode, listeners still believed that slowing all target words was the most intelligible, suggesting that actual intelligibility does not correlate with perceived intelligibility. Additionally, we found that Whisper-ASR did not use the same cues as L2 speakers to differentiate difficult vowels and is not sufficient to assess the intelligibility of TTS systems for these individuals.

Previous Work


In our paper, psychophysical reverse-correlation was used to reconstruct L1-English and L2-English listeners' mental representations of duration and pitch in tense-lax vowel contrasts in a data-driven manner. For L2-English speakers who spoke French, Mandarin or Japanese as their L1, it was found that vowel duration, but not pitch, affected perception of tense/lax vowels. This suggests that we can apply a duration mechanism to improve comprehension of this English vowel contrast when an L2 listener struggles to use the primary formant cues.

Clarity Mechanism

To enable L2 clarity mode in Matcha-TTS, we added a clarity flag that can be set to "True" or "False" at synthesis time, along with a markup to control which words are emphasized. The user (or a large language model driving a dialogue system) surrounds difficult words with exclamation points, e.g., "!peel!", so that the TTS can identify the words to be treated for clarity through several steps: 1) Each flagged word is parsed to see whether it contains tense or lax vowels. 2) If the word contains tense vowels but no lax vowels, the clarity modification is applied to the tense-vowel-containing word. 3) If the word contains both tense and lax vowels, and the tense vowel has primary stress, the clarity modification is applied to the tense-vowel-containing word.
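The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Matcha-TTS implementation: the ARPAbet vowel sets are an assumption inferred from the example minimal pairs on this page (scene/sin, fool/full, bot/but), and the toy lexicon and the function names flagged_words, should_apply_clarity and clarity_plan are hypothetical.

```python
import re

# ASSUMPTION: ARPAbet vowel classes inferred from the example minimal pairs
# (scene/sin, fool/full, bot/but); the real system's inventory may differ.
TENSE_VOWELS = {"IY", "UW", "AA"}
LAX_VOWELS = {"IH", "UH", "AH"}

# Hypothetical toy lexicon; digits mark stress (1 = primary, 0 = unstressed).
LEXICON = {
    "peel": ["P", "IY1", "L"],
    "pill": ["P", "IH1", "L"],
    "repeat": ["R", "IH0", "P", "IY1", "T"],
    "city": ["S", "IH1", "T", "IY0"],
}

# Words wrapped in exclamation points, e.g. "!peel!".
MARKUP = re.compile(r"!(\w+)!")


def flagged_words(text):
    """Return the lowercased words marked with !...! in the input text."""
    return [m.group(1).lower() for m in MARKUP.finditer(text)]


def should_apply_clarity(phones):
    """Apply the decision rule from steps 1-3 to one word's phones."""
    # Step 1: find tense and lax vowels (stress digits are stripped first).
    tense = [p for p in phones if p.rstrip("012") in TENSE_VOWELS]
    has_lax = any(p.rstrip("012") in LAX_VOWELS for p in phones)
    if not tense:
        return False
    # Step 2: tense vowels only -> apply the clarity modification.
    if not has_lax:
        return True
    # Step 3: both classes present -> apply only if a tense vowel
    # carries primary stress.
    return any(p.endswith("1") for p in tense)


def clarity_plan(text, lexicon=LEXICON, clarity=True):
    """Map each flagged word found in the lexicon to True/False."""
    if not clarity:
        return {}
    return {
        w: should_apply_clarity(lexicon[w])
        for w in flagged_words(text)
        if w in lexicon
    }


plan = clarity_plan("He said !peel! not !pill!, then !repeat! it in the !city!")
print(plan)
```

In this sketch, words mapped to True are the ones that would receive the clarity modification; "city" is skipped because its tense vowel is unstressed, and "pill" because it contains no tense vowel.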

Audio Examples Single Word

"The phrase had fool somewhere in the middle of it", "I saw full written on the notepad"

Base

Emphasis

Stretch

Clarity

Audio Examples Double Word

"The sign mentioned sin, but the person said scene", "He wrote down bot, but remembered it as but", "In his talk he kept using could, but I am pretty sure he meant cooed"

Base

Emphasis

Stretch

Clarity

Results Word Error Rate

We observed that our "clarity" mode, a stretch applied to tense-vowel-containing words, could overcome the bias towards lax vowels. Lengthening the target tense-vowel-containing word reduced the WER relative to the baseline. Importantly, we also confirmed that stretching lax vowels, as is typical in L2-directed speech, results in more errors. Figures: single target word (left), double target word (right).

Single target word
Double target word

Results MOS Scores

Surprisingly, despite the objective improvements seen in the WER, participants perceived a stretch as significantly more intelligible for lax vowels. Slowing down was also rated as requiring less listening effort. However, slowing the entire sentence was found to be significantly less respectful and encouraging.


Whisper Test

Speech synthesis studies often use human MOS scores but rely on ASR for transcription accuracy. We ran Whisper ASR on 72 phrases and calculated the overall WER (WERt) and the WER on only the target words. As with the L2 speakers, Whisper struggled with the "base" TTS to predict the correct target words, which lack context in the phrase. However, the "clarity" TTS did not yield the WER improvements that we saw in the L2 participants. Instead, slowing down the TTS overall decreased the WER on the target words. Yet while the WER on the target words decreased with this slowing down, the proportion of target-word errors stemming from minimal pair substitutions was much higher for the "base" TTS, and the overall difference in WER, both on the whole phrase and on the target words, was within 3% for all TTS styles. Therefore, the ASR does not use the same duration mechanisms as humans when facing words that are difficult to predict.
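To make the two metrics concrete, here is a minimal Python sketch of an overall WER and a target-word-only error rate computed from a word-level Levenshtein alignment. It is an illustration under stated assumptions, not the evaluation script used in the study: the function names align, wer and target_wer are hypothetical, text normalization is reduced to lowercasing, and the hypothesis string stands in for whatever an ASR system such as Whisper would return.

```python
def align(ref, hyp):
    """Levenshtein-align two word lists.

    Returns (ref_word or None, hyp_word or None) pairs; None marks an
    insertion or deletion.
    """
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match / substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    # Backtrace from the bottom-right corner to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    return pairs[::-1]


def wer(reference, hypothesis):
    """Overall WER: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    errors = sum(r != h for r, h in align(ref, hyp))
    return errors / len(ref)


def target_wer(reference, hypothesis, targets):
    """Error rate restricted to reference words listed in `targets`."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    scored = [(r, h) for r, h in align(ref, hyp) if r in targets]
    return sum(r != h for r, h in scored) / len(scored)


# Example phrase from this page, with a made-up ASR output in which
# "sin" is replaced by its minimal-pair partner "scene".
reference = "the sign mentioned sin but the person said scene"
hypothesis = "the sign mentioned scene but the person said scene"
overall = wer(reference, hypothesis)
on_targets = target_wer(reference, hypothesis, {"sin", "scene"})
print(overall, on_targets)
```

Here a single minimal-pair substitution costs one error in nine words overall but one in two on the target words, which is why the two rates are worth reporting separately.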

Acknowledgements

This work was supported by the Simon Fraser University FASS Breaking Barriers Interdisciplinary Incentive Grant, the Social Sciences and Humanities Research Council of Canada Grant (SSHRC Insight Grant 435–2019–1065), and the NSERC Discovery Grant (RGPIN-2024-06519). The authors thank Paul Maublanc for always being our first pilot French speaker, as well as the Rajan Family for their support.