EmojiVoice: Towards long-term controllable expressivity in robot speech

Paige Tuttösí,1,2,3 Shivam Mehta,4 Zachary Syvenky,1 Bermet Burkanova,1 Gustav Eje Henter,4 Angelica Lim1
1School of Computing Science, Simon Fraser University, Canada
2Enchanted Tools, France
4Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Sweden
RO-MAN 2025
Code arXiv Supplementary

Abstract

Humans vary their expressivity when speaking for extended periods to maintain engagement with their listener. Although social robots tend to be deployed with “expressive” joyful voices, they lack this long-term variation found in human speech. Foundation model text-to-speech systems are beginning to mimic the expressivity in human speech, but they are difficult to deploy offline on robots. We present EmojiVoice, a free, customizable text-to-speech (TTS) toolkit that allows social roboticists to build temporally variable, expressive speech on social robots. We introduce emoji-prompting to allow fine-grained control of expressivity on a phrase level and use the lightweight Matcha-TTS backbone to generate speech in real-time. We explore three case studies: (1) a scripted conversation with a robot assistant, (2) a storytelling robot, and (3) an autonomous speech-to-speech interactive agent. We found that using varied emoji prompting improved the perception and expressivity of speech over a long period in a storytelling task, but the expressive voice was not preferred in the assistant use case.

HRI 2025 Demo Video

TTS Toolbox

Miroka speaking with emojis

We propose a free, open-source toolbox with documentation that allows social roboticists to use emoji prompting to easily deploy an expressive TTS offline and customize the voices to their own use cases. The toolbox extends Matcha-TTS for ease of use by HRI researchers and includes:

1) Training file setup: examples, raw data, and 3 checkpoints (with and without optimizers),
2) Additional information on the amount of data needed to fine-tune,
3) Scripts to record the data,
4) Wrappers to parse emojis in text to prompt the voices (a usage sketch is shown below), and
5) A conversational agent.

Our toolbox, data, and voices are available free and open source.
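As a rough illustration of the emoji-parsing wrappers, the sketch below (function names, variable names, and the emoji-to-style mapping are hypothetical, not the toolkit's actual API) splits input text into (style, phrase) pairs so that each phrase can be synthesized with a matching expressive voice:

```python
# Hypothetical sketch of an emoji-parsing wrapper; names and the emoji-to-style
# mapping are illustrative, not the toolkit's actual API.
import re

# Emoji styles this sketch assumes were recorded and fine-tuned.
EMOJI_STYLES = {"😍": "adoration", "😡": "anger", "🤔": "pondering", "😴": "sleepy"}

def split_by_emoji(text: str, default_style: str = "neutral"):
    """Split text into (style, phrase) pairs, one pair per emoji-prefixed phrase."""
    pattern = "(" + "|".join(map(re.escape, EMOJI_STYLES)) + ")"
    pairs, style = [], default_style
    for segment in re.split(pattern, text):
        if segment in EMOJI_STYLES:
            style = EMOJI_STYLES[segment]   # switch style for the next phrase
        elif segment.strip():
            pairs.append((style, segment.strip()))
    return pairs

print(split_by_emoji("😍 What a lovely day! 🤔 Although it might rain later."))
# -> [('adoration', 'What a lovely day!'), ('pondering', 'Although it might rain later.')]
```

In the toolkit itself, the parsed emoji would select the corresponding fine-tuned voice before each phrase is passed to Matcha-TTS.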

Emoji Prompting

We suggest emojis as a means to prompt expressive styles that can be used for temporal variational control, i.e., selecting a different emoji style for each generated phrase. The standard Unicode emoji set is especially useful because its emojis can express emotions: 😍, 😡, attitudes: 😎, 👍, mental states: 🤔, 🙄, and bodily states: 🥶, 😴.
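As a minimal sketch of what phrase-level temporal variation could look like in practice (the phrases, emoji choices, and helper below are illustrative, not taken from the paper), each phrase of a longer utterance is prefixed with its own emoji rather than rendering everything in a single style:

```python
# Illustrative sketch of phrase-level temporal variation: each phrase gets its
# own emoji prompt instead of one fixed style for the whole story.
import random

PHRASES = [
    "Once upon a time there was a little robot.",
    "It dreamed of telling stories to children.",
    "But one stormy night, its battery ran low.",
    "Luckily, a kind engineer plugged it back in.",
]
EMOJIS = ["🙂", "😍", "😨", "😌"]  # hand-picked here; could also be sampled

def emoji_prompted(phrases, emojis=None):
    """Prefix each phrase with an emoji so the TTS front end switches style per phrase."""
    if emojis is None:
        emojis = [random.choice(EMOJIS) for _ in phrases]
    return [f"{e} {p}" for e, p in zip(emojis, phrases)]

for line in emoji_prompted(PHRASES, EMOJIS):
    print(line)   # e.g. "🙂 Once upon a time there was a little robot."
```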

Examples of emoji in text and emoji in prompt.

Model Training

Matcha-TTS is excellent for social robotics applications as it requires few resources to train and can be used to create custom voices for specific use cases. Our model was fine-tuned with only 50 phrases (about 3 minutes of audio) per emoji. Fine-tuning ran for 85 epochs, taking approximately 20 minutes. It is also possible to add the custom data as a new voice alongside the original data (VCTK) and re-train from scratch for higher-quality audio, but this takes several hours.
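As an example of how the fine-tuning data might be organized (a sketch only: the directory layout, the `wav_path|speaker_id|text` filelist format, and the helper names are assumptions modelled on common Matcha-TTS/VITS-style setups, not the toolkit's exact specification), each emoji style can be treated as its own speaker ID when building the training filelist:

```python
# Sketch of building a training filelist where each emoji style is treated as a
# separate speaker ID; paths, format, and names are assumptions, not the
# toolkit's exact specification.
from pathlib import Path

DATA_ROOT = Path("data/emoji_voices")        # e.g. data/emoji_voices/happy/0001.wav
TRANSCRIPTS = Path("data/transcripts.txt")   # one "style|utt_id|text" line per clip
STYLE_TO_SPEAKER = {"happy": 0, "angry": 1, "sleepy": 2}  # illustrative IDs

def write_filelist(out_path: str = "filelists/emoji_train.txt") -> None:
    lines = []
    for raw in TRANSCRIPTS.read_text(encoding="utf-8").splitlines():
        style, utt_id, text = raw.split("|", maxsplit=2)
        wav = DATA_ROOT / style / f"{utt_id}.wav"
        lines.append(f"{wav}|{STYLE_TO_SPEAKER[style]}|{text}")
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")

if __name__ == "__main__":
    write_filelist()
```

Treating each style as a separate speaker is one simple way to reuse a multi-speaker checkpoint; the toolbox documentation and training file setup describe the exact procedure.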

model architecture

Results

We found that, through phrase-by-phrase emoji prompting, we were able to create a variable, expressive voice that improved impressions over a singularly expressive “joyful” voice in the longer storytelling use case. However, in a line-by-line assistant interaction, varied expressivity was not preferred. As such, researchers need to carefully consider all aspects of the task and interaction context when selecting and designing their robot voices.

results

Acknowledgements

We would like to thank Mohammed Hafsati and Waldez Gomes for their help in running the Miroka experiments, Paul Maublanc for listening endlessly to our voices, Jean-Julien Aucouturier for his useful feedback, and the Rajan Family for their support.