EmojiVoice: Towards long-term controllable expressivity in robot speech

Paige Tuttösí,1,2,3 Shivam Mehta,4 Zachary Syvenky,1 Bermet Burkanova,1 Gustav Eje Henter,4 Angelica Lim1
1School of Computing Science, Simon Fraser University, Canada
2Enchanted Tools, France
4Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Sweden
RO-MAN 2025
Code arXiv Supplementary

Abstract

Humans vary their expressivity when speaking for extended periods to maintain engagement with their listener. Although social robots tend to be deployed with “expressive” joyful voices, they lack this long-term variation found in human speech. Foundation model text-to-speech systems are beginning to mimic the expressivity in human speech, but they are difficult to deploy offline on robots. We present EmojiVoice, a free, customizable text-to-speech (TTS) toolkit that allows social roboticists to build temporally variable, expressive speech on social robots. We introduce emoji-prompting to allow fine-grained control of expressivity on a phrase level and use the lightweight Matcha-TTS backbone to generate speech in real-time. We explore three case studies: (1) a scripted conversation with a robot assistant, (2) a storytelling robot, and (3) an autonomous speech-to-speech interactive agent. We found that using varied emoji prompting improved the perception and expressivity of speech over a long period in a storytelling task, but the expressive voice was not preferred in the assistant use case.

HRI 2025 Demo Video

TTS Toolbox

Miroka speaking with emojis

We propose a free, open-source toolbox with documentation that allows social roboticists to use emoji prompting to easily deploy an expressive TTS offline and customize the voices to their own use cases. The toolbox extends Matcha-TTS for ease of use by HRI researchers and includes:

1) Training file setup: examples, raw data, and 3 checkpoints (with and without optimizers),
2) Additional information on the amount of data needed to fine-tune,
3) Scripts to record the data,
4) Wrappers to parse emojis in text to prompt the voices (a usage sketch is shown below), and
5) A conversational agent.

Our toolbox, data, and voices are available free and open source.
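As a rough illustration of the emoji-parsing wrappers, the sketch below (function names, variable names, and the emoji-to-style mapping are hypothetical, not the toolkit's actual API) splits input text into (style, phrase) pairs so that each phrase can be synthesized with a matching expressive voice:

```python
# Hypothetical sketch of an emoji-parsing wrapper; names and the emoji-to-style
# mapping are illustrative, not the toolkit's actual API.
import re

# Emoji styles this sketch assumes were recorded and fine-tuned.
EMOJI_STYLES = {"😍": "adoration", "😡": "anger", "🤔": "pondering", "😴": "sleepy"}

def split_by_emoji(text: str, default_style: str = "neutral"):
    """Split text into (style, phrase) pairs, one pair per emoji-prefixed phrase."""
    pattern = "(" + "|".join(map(re.escape, EMOJI_STYLES)) + ")"
    pairs, style = [], default_style
    for segment in re.split(pattern, text):
        if segment in EMOJI_STYLES:
            style = EMOJI_STYLES[segment]   # switch style for the next phrase
        elif segment.strip():
            pairs.append((style, segment.strip()))
    return pairs

print(split_by_emoji("😍 What a lovely day! 🤔 Although it might rain later."))
# -> [('adoration', 'What a lovely day!'), ('pondering', 'Although it might rain later.')]
```

In the toolkit itself, the parsed emoji would select the corresponding fine-tuned voice before each phrase is passed to Matcha-TTS.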

Emoji Prompting

We suggest emojis as a means to prompt expressive styles that can be used for temporal variational control, i.e., selecting a different emoji style for each generated phrase. The standard Unicode emoji set is especially useful because its emojis can express emotions: 😍, 😡, attitudes: 😎, 👍, mental states: 🤔, 🙄, and bodily states: 🥶, 😴.
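As a minimal sketch of what phrase-level temporal variation could look like in practice (the phrases, emoji choices, and helper below are illustrative, not taken from the paper), each phrase of a longer utterance is prefixed with its own emoji rather than rendering everything in a single style:

```python
# Illustrative sketch of phrase-level temporal variation: each phrase gets its
# own emoji prompt instead of one fixed style for the whole story.
import random

PHRASES = [
    "Once upon a time there was a little robot.",
    "It dreamed of telling stories to children.",
    "But one stormy night, its battery ran low.",
    "Luckily, a kind engineer plugged it back in.",
]
EMOJIS = ["🙂", "😍", "😨", "😌"]  # hand-picked here; could also be sampled

def emoji_prompted(phrases, emojis=None):
    """Prefix each phrase with an emoji so the TTS front end switches style per phrase."""
    if emojis is None:
        emojis = [random.choice(EMOJIS) for _ in phrases]
    return [f"{e} {p}" for e, p in zip(emojis, phrases)]

for line in emoji_prompted(PHRASES, EMOJIS):
    print(line)   # e.g. "🙂 Once upon a time there was a little robot."
```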

Examples of emoji in text and emoji in prompt.

Model Training

Matcha-TTS is excellent for social robotics applications as it requires few resources to train and can be used to create custom voices for specific use cases. Our model was fine-tuned with only 50 phrases (about 3 minutes of audio) per emoji. Fine-tuning ran for 85 epochs, taking approximately 20 minutes. It is also possible to add the custom data as a new voice alongside the original data (VCTK) and re-train from scratch for higher-quality audio, but this takes several hours.
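As an example of how the fine-tuning data might be organized (a sketch only: the directory layout, the `wav_path|speaker_id|text` filelist format, and the helper names are assumptions modelled on common Matcha-TTS/VITS-style setups, not the toolkit's exact specification), each emoji style can be treated as its own speaker ID when building the training filelist:

```python
# Sketch of building a training filelist where each emoji style is treated as a
# separate speaker ID; paths, format, and names are assumptions, not the
# toolkit's exact specification.
from pathlib import Path

DATA_ROOT = Path("data/emoji_voices")        # e.g. data/emoji_voices/happy/0001.wav
TRANSCRIPTS = Path("data/transcripts.txt")   # one "style|utt_id|text" line per clip
STYLE_TO_SPEAKER = {"happy": 0, "angry": 1, "sleepy": 2}  # illustrative IDs

def write_filelist(out_path: str = "filelists/emoji_train.txt") -> None:
    lines = []
    for raw in TRANSCRIPTS.read_text(encoding="utf-8").splitlines():
        style, utt_id, text = raw.split("|", maxsplit=2)
        wav = DATA_ROOT / style / f"{utt_id}.wav"
        lines.append(f"{wav}|{STYLE_TO_SPEAKER[style]}|{text}")
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")

if __name__ == "__main__":
    write_filelist()
```

Treating each style as a separate speaker is one simple way to reuse a multi-speaker checkpoint; the toolbox documentation and training file setup describe the exact procedure.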

model architecture

Results

We found that, through phrase-by-phrase emoji prompting, we were able to create a variable, expressive voice that improved impressions over a singularly expressive “joyful” voice in the longer storytelling use case. However, in a line-by-line assistant interaction, varied expressivity was not preferred. As such, researchers need to carefully consider all aspects of the task and interaction context when selecting and designing their robot voices.

results

Acknowledgements

We would like to thank Mohammed Hafsati and Waldez Gomes for their help in running the Miroka experiments, Paul Maublanc for listening endlessly to our voices, Jean-Julien Aucouturier for his useful feedback, and the Rajan Family for their support.