Tokenizing Nonverbal Communication in Salsa Dance

Bermet Burkanova*,1 Payam Jome Yazdian*,1 Chuxuan Zhang,1 Trinity Evans,1 Paige Tuttösí,1,2 Angelica Lim1
1School of Computing Science, Simon Fraser University, Canada
2Enchanted Tools, France

*Indicates Equal Contribution

Abstract

Partner dance offers a compelling testbed for studying tokenization in multimodal, bidirectional communication. In salsa, a lifted hand may signal a turn; musical accents may shape both dancers' motion. These interactions are continuous and improvisational, yet hinge on discrete, interpretable cues—gestures, beats, and movement segments—that can be modeled as tokens. In this paper, we introduce a language model and tokenization framework for social dance, treating salsa as a form of embodied dialogue grounded in motion, music, and role-based interaction. To support this, we present CoMPAS3D, a large-scale motion capture dataset of improvised salsa dancing, capturing over 3 hours of leader-follower interaction across three skill levels. The dataset includes frame-level annotations of moves, styling, and execution errors, created through over 120 hours of expert effort. We use tokens as a foundation for generative and classification tasks, including follower motion prediction and move recognition, demonstrating the utility of token-based models for interactive, expressive virtual agents.
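
To make the notion of motion tokenization concrete, the sketch below shows one common way continuous pose frames can be mapped to discrete token ids: nearest-neighbor assignment against a learned codebook (as in vector-quantized motion models). All names, dimensions, and the random codebook here are illustrative assumptions, not the paper's actual tokenizer, which is learned from the CoMPAS3D data.

```python
import numpy as np

def tokenize_motion(frames, codebook):
    """Map each pose frame to the id of its nearest codebook entry.

    frames:   (T, D) array of per-frame pose features (illustrative)
    codebook: (K, D) array of quantization centroids (illustrative)
    returns:  (T,) array of integer token ids in [0, K)
    """
    # Squared Euclidean distance from every frame to every codebook entry
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy demonstration: frames drawn near known codebook entries
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))                      # 8 tokens, 4-D features
frames = codebook[[2, 2, 5, 7]] + 0.01 * rng.normal(size=(4, 4))
tokens = tokenize_motion(frames, codebook)
print(tokens.tolist())  # → [2, 2, 5, 7]
```

Once motion is discretized this way, the resulting token sequences can be fed to standard sequence models for tasks like follower motion prediction or move classification.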