Hugging Face Daily Papers · June 2, 2026 · 7 min read

Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2605.30608

Published on Jun 1, 2026 · Submitted by Mahdi Abootorabi on Jun 2, 2026

Authors: Varsha Suresh, Mohammad Mahdi Abootorabi, Mohamed Salman, M. Hamza Mughal, Christian Theobalt, Ashwin Ram, Jürgen Steimle, Vera Demberg

Abstract

This paper presents a deep learning approach to co-speech gesture retrieval that uses semantic motion anchors to improve alignment between spoken-text and gesture representations, improving both retrieval accuracy and semantic relevance.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but it remains challenging for semantically meaningful gestures, whose communicative intent is not captured by motion alone. Direct contrastive alignment between transcripts and continuous motion embeddings often overemphasizes low-level kinematics and misses the symbolic content of semantic gestures. We propose semantic motion anchors, natural-language abstractions of gesture motion that capture physical form and communicative intent. Our method discretizes 3D gestures into body-hand motion primitives, verbalizes them into structured descriptions, and grounds them in the transcript to provide auxiliary contrastive supervision. On BEAT2, our method improves text-to-gesture R@1 by 8.2% (relative) over a direct text-motion baseline and outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text directions. Beyond aggregate retrieval metrics, semantic motion anchor supervision helps retrieve gestures that are semantically meaningful for the spoken query, rather than defaulting to generic motion patterns. In a downstream retrieval-augmented gesture generation study, users significantly preferred gestures retrieved by our approach over those from a retrieval-augmented generation baseline, indicating that semantically grounded retrieval translates to gestures that better convey communicative intent.
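For readers less familiar with the retrieval metric, the sketch below shows how Recall@K is commonly computed from a query-by-candidate similarity matrix, and how a relative gain is derived from two scores. The function names and the toy similarity matrix are illustrative assumptions, not the paper's evaluation code; the only figures taken from this page are the R@1 values of 39.1 and 42.3 quoted in the community post.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the ground-truth match for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending score.
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Toy 3x3 similarity matrix with made-up numbers (not from the paper).
toy_scores = [
    [0.9, 0.1, 0.2],  # query 0: correct item ranked first
    [0.8, 0.3, 0.1],  # query 1: correct item ranked second
    [0.2, 0.1, 0.7],  # query 2: correct item ranked first
]
r_at_1 = recall_at_k(toy_scores, 1)
r_at_2 = recall_at_k(toy_scores, 2)

# R@1 values quoted in the community post: 39.1 (baseline) and 42.3 (ours).
gain = relative_gain(39.1, 42.3)
print(f"toy R@1={r_at_1:.3f}, toy R@2={r_at_2:.3f}, reported gain={gain:.1f}%")
```

The relative-gain helper makes explicit that the 8.2% figure is a ratio of the two R@1 scores, not a difference in percentage points (which would be 3.2).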


Community

aboots (paper submitter) · Jun 2, 2026

We're excited to share Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures! If you've ever wondered why text-to-gesture retrieval models keep defaulting to generic beat gestures and miss the meaning behind semantically rich co-speech gestures, this paper is for you.

The core problem: directly contrasting transcripts with continuous motion embeddings overemphasizes low-level kinematics and washes out the symbolic content of semantic gestures, which are sparse and live in the long tail of human motion. Our fix is semantic motion anchors: structured natural-language abstractions that re-express gesture motion in terms of physical form (handedness, spatial position, trajectory, hand configuration) and communicative intent (listing, self-reference, uncertainty, and more).
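To make the two halves of an anchor concrete, here is a minimal sketch of how one could be represented as a record. The class name, field names, sentence templates, and example values are all hypothetical and are not taken from the paper; only the four physical-form attributes and the example intent labels come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticMotionAnchor:
    """Hypothetical container for one anchor (not the authors' schema)."""

    # Physical form: the four attributes named in the post.
    handedness: str
    spatial_position: str
    trajectory: str
    hand_configuration: str
    # Communicative intent, e.g. "listing", "self-reference", "uncertainty".
    intent: str

    def physical_form_text(self) -> str:
        """Render the physical-form half as one description string."""
        return (
            f"{self.handedness} hand, {self.spatial_position}, "
            f"{self.trajectory}, {self.hand_configuration}"
        )

    def intent_text(self) -> str:
        """Render the communicative-intent half as one sentence."""
        return f"The speaker is expressing {self.intent}."


# Invented example values, for illustration only.
example = SemanticMotionAnchor(
    handedness="right",
    spatial_position="at chest height",
    trajectory="moving toward the torso",
    hand_configuration="open palm",
    intent="self-reference",
)
print(example.physical_form_text())
print(example.intent_text())
```

Keeping the two texts separate mirrors the split the post describes between what the hands do and what the speaker means.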

Our work covers:

A three-stage anchor pipeline: We discretize 3D gestures into body–hand motion primitives with a two-stream RVQ-VAE, deterministically verbalize each primitive into structured physical-form descriptions, and ground them in the transcript via LLM structured reasoning to produce semantic motion anchors.
Anchor-supervised contrastive learning: A modality-matched framework that routes physical-form descriptions to the motion branch and intent descriptions to the transcript branch, used as auxiliary supervision during training and discarded at inference.
Strong empirical gains: On BEAT2, we improve text-to-gesture R@1 from 39.1 to 42.3 (an 8.2% relative gain over a direct text-motion baseline) and outperform GestureDiffuCLIP, TMR, and JEGAL on both retrieval directions, with gains concentrated at the top rank where it matters most.
A new dataset (SEMANTIX): 878 human-annotated TED and BEAT2 clips with gold descriptions of physical form and communicative intent, for evaluating semantic gesture understanding.
Downstream impact: In a perceptual user study, participants significantly preferred gestures retrieved by our approach over RAG-Gesture (72.2% vs. 27.8%, p < 0.0001), showing that semantically grounded retrieval translates into gestures that better convey communicative intent.
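The anchor-supervised training idea in the list above can be sketched as a main text-motion contrastive term plus two auxiliary terms, one per branch. This is a rough illustration under stated assumptions: the post does not give the loss formula, so the symmetric InfoNCE form, the temperature of 0.07, the `aux_weight` parameter, and every function name here are hypothetical stand-ins, and the toy vectors are invented.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """One-directional InfoNCE: queries[i] is the positive for keys[i]."""
    queries = [_normalize(q) for q in queries]
    keys = [_normalize(k) for k in keys]
    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarities to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        # Numerically stable log-sum-exp over all keys.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def symmetric_info_nce(a, b, temperature=0.07):
    """Average of the a-to-b and b-to-a InfoNCE terms."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def anchor_supervised_loss(text, motion, form_desc, intent_desc, aux_weight=0.5):
    """Main text-motion loss plus modality-matched auxiliary terms.

    Assumed combination (not confirmed by the post): physical-form
    descriptions supervise the motion branch, intent descriptions
    supervise the transcript branch, and both are weighted by aux_weight.
    """
    main = symmetric_info_nce(text, motion)
    aux_motion = symmetric_info_nce(motion, form_desc)
    aux_text = symmetric_info_nce(text, intent_desc)
    return main + aux_weight * (aux_motion + aux_text)


# Toy 2-D embeddings for a batch of two clips (invented values).
text_emb = [[1.0, 0.1], [0.1, 1.0]]
motion_emb = [[0.9, 0.2], [0.2, 0.9]]
form_emb = [[1.0, 0.0], [0.0, 1.0]]
intent_emb = [[0.8, 0.1], [0.0, 0.7]]

loss = anchor_supervised_loss(text_emb, motion_emb, form_emb, intent_emb)
print(f"training loss with anchors: {loss:.4f}")
```

Because the anchor descriptions enter only through the auxiliary terms, retrieval at inference would need just the text and motion encoders, which is consistent with the post's note that anchors are discarded at inference.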

We will release our code and models soon: stay tuned!

We invite you to read, share, and build on our work as we push toward gestures that mean what they move. Let's start a conversation!


Get this paper in your agent:

hf papers read 2605.30608

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
