Hugging Face Daily Papers · May 22, 2026 · 4 min read

Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.22538

Published on May 21

Submitted by Voyage_Wang on May 22

Tsinghua University

Authors: Deyi Zhu, Yuji Wang, Yong Liu, Yansong Tang, Bingyao Yu, Jiwen Lu, Jie Zhou

Abstract

SAMOSA adapts SAM 2 for visual object tracking by incorporating motion prediction, semantic detection, and geometric constraints to improve robustness and generalization in complex scenarios.

AI-generated summary

Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2-based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at https://github.com/DurYi/SAMOSA.
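To make the idea of motion-guided mask selection more concrete, here is a minimal Python sketch. It is not SAMOSA's implementation: the abstract does not describe the motion predictor or the scoring rule, so everything below is an assumption made for illustration. The function names (`iou`, `extrapolate_box`, `select_mask`), the weight `alpha`, and the constant-acceleration extrapolation used as a stand-in for the paper's learned nonlinear predictor are all hypothetical. The sketch only shows how a predicted box could re-rank candidate masks that a model such as SAM 2 has already scored.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def extrapolate_box(history: Sequence[Box]) -> Box:
    """Stand-in motion predictor (an assumption, not the paper's model).

    With three or more past boxes, extrapolate each coordinate under
    constant acceleration: next = 3*c2 - 3*c1 + c0. With two boxes, use
    constant velocity. With one box, repeat it.
    """
    if len(history) >= 3:
        p0, p1, p2 = history[-3], history[-2], history[-1]
        return tuple(3 * c2 - 3 * c1 + c0 for c0, c1, c2 in zip(p0, p1, p2))
    if len(history) == 2:
        p1, p2 = history
        return tuple(2 * c2 - c1 for c1, c2 in zip(p1, p2))
    return tuple(history[-1])


def select_mask(
    candidates: List[Tuple[Box, float]],
    history: Sequence[Box],
    alpha: float = 0.5,
) -> int:
    """Return the index of the candidate with the best combined score.

    Each candidate is (bounding box of the mask, model confidence).
    The score mixes agreement with the predicted box and the confidence;
    alpha is a hypothetical weight.
    """
    predicted = extrapolate_box(history)
    scores = [
        alpha * iou(box, predicted) + (1.0 - alpha) * conf
        for box, conf in candidates
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Usage: a target accelerating to the right, plus a confident distractor.
history = [(0, 0, 10, 10), (2, 0, 12, 10), (6, 0, 16, 10)]
candidates = [
    ((40, 0, 50, 10), 0.9),  # distractor: high confidence, far from prediction
    ((12, 0, 22, 10), 0.7),  # target: lower confidence, matches prediction
]
best = select_mask(candidates, history)
```

In this toy case, ranking by confidence alone would pick the distractor, while the motion term favours the candidate that agrees with the extrapolated trajectory. The same predicted box could in principle also decide which frames are admitted to the memory bank, which is the second use of the predictor that the abstract mentions.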

Community

VoyageWang

Paper submitter about 8 hours ago

The project page is https://github.com/DurYi/SAMOSA

Get this paper in your agent:

hf papers read 2605.22538

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
