Track What You Need in Videos via Text Prompts. A text-driven object tracking framework that lets users specify tracking targets using natural language.
Bridging language and vision for intuitive, prompt-based video object tracking.