

Poster

VideoSetBench: Identifying and Reasoning Similarities and Differences in Similar Videos

Yue Qiu · Yanjun Sun · Takuma Yagi · Shusaku Egami · Natsuki Miyata · Ken Fukuda · Kensho Hara · Ryusuke Sagawa


Abstract:

Recognizing subtle similarities and differences among sets of similar activities is central to many real-world applications, including skill acquisition, sports performance evaluation, and anomaly detection. Humans excel at such fine-grained analysis, which requires comprehensive video understanding and cross-video reasoning about action attributes, poses, positions, and emotional states. Yet existing video-based large language models typically address only single-video recognition, leaving their capacity for multi-video reasoning largely unexplored. We introduce VideoSetBench, a curated benchmark designed to test detail-oriented recognition across diverse activities, from subtle action attributes to viewpoint transitions. Our evaluation of current video-based LLMs on VideoSetBench reveals critical shortcomings, particularly in fine-grained detail recognition and multi-video reasoning. To mitigate these issues, we propose an automatically generated instruction-tuning dataset alongside a novel multi-video recognition framework. While instruction tuning and specialized multi-video reasoning improve performance, all tested models remain far from satisfactory. These findings underscore the need for more robust video-based LLMs capable of handling complex multi-video tasks, enabling diverse real-world applications.
