Poster

CARIM: Caption-Based Autonomous Driving Scene Retrieval via Inclusive Text Matching

Minjoo Ki · Dae Jung Kim · Kisung Kim · Seon Joo Kim · Jinhan Lee


Abstract:

Text-to-video retrieval serves as a powerful tool for navigating vast video databases. This is particularly useful in autonomous driving, where scenes retrieved from a text query can be used to simulate and evaluate the driving system in desired scenarios. However, traditional ranking-based retrieval methods often return partial matches that do not satisfy all query conditions. To address this, we introduce Inclusive Text-to-Video Retrieval, which retrieves only videos that meet all specified conditions, regardless of additional irrelevant elements. We propose CARIM, a framework for driving scene retrieval that employs inclusive text matching. By using a Vision-Language Model (VLM) and a Large Language Model (LLM) to generate compressed captions for driving scenes, we transform text-to-video retrieval into a more efficient text-to-text retrieval problem, eliminating modality mismatches and heavy annotation costs. We introduce a novel positive and negative data curation strategy and an attention-based scoring mechanism tailored for driving scene retrieval. Experimental results on the DRAMA dataset demonstrate that CARIM outperforms state-of-the-art retrieval methods, excelling in edge cases where traditional models fail.
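
To make the inclusive matching criterion concrete, here is a minimal, hypothetical Python sketch. It is not the authors' implementation: `extract_conditions` is a stand-in for the VLM/LLM captioning pipeline, and the subset check stands in for CARIM's learned attention-based scorer. It only illustrates the retrieval semantics, namely that a video is returned when its caption covers every query condition, even if the caption contains extra, irrelevant details.

```python
# Toy illustration of inclusive text matching (assumed sketch, not CARIM's code).
# A video matches a query iff every condition in the query is covered by the
# video's caption; extra caption conditions do not disqualify the video.

def extract_conditions(caption: str) -> set[str]:
    """Hypothetical helper: treat comma-separated phrases as atomic conditions.
    CARIM instead derives compressed captions with a VLM and an LLM."""
    return {c.strip().lower() for c in caption.split(",") if c.strip()}

def inclusive_match(query: str, video_caption: str) -> bool:
    """True iff every query condition appears among the caption's conditions."""
    return extract_conditions(query) <= extract_conditions(video_caption)

captions = {
    "clip_001": "rainy night, pedestrian crossing, ego vehicle braking",
    "clip_002": "rainy night, clear intersection",
}
query = "rainy night, pedestrian crossing"

hits = [cid for cid, cap in captions.items() if inclusive_match(query, cap)]
print(hits)  # ['clip_001'] -- clip_002 satisfies only one condition, so a
             # ranking-based retriever might still return it, but the
             # inclusive criterion excludes such partial matches.
```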
