Poster Exhibit Hall I #291

AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs

Sanjoy Chowdhury ⋅ Hanan Gani ⋅ Nishit Anand ⋅ Sayan Nag ⋅ Ruohan Gao ⋅ Mohamed Elhoseiny ⋅ Salman Khan ⋅ Dinesh Manocha

2025 Poster


Abstract

Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge. Evaluating 18 AVLLMs on AVReasonBench reveals significant limitations in their multi-modal reasoning capabilities. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness. This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications. Our code and data will be publicly released.
