Poster
$\mathcal{D}$-Attn: Decomposed Attention for Large Vision-and-Language Model
Chia-Wen Kuo · Sijie Zhu · Fan Chen · Xiaohui Shen · Longyin Wen
Abstract:
Large vision-and-language models (LVLMs) have traditionally integrated visual and textual tokens by concatenating them into a single homogeneous input for large language models (LLMs), thereby maximally preserving the pre-trained language capabilities. However, this constraint on visual and textual tokens restricts the design space for processing visual tokens, potentially leading to suboptimal performance and efficiency. In this paper, we propose Decomposed Attention ($\mathcal{D}$-Attn), a more flexible attention architecture for LVLMs that enables modification of visual token operations without affecting textual-to-textual attention. $\mathcal{D}$-Attn decomposes the 1-D causal self-attention of LVLMs into visual-to-visual, textual-to-visual, and textual-to-textual attentions, and the visual and textual output tokens from the decomposed attentions are merged with a carefully derived weighting strategy, termed $\alpha$-weighting. Taking advantage of this flexibility, we introduce two critical improvements to visual token processing while maintaining the capacity of the pre-trained LLM: 1) we rectify the biased positional encoding in textual-to-visual attention to boost visual understanding performance; 2) we diagonalize visual-to-visual attention to reduce computational complexity from $\mathcal{O}(|V|^2)$ to $\mathcal{O}(|V|)$ for $|V|$ visual tokens without compromising performance. Extensive experiments and analyses validate the effectiveness of $\mathcal{D}$-Attn, demonstrating significant improvements on multiple image benchmarks while substantially reducing computational cost (e.g., $5\times$ faster). Code, data, and models will be publicly available.
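To make the decomposition concrete, the following is a minimal single-head PyTorch sketch of how the three attentions and the merge could be wired up. It is an illustration under stated assumptions, not the authors' implementation: the function name decomposed_attention is hypothetical, the positional-encoding rectification is omitted, and the $\alpha$ weights are computed here from the softmax normalizers of the textual-to-visual and textual-to-textual scores, one plausible instantiation under which the merged textual output equals standard causal attention over the concatenated sequence; the paper's exact $\alpha$-weighting derivation may differ.

# Hypothetical sketch of decomposed attention for one head; names are illustrative.
import torch
import torch.nn.functional as F

def decomposed_attention(q_v, k_v, v_v, q_t, k_t, v_t):
    """q_v/k_v/v_v: (Nv, d) visual tokens; q_t/k_t/v_t: (Nt, d) textual tokens."""
    d = q_v.size(-1)
    scale = d ** -0.5

    # 1) Visual-to-visual attention, diagonalized: each visual token attends only
    #    to itself, so its output is just its own value -- O(|V|) instead of O(|V|^2).
    out_v = v_v

    # 2) Textual-to-visual attention (no causal mask needed: in the usual LVLM
    #    layout all visual tokens precede the textual tokens).
    s_tv = (q_t @ k_v.T) * scale                          # (Nt, Nv)

    # 3) Textual-to-textual attention with the standard causal mask.
    s_tt = (q_t @ k_t.T) * scale                          # (Nt, Nt)
    causal = torch.ones_like(s_tt).triu(diagonal=1).bool()
    s_tt = s_tt.masked_fill(causal, float("-inf"))

    # 4) Alpha-weighted merge: weight each partial textual output by its share of
    #    the combined softmax normalizer (assumed form of the alpha-weighting).
    z_tv = torch.logsumexp(s_tv, dim=-1, keepdim=True)    # log normalizer over visual keys
    z_tt = torch.logsumexp(s_tt, dim=-1, keepdim=True)    # log normalizer over textual keys
    alpha = torch.sigmoid(z_tv - z_tt)                    # = exp(z_tv) / (exp(z_tv) + exp(z_tt))
    out_tv = F.softmax(s_tv, dim=-1) @ v_v                # textual-to-visual partial output
    out_tt = F.softmax(s_tt, dim=-1) @ v_t                # textual-to-textual partial output
    out_t = alpha * out_tv + (1.0 - alpha) * out_tt

    return out_v, out_t

In this sketch the textual-to-textual path is untouched, which reflects the stated goal of preserving the pre-trained LLM's language capabilities, while the visual-to-visual branch is the only place the quadratic $|V|^2$ cost would arise; diagonalizing it is what removes that term.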