

Poster

Representation Shift: Unifying Token Compression with FlashAttention

Joonmyung Choi · Sanghyeok Lee · Byungoh Ko · Eunseo Kim · Jihyung Kil · Hyunwoo Kim


Abstract: Transformers have demonstrated remarkable success across various vision tasks, yet the quadratic complexity of self-attention remains a challenge for efficient inference. To address this, previous works such as FlashAttention optimize GPU memory access, while token compression techniques reduce computational cost by reducing the number of tokens. However, conventional token importance measures rely on additional learnable modules or attention maps, making them impractical in training-free settings and incompatible with FlashAttention, which never materializes intermediate attention maps in order to minimize memory access. Here, we propose a novel training-free, model-agnostic token importance criterion, representation shift, which quantifies the information injected by each operation. With representation shift, token compression can be applied on top of FlashAttention to further boost inference speed without requiring additional training or attention maps. The method also extends naturally beyond Transformers, e.g., to convolutional neural networks (CNNs). Extensive experiments demonstrate that representation shift, enabling token compression with FlashAttention and CNNs, yields up to a 5.5$\times$ speed-up in video understanding. Through quantitative and qualitative experiments, we show that representation shift is a more robust alternative to conventional attention-based scores.
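As a rough illustration of the idea described in the abstract (not the authors' released implementation), the sketch below scores each token by how much an operation changes its representation and keeps the highest-scoring tokens. The L2-norm score, the top-k keep policy, and the `keep_ratio` parameter are all illustrative assumptions; PyTorch's `F.scaled_dot_product_attention` stands in for a FlashAttention-style fused kernel, since it likewise never exposes the attention map.

```python
# Minimal sketch of representation-shift-based token compression,
# under the assumptions stated above.
import torch
import torch.nn.functional as F

def representation_shift(x_in: torch.Tensor, x_out: torch.Tensor) -> torch.Tensor:
    """Per-token shift ||f(x) - x||_2; returns shape (batch, tokens)."""
    return (x_out - x_in).norm(dim=-1)

def compress_tokens(x: torch.Tensor, attn, keep_ratio: float = 0.5) -> torch.Tensor:
    """Apply an operation, then drop the tokens it changed the least.

    `attn` is any callable mapping (B, N, D) -> (B, N, D); no attention
    map is needed, so fused kernels (or CNN blocks) work unchanged.
    """
    y = attn(x)                                    # (B, N, D)
    scores = representation_shift(x, y)            # (B, N)
    k = max(1, int(keep_ratio * x.shape[1]))
    idx = scores.topk(k, dim=1).indices            # most-shifted tokens
    idx = idx.sort(dim=1).values                   # preserve token order
    idx = idx.unsqueeze(-1).expand(-1, -1, y.shape[-1])
    return y.gather(1, idx)                        # (B, k, D)

# Usage: single-head self-attention via the fused kernel.
B, N, D = 2, 196, 64
x = torch.randn(B, N, D)
attn = lambda t: F.scaled_dot_product_attention(t, t, t)
out = compress_tokens(x, attn, keep_ratio=0.5)
print(out.shape)  # torch.Size([2, 98, 64])
```

Because the score depends only on the operation's input and output, the same criterion applies to any layer whose intermediate states are inaccessible, which is what makes it compatible with FlashAttention and with CNNs.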
