
Poster

Always skip connection

Yiping Ji · Hemanth Saratchandran · Peyman Moghadam · Simon Lucey


Abstract: We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT, which continue to exhibit good (albeit suboptimal) performance when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with earlier deep architectures (e.g., CNNs) performing well in their absence. In this paper, we theoretically characterize the self-attention mechanism as fundamentally ill-conditioned, and therefore uniquely dependent on skip connections for regularization. Additionally, we propose Token Graying (TG) -- a simple yet effective complement to skip connections that further improves the conditioning of input tokens. We validate our approach under both supervised and self-supervised training.
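
The following is a minimal numerical sketch (our own illustration, not code from the paper) of the conditioning claim: under a typical small-variance initialization, the softmax attention matrix is close to uniform, so the attention output collapses toward rank one and its condition number explodes, whereas adding the input back through a skip connection keeps the token matrix roughly as well-conditioned as the input. The token count, width, and initialization scale are assumptions chosen for illustration, and Token Graying itself is not reproduced here.

```python
import torch

torch.manual_seed(0)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a token matrix x of shape (n_tokens, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

def cond(m):
    """Condition number: ratio of largest to smallest singular value."""
    s = torch.linalg.svdvals(m)
    return (s[0] / s[-1]).item()

# Illustrative sizes and initialization scale (assumptions, not the paper's setup).
n_tokens, d = 16, 64
x = torch.randn(n_tokens, d)                                   # random input tokens
w_q, w_k, w_v = (0.02 * torch.randn(d, d) for _ in range(3))   # small-std init, typical for ViTs

out_no_skip = self_attention(x, w_q, w_k, w_v)    # attention output alone
out_skip = x + self_attention(x, w_q, w_k, w_v)   # with a skip connection

print(f"cond(input tokens)      : {cond(x):.2e}")
print(f"cond(attention, no skip): {cond(out_no_skip):.2e}")   # near rank-1, very large
print(f"cond(attention + skip)  : {cond(out_skip):.2e}")      # close to the input's conditioning
```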
