Poster

From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations

Anthony Bisulco · Rahul Ramesh · Randall Balestriero · Pratik Chaudhari


Abstract:

Masked Autoencoders (MAEs) have emerged as a powerful pretraining technique for vision foundation models. Despite their effectiveness, they require extensive hyperparameter tuning across factors such as the masking ratio, patch size, and number of encoder and decoder layers as researchers adapt them to different applications. While prior theoretical work has analyzed MAEs through the lens of attention patterns and hierarchical latent variable models, the connection between MAE hyperparameters and downstream-task performance remains relatively unexplored. In this work, we investigate the perspective that "MAEs learn spatial correlations in the input image". We analytically derive the features learned by a linear MAE and show that the masking ratio and patch size can be used to select between features capturing short- and long-range spatial correlations. Extending this analysis to nonlinear MAEs, we show that learned representations adapt to spatial correlations in the dataset beyond second-order statistics. Finally, we discuss insights on how to select MAE hyperparameters in practice.
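To make the setting concrete, below is a minimal sketch of the kind of linear MAE the abstract refers to: a linear encoder and decoder trained to reconstruct randomly masked patches under a squared-error loss. The class name, the per-patch masking scheme, and the dimensions (num_patches, patch_dim, latent_dim, mask_ratio) are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Minimal linear MAE sketch (illustrative; not the paper's code).
import torch
import torch.nn as nn


class LinearMAE(nn.Module):
    def __init__(self, num_patches: int, patch_dim: int, latent_dim: int):
        super().__init__()
        d = num_patches * patch_dim
        # Both maps are linear, so the learned features can be studied
        # analytically in terms of the data's spatial correlations.
        self.encoder = nn.Linear(d, latent_dim, bias=False)
        self.decoder = nn.Linear(latent_dim, d, bias=False)

    def forward(self, patches: torch.Tensor, mask_ratio: float = 0.75):
        # patches: (batch, num_patches, patch_dim), each row a flattened patch.
        b, n, p = patches.shape
        # Independently keep each patch with probability (1 - mask_ratio);
        # masked patches are zeroed out before encoding.
        keep = torch.rand(b, n, device=patches.device) > mask_ratio
        masked = patches * keep.unsqueeze(-1)
        recon = self.decoder(self.encoder(masked.reshape(b, -1)))
        recon = recon.reshape(b, n, p)
        # Squared-error loss on the masked patches only (assumes mask_ratio > 0
        # so at least some patches are typically masked).
        loss = ((recon - patches) ** 2)[~keep].mean()
        return recon, loss


if __name__ == "__main__":
    mae = LinearMAE(num_patches=16, patch_dim=48, latent_dim=32)
    recon, loss = mae(torch.randn(8, 16, 48), mask_ratio=0.75)
    loss.backward()  # train the linear encoder/decoder by gradient descent
```

In this setup, mask_ratio and the patch geometry are the knobs the abstract discusses: they determine which spatial correlations the reconstruction task forces the linear features to capture.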
