Masked Autoencoders

Masked Autoencoders are Scalable Vision Learners · Also known as: MAE, Vision MAE

Masked Autoencoders (MAE) is a self-supervised learning approach introduced by He et al. in 2021 that masks random patches of an image and trains a model to reconstruct the missing content. Adapting the masked language modeling paradigm from NLP to vision, MAE learns rich visual representations by solving a challenging reconstruction task without requiring labels.
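The random masking step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `random_masking`, the 224x224 image size, and the 16x16 patch size are assumptions chosen for the example, and patches are represented only by their indices.

```python
import random

def random_masking(num_patches, mask_ratio, seed=None):
    """Split patch indices into visible and masked sets (hypothetical helper)."""
    rng = random.Random(seed)
    ids = list(range(num_patches))
    rng.shuffle(ids)  # uniform random permutation of patch indices
    num_visible = int(num_patches * (1 - mask_ratio))
    visible = sorted(ids[:num_visible])   # patches the encoder would see
    masked = sorted(ids[num_visible:])    # patches the model must reconstruct
    return visible, masked

# Assumed setup: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
num_patches = (224 // 16) ** 2
visible_ids, masked_ids = random_masking(num_patches, mask_ratio=0.75, seed=0)
print(len(visible_ids), len(masked_ids))  # 49 147
```

Only the visible indices would be passed to the encoder; the masked indices mark the patches whose content the model is trained to reconstruct.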


When to use it

Masked Autoencoders suit self-supervised pre-training when labeled data is scarce or expensive to obtain. The representations they learn transfer well to downstream tasks such as classification, detection, and segmentation. Because pre-training needs no annotations, MAE is also an option when privacy or access constraints make labeling impractical. When a large labeled dataset already exists for the target task, supervised pre-training remains a reasonable baseline and may need less pre-training compute, although He et al. (2022) report that MAE pre-training can match or exceed supervised pre-training for large Vision Transformer models on ImageNet.

Strengths & limitations

Strengths

Enables learning from unlabeled image data at scale, making it practical for domains lacking annotations
Learns transferable representations that improve performance on downstream tasks when fine-tuned with limited labels
Asymmetric encoder-decoder design makes pre-training efficient, requiring less computation than symmetric architectures
Achieves competitive performance with supervised pre-training on ImageNet while using no labels

Limitations

Reconstruction loss may not always align with downstream task objectives, requiring careful fine-tuning
Mask ratio and masking strategy are critical hyperparameters that require tuning for different domains
Pre-trained models may not transfer well to tasks with drastically different distributions from pre-training data

Frequently asked

How does masking in vision differ from masking in NLP?

In NLP, tokens are discrete words or subword units, and the model predicts a masked token from a fixed vocabulary. In vision, patches are spatial regions of continuous pixel values, so MAE reconstructs pixels rather than predicting discrete tokens. Images are also spatially redundant: a missing patch can often be inferred from its neighbours. Vision masking therefore uses much higher mask ratios (60-75%) to keep the task challenging, and masking 75% of patches still leaves enough visible context. BERT-style masked language modeling, by contrast, typically masks about 15% of tokens.
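Reconstruction of continuous pixel values is usually scored with a regression loss rather than a classification loss. The sketch below computes a mean squared error over the masked patches only, which is how He et al. (2022) describe their loss; the function name `masked_mse` and the toy two-pixel patches are assumptions made for illustration.

```python
def masked_mse(pred, target, masked_ids):
    """Mean squared error over masked patches only (hypothetical helper).

    pred, target: lists of patches, each patch a list of float pixel values.
    masked_ids: indices of the patches that were hidden from the encoder.
    """
    total, count = 0.0, 0
    for i in masked_ids:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("masked_ids selects no pixels")
    return total / count

# Toy data: 4 patches of 2 pixels each. Patch 0 is visible, so its
# (deliberately large) error does not contribute to the loss.
target = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0], [0.2, 0.8]]
pred = [[9.0, 9.0], [0.5, 0.0], [1.0, 0.0], [0.2, 0.8]]
loss = masked_mse(pred, target, masked_ids=[1, 2, 3])
print(loss)
```

Here the only error among masked patches is 0.5 on one pixel of patch 1, so the loss is 0.25 averaged over six masked pixels.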

Why is the encoder-decoder asymmetric?

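The encoder processes only the visible patches (about 25% of the image at a 75% mask ratio), which greatly reduces computation and memory and makes a large encoder affordable. The decoder processes the full set of tokens, that is, the encoded visible patches plus mask tokens, so it is kept lightweight to limit its cost. After pre-training, the decoder is discarded and only the encoder is used for downstream tasks. This design gives a better efficiency-accuracy tradeoff than a symmetric architecture that runs a full-size network over every patch.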
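A rough back-of-the-envelope estimate shows why feeding only visible patches to the encoder saves so much. The sketch below assumes 196 patches per image and uses two simplified cost models: cost linear in sequence length (roughly the per-token layers) and cost quadratic in sequence length (roughly self-attention). Both helper names are hypothetical, and constant factors and the decoder are ignored.

```python
def encoder_token_count(num_patches, mask_ratio):
    """Number of patches the encoder processes (hypothetical helper)."""
    return int(num_patches * (1 - mask_ratio))

def relative_cost(num_tokens, full_tokens, power):
    """Cost relative to processing every patch, for cost ~ length ** power."""
    return (num_tokens / full_tokens) ** power

full = 196  # assumed: 224x224 image, 16x16 patches
visible = encoder_token_count(full, mask_ratio=0.75)

linear = relative_cost(visible, full, power=1)     # per-token layers
quadratic = relative_cost(visible, full, power=2)  # self-attention
print(visible, linear, quadratic)  # 49 0.25 0.0625
```

Under these assumptions the encoder does about a quarter of the per-token work and about one sixteenth of the attention work compared with encoding all patches.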

What is the right mask ratio?

For natural images, mask ratios of 60-75% work best. Very high ratios (80% and above) make the task harder but can leave too little visible context to reconstruct the image. Low ratios (30-50%) make the task too easy, since missing patches can be filled in from nearby pixels, and the learned features are less discriminative. The optimal value depends on the image domain; medical images with sparse structures, for example, may need lower ratios.
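To get a feel for these ranges, it helps to count how many patches remain visible at each ratio. The sketch below assumes a 224x224 image with 16x16 patches and truncates the visible count to an integer; the helper name `visible_patches` and the chosen ratios are illustrative only.

```python
def visible_patches(image_size, patch_size, mask_ratio):
    """Patches left visible after masking (hypothetical helper)."""
    per_side = image_size // patch_size
    total = per_side * per_side
    return int(total * (1 - mask_ratio))

# Sweep a few candidate ratios for an assumed 224x224 image, 16x16 patches.
ratios = (0.30, 0.50, 0.60, 0.75, 0.80, 0.90)
counts = {r: visible_patches(224, 16, r) for r in ratios}
for r, n in counts.items():
    print(f"mask ratio {r:.2f}: {n} of 196 patches visible")
```

At a 75% ratio only 49 of 196 patches remain, compared with 98 at 50%, which illustrates how quickly the visible context shrinks as the ratio rises.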

Sources

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16000-16009). DOI: 10.1109/CVPR52688.2022.01553

How to cite this page

ScholarGate. (2026, June 3). Masked Autoencoders are Scalable Vision Learners. ScholarGate. https://scholargate.app/en/deep-learning/masked-autoencoders

