# MAE
## Abstract
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3× or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior.
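The two designs above meet in the masking step: the image is split into patches, a random 75% of them are dropped, and only the visible 25% are passed to the encoder. The snippet below is a minimal PyTorch sketch of such per-sample random masking; `random_masking` is a hypothetical helper written for illustration, not the reference implementation.

```python
import torch

def random_masking(x: torch.Tensor, mask_ratio: float = 0.75):
    """x: patch embeddings, shape (batch, num_patches, dim)."""
    B, N, D = x.shape
    num_keep = int(N * (1 - mask_ratio))

    # Per-sample random permutation of patch indices via noise + argsort.
    noise = torch.rand(B, N, device=x.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_keep = ids_shuffle[:, :num_keep]

    # Gather the visible patches; only these go through the encoder.
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Binary mask (1 = masked) in original patch order, for the decoder/loss.
    mask = torch.ones(B, N, device=x.device)
    mask.scatter_(1, ids_keep, 0)
    return x_visible, mask

x = torch.randn(2, 196, 768)     # e.g., 14x14 patches of a 224x224 image
x_vis, mask = random_masking(x)  # x_vis: (2, 49, 768); 75% of patches masked
```

Because the encoder never sees the masked tokens, its cost scales with only the visible 25% of patches, which accounts for much of the reported 3× or greater training speedup.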
## Models and Benchmarks
| Algorithm | Backbone | Epoch | Batch Size | Linear Eval (Top-1 %) | Fine-tuning (Top-1 %) | Pretrain Links | Linear Eval Links | Fine-tuning Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MAE | ViT-base | 300 | 4096 | 60.8 | 82.8 | config \| model \| log | config \| model \| log | config \| model \| log |
| MAE | ViT-base | 400 | 4096 | 62.5 | 83.3 | config \| model \| log | config \| model \| log | config \| model \| log |
| MAE | ViT-base | 800 | 4096 | 65.1 | 83.3 | config \| model \| log | config \| model \| log | config \| model \| log |
| MAE | ViT-base | 1600 | 4096 | 67.1 | 83.5 | config \| model \| log | config \| model \| log | config \| model \| log |
| MAE | ViT-large | 400 | 4096 | 70.7 | 85.2 | config \| model \| log | config \| model \| log | config \| model \| log |
| MAE | ViT-large | 800 | 4096 | 73.7 | 85.4 | config \| model \| log | config \| model \| log | config \| model \| log |
| MAE | ViT-large | 1600 | 4096 | 75.5 | 85.7 | config \| model \| log | config \| model \| log | config \| model \| log |
| MAE | ViT-huge-FT-224 | 1600 | 4096 | / | 86.9 | config \| model \| log | / | config \| model \| log |
| MAE | ViT-huge-FT-448 | 1600 | 4096 | / | 87.3 | config \| model \| log | / | config \| model \| log |
## Evaluating MAE on Detection and Segmentation
If you want to evaluate your model on detection or segmentation tasks, we provide a script to convert the model keys from MMClassification style to timm style:
```shell
# Convert checkpoint keys from MMClassification style to timm style.
# $src_ckpt: path to the pretrained checkpoint; $dst_ckpt: output path.
cd $MMSELFSUP
python tools/model_converters/mmcls2timm.py $src_ckpt $dst_ckpt
```
Then, using the converted checkpoint, you can evaluate your model on the detection task following Detectron2, and on the semantic segmentation task following this project. Alternatively, you can evaluate the unconverted checkpoint directly with MMSegmentation.
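As a quick sanity check that the conversion worked, the converted checkpoint should load into a timm ViT backbone with matching keys. Below is a minimal sketch, assuming a ViT-base backbone; the model name and checkpoint path are placeholders, not outputs the script is guaranteed to produce.

```python
import timm
import torch

# Hypothetical paths/names for illustration only.
backbone = timm.create_model('vit_base_patch16_224', pretrained=False)
state_dict = torch.load('mae_vit_base_timm.pth', map_location='cpu')

# strict=False tolerates keys that legitimately differ between the
# pretraining checkpoint and the downstream backbone (e.g., heads);
# inspect the returned lists to confirm the backbone weights matched.
missing, unexpected = backbone.load_state_dict(state_dict, strict=False)
print('missing keys:', missing)
print('unexpected keys:', unexpected)
```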
## Citation
```bibtex
@article{He2021MaskedAA,
  title={Masked Autoencoders Are Scalable Vision Learners},
  author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and
          Piotr Doll{\'a}r and Ross B. Girshick},
  journal={arXiv},
  year={2021}
}
```